
Assist with maintaining aggregate values in numerical tables
Open, Needs Triage, Public

Description

From: https://www.facebook.com/groups/wikipediaweekly/permalink/2735507239830423/

@Doc_James wrote:
"Keeping our COVID19 numbers uptodate is a pain. We need a "SUM" tool for tables within Wikimedia. https://en.wikipedia.org/wiki/Template:2019–20_coronavirus_outbreak_data"

Event Timeline

With T247877 closed, you can now see the sum and average when selecting multiple numeric cells in VE.

This would be a relatively straightforward task for a Lua script if the data were in some kind of machine-readable format. Filed T250065: Store tabular data in a format that's machine-readable and can be shared between wikis about that.

This depends on the data storage issue @eprodromou and @Tgr are working on.

Abit added a subscriber: kaldari.

Or something could be done for a few specific tables. @kaldari is looking into it (timeboxed to no longer than a day) and will report back.

> if the data were in some kind of machine-readable format

Yes, the hardest part of this is formatting and parsing numbers based on local formats, e.g. 1,000.00 vs 1.000,00

> Yes, the hardest part of this is formatting and parsing numbers based on local formats, e.g. 1,000.00 vs 1.000,00

Formatting seems like a solved problem (formatnum etc.). Parsing is hard in general; if the assumption is that the data is stored on English Wikipedia and reused by other wikis, so only the enwiki format needs to be understood, then it's manageable, I think.

Also Language::parseFormattedNumber could be exposed to Lua or whatever does the parsing. That should work assuming the numbers themselves are formatted with formatnum and not in some unreliable manual way.
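
A rough sketch of what that could look like from a module, assuming parseFormattedNumber is (or gets) exposed on Scribunto's mw.language objects alongside formatNum (the module and function names here are hypothetical):

```lang=lua
-- Sketch only: assumes lang:parseFormattedNumber() is available to Lua,
-- mirroring Language::parseFormattedNumber in PHP.
local p = {}

function p.sum(frame)
    local lang = mw.getContentLanguage()
    local total = 0
    for _, v in ipairs(frame.args) do
        -- Handles "1,000.00" on enwiki, "1.000,00" on a de-style wiki, etc.,
        -- as long as the value was produced by formatnum, not typed by hand.
        local n = lang:parseFormattedNumber(v)
        if n then
            total = total + n
        end
    end
    -- Re-format the result in the local convention.
    return lang:formatNum(total)
end

return p
```

Invoked as {{#invoke:NumberSum|sum|1,234|5,678}} (module name hypothetical), that should render as 6,912 on enwiki.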

@Doc_James - If the assumption is that the data is in Wikitext (and not JSON or something else), it seems like the best solution to this problem would be to generate the entire table from a single template, and have a Lua module calculate the totals based on the parameters passed to the template for each country. The big downside to this solution is that editors would no longer be able to use the VisualEditor table editor to edit the country data. And like Ed and Gergo mention above, dealing with number formatting is going to be a problem for any potential solution.
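
For illustration, a minimal sketch of that shape (all names hypothetical): the template passes one value per country, and the module renders both the rows and a computed total, so the sum can never drift out of sync with the data.

```lang=lua
-- Hypothetical Module:CovidDataTable, invoked from a template like
-- {{Covid data table|United States=126787|...}}.
local p = {}

function p.render(frame)
    local lang = mw.getContentLanguage()
    local args = frame:getParent().args
    local rows, total = {}, 0
    for country, value in pairs(args) do
        local n = tonumber(value)
        if n then
            total = total + n
            table.insert(rows, string.format('|-\n| %s || %s',
                country, lang:formatNum(n)))
        end
    end
    table.sort(rows) -- stable ordering by country name
    return '{| class="wikitable"\n! Country !! Cases\n'
        .. table.concat(rows, '\n')
        .. string.format('\n|-\n! Total !! %s\n|}', lang:formatNum(total))
end

return p
```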

A cleaner long-term solution would be to keep all the data in JSON, but that would require tackling T248897 and also building some kind of transcludable table output in either the JsonConfig or Graph extension.

Two other options:

  • Write a wikitable parser in Lua. (This sounds horrible but is actually only mildly horrible. Find the start and end of the table body (these don't change, so they could be shown by a marker comment), tokenize for |, ||, |-, <ref>, </ref>, <!--, -->, and process the stream with a simple state machine (in the table / in a comment / in a reference). There are a number of other states in wikitext, like inside a template or another extension tag, but it's reasonable to assume those won't ever be used in these tables.) For every wikitable, write a module which is just an invocation of the parser on that table. Use mw.loadData to load that module (that ensures parsing is only done once per request). Provide something along the lines of {{#invoke:tablefunctions|sum|title=Template:SomeDataTable|column=5}} that calls the appropriate module, gets the parsed table and sums up the data. (That can be safely used inside the table - it doesn't parse the table, so there's no recursion. A rough sketch follows after this list.)
  • Use a machine-readable format which has sane diffs (JSON, CSV, Lua table). Accept it won't be editable via VisualEditor; instead create some one-off editing interface on Toolforge with OAuth-based edits. There are probably free JavaScript libraries that provide spreadsheet-like behavior. References complicate things but can maybe be hacked in somehow.
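
For concreteness, here is a stripped-down sketch of the first option (the marker comments and function names are hypothetical). It only covers the plain |, ||, |- subset and strips <ref> and comment spans up front rather than running the full token state machine:

```lang=lua
-- Hypothetical helper for a per-table module loaded via mw.loadData.
local function parseWikitable(wikitext)
    -- The table body sits between marker comments that never change.
    local body = wikitext:match('<!%-%- TABLE START %-%->(.-)<!%-%- TABLE END %-%->')
    if not body then
        return nil
    end
    -- Drop references and comments so they don't pollute cell values.
    body = body:gsub('<ref[^>/]*>.-</ref>', '')
               :gsub('<ref[^>]*/>', '')
               :gsub('<!%-%-.-%-%->', '')
    local rows, current = {}, nil
    for line in body:gmatch('[^\r\n]+') do
        if line:match('^%s*|%-') then
            current = {}                  -- |- opens a new row
            table.insert(rows, current)
        elseif current and line:match('^%s*|') then
            -- One or more cells: strip the leading |, split on ||.
            for cell in (line:gsub('^%s*|', '', 1) .. '||'):gmatch('(.-)||') do
                table.insert(current, mw.text.trim(cell))
            end
        end
    end
    return rows
end

-- What {{#invoke:tablefunctions|sum|...|column=5}} would end up calling.
local function sumColumn(rows, col)
    local total = 0
    for _, row in ipairs(rows) do
        total = total + (tonumber(row[col]) or 0)
    end
    return total
end
```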

> The big downside to this solution is that editors would no longer be able to use the VisualEditor table editor to edit the country data.

> Use a machine-readable format which has sane diffs (JSON, CSV, Lua table). Accept it won't be editable via VisualEditor

This will push more content to become inaccessible to new contributors using VE (about 40% of contributors' first 100 edits are made with VE), which I don't think is a good thing.

> Write a wikitable parser in Lua.

Instead of re-implementing one, maybe this could be exposed to Lua by the parser?

> This will push more content to become inaccessible to new contributors using VE (about 40% of contributors' first 100 edits are made with VE), which I don't think is a good thing.

...unless T248897 creates an equally user-friendly table data editor that VE can redirect users to.

>> Use a machine-readable format which has sane diffs (JSON, CSV, Lua table). Accept it won't be editable via VisualEditor

> This will push more content to become inaccessible to new contributors using VE (about 40% of contributors' first 100 edits are made with VE), which I don't think is a good thing.

If there's a dedicated editing tool, it wouldn't really matter.

> ...unless T248897 creates an equally user-friendly table data editor that VE can redirect users to.

That's about editing JSON pages on Commons, though. Currently the data is in wikitables in templates on enwiki. At a minimum, enwiki would have to enable tabular data pages, because enwiki editors are unlikely to buy into maintaining that data on a foreign wiki. And then we'd need a way to convert those JSON tables back into wikitables, and deal with granular source notations (cf. T250919: Add row/cell annotations to tabular data). A better JSON editor would be cool, but I doubt it would help with short-term COVID-19 data issues.

>> Write a wikitable parser in Lua.

> Instead of re-implementing one, maybe this could be exposed to Lua by the parser?

Fully parsing the page seems like a lot of overhead and could result in recursion, and parser code is not easy to interact with. I have considered exposing tables during the course of normal parsing as some kind of metadata (extension data in the parser cache, for example), but tables can contain complex wikitext (including other tables), so the parser probably couldn't do anything better than treating the contents as pure wikitext or pure HTML (and even that might not be easy, given its reliance on regular expressions). So maybe that would be doable for numbers, but references probably wouldn't survive the process.

> (This sounds horrible but is actually only mildly horrible. Find the start and end of the table body (these don't change, so they could be shown by a marker comment), tokenize for |, ||, |-, <ref>, </ref>, <!--, -->, and process the stream with a simple state machine (in the table / in a comment / in a reference). There are a number of other states in wikitext, like inside a template or another extension tag, but it's reasonable to assume those won't ever be used in these tables.)

I'm concerned that there are enough edge cases in common table usage that it'd get difficult for us. Most difficult for this: tables can have cells with weird colspans/rowspans, or can be assembled from templates that provide entire rows/cells.

From a sum perspective, there are also lots of tables where the values aren't in a convenient format for us to sum up. E.g. in the linked coronavirus table, the numbers are actually {{formatnum:{{Sum|126787<!-- US overall aggregate -->|0<!-- Puerto Rico -->|-128<!-- Guam -->|-12<!-- Northern Mariana Islands -->|-51<!-- U.S. Virgin Islands -->}}}}. Again, the Lua parser would have to get pretty smart about templates in order to do something useful here.
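
To make that concrete: even special-casing exactly this one pattern takes deliberate work, and the hypothetical sketch below fails on anything else (nested templates inside Sum, named parameters, different capitalization):

```lang=lua
-- Illustration only: extract and evaluate the exact
-- {{formatnum:{{Sum|...}}}} shape quoted above; anything else breaks it.
local function evalSumCell(cell)
    local inner = cell:match('{{formatnum:{{Sum|(.-)}}}}')
    if not inner then
        return tonumber(cell) -- plain unformatted number, or nil
    end
    inner = inner:gsub('<!%-%-.-%-%->', '') -- drop the per-territory comments
    local total = 0
    for part in inner:gmatch('[^|]+') do
        total = total + (tonumber(part) or 0)
    end
    return total
end
```

For the quoted cell this would return 126596, but that is the full extent of what it can handle.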

Obviously, we can say "this tool works on very tightly defined tables, where you can't use lots of common features, including {{formatnum}}"... but if we're restricting it that much then there's not really much of an advantage to sticking with wikitext rather than just going to JSON and having a structured editor there.

> I'm concerned that there are enough edge cases in common table usage that it'd get difficult for us. Most difficult for this: tables can have cells with weird colspans/rowspans, or can be assembled from templates that provide entire rows/cells.

Colspan/rowspan is unusual in data tables. You are right about templates (some other tables I have checked had a relatively straightforward data format, though). Preprocessing the wikitext is a way to deal with that, but it makes parsing much more expensive and also opens up the possibility of recursion.

> Obviously, we can say "this tool works on very tightly defined tables, where you can't use lots of common features, including {{formatnum}}"... but if we're restricting it that much then there's not really much of an advantage to sticking with wikitext rather than just going to JSON and having a structured editor there.

The advantage is that you don't need to convince hundreds of editors to adopt a new workflow involving a tool that has nowhere near feature parity. JSON is probably the sane long-term solution; I'm just skeptical about it being feasible in the short term.

Tgr removed Tgr as the assignee of this task. Aug 23 2022, 6:01 AM