tl;dr
Wikidata is a decent place for storing ontology-type data (i.e. key facts about somewhat notable entities), but we don't have a good place for storing very specific and detailed data (such as time series).
Problem statement
The COVID-19-related English Wikipedia articles contain a variety of very detailed data (mostly, but not exclusively, time series):
- daily number of total cases / active cases / recovered / dead (globally or per country, sometimes below the country level too)
- daily number of tests (globally or per country)
- start/end date of various restrictions like lockdowns
- number of students affected by school closures (per country)
These also tend to come with very detailed sourcing (i.e. different data points come from different sources, sometimes contradicting each other or accompanied by commentary).
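To make that concrete, here is a hypothetical sketch (in Lua, since that is the scripting language available in wikitext) of what a single such data point might look like as structured data; the schema, field names and values are all invented for illustration:

```lua
-- A hypothetical sketch (not an existing schema) of one data point
-- with per-value sourcing. All values are placeholders.
local dataPoint = {
    date = '2020-04-01',
    country = 'Examplestan',
    confirmedCases = 1234,
    deaths = 56,
    -- Each value can carry its own source, which is what makes this
    -- data hard to fit into a simple table or a Wikidata statement.
    sources = {
        confirmedCases = 'https://example.org/health-ministry-daily-report',
        deaths = 'https://example.org/who-situation-report',
    },
    note = 'Reporting criteria changed on this date.',
}
```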
Currently these are handled with free-form, hand-maintained templates containing wikitables that are inserted directly into the relevant articles (e.g. current case count by country, daily case counts by country, US states cases, UK region cases, quarantine times, number of students affected by school closures), plus further hand-maintained templates that expose the same data in a different format (current stats). There are also some machine-updated templates that mirror data from official sources, used for visualization (case count maps).
This has some benefits:
- Everything is on English Wikipedia, so lines of authority are clear, and the quality is high since sourcing, data quality, and dealing with vandalism and disinformation are core competencies of enwiki.
- It is reasonably easy, starting from an article, to find out how to edit the data.
- Editing is somewhat sane, with VisualEditor being available for making changes. ("Somewhat" because the tables are large and VE becomes sluggish; but it's still better than trying to find your way in hundreds of lines of raw text, and it's an interface editors are already familiar with.)
But it has lots of disadvantages too:
- Diffs are not great. Visual diffs are completely broken (presumably that's T211897: Visual Diffs: Improve table diffing - the diff calculation times out, so VE just shows the table with no changes, even for structurally simple edits that only change a cell value), and text diffs in a huge table are just not terribly helpful (example).
- The data is not machine-readable (the tables can probably be scraped with some effort, but even that's terribly fragile).
- The data is not available on other Wikipedias, so they can't easily benefit from all the hard work of enwiki editors, and on many of them the data is significantly outdated.
- The data cannot be processed by wikitext logic (such as Lua modules), leading to maintenance problems like the difficulty of keeping row/column totals in sync (T247875: Assist with maintaining aggregate values in numerical tables); see the sketch after this list.
- Turning the data into graphs or charts is an entirely manual effort (see e.g. T249127: Create regularly updated maps of COVID-19 outbreak). That is a significant burden for enwiki editors, carries a large opportunity cost (most potential illustrations simply never happen for lack of capacity to automate or manually create them), and is a further pain point for cross-wiki reuse, since the graphs that do get made are usually not translatable.
- The data is not available outside Wikipedia, e.g. to people who want to build dashboards.
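To illustrate the aggregation point above: if the data were available to Lua as structured rows instead of hand-edited table cells, totals could be computed instead of maintained by hand. A minimal sketch, assuming a hypothetical `rows` table with `cases`/`recoveries`/`deaths` fields (not an existing dataset or module):

```lua
-- A minimal sketch of how a Lua module could derive aggregate rows
-- automatically, assuming per-country counts were available as
-- structured data. The `rows` shape is hypothetical.
local p = {}

local function sumField(rows, field)
    local total = 0
    for _, row in ipairs(rows) do
        total = total + (row[field] or 0)
    end
    return total
end

function p.totalsRow(rows)
    -- Returns a wikitable row whose totals cannot drift out of sync
    -- with the per-country values, because they are computed.
    return string.format(
        '|-\n! Total\n| %d || %d || %d',
        sumField(rows, 'cases'),
        sumField(rows, 'recoveries'),
        sumField(rows, 'deaths')
    )
end

return p
```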
It's worth considering how we can improve this situation, both in the short term for COVID-19-related efforts, and in the long term more generally.
Acceptance criteria
Have at least one significant COVID-19-related data table that is machine-readable, can be accessed on any Wikimedia wiki via some functionality integrated into wikitext (such as Lua), receives regular updates, and does not cause distress to the editor community.
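For illustration, one existing mechanism that comes close is tabular data on Commons (the Data: namespace, provided by the JsonConfig extension), which Lua modules on any wiki can read via mw.ext.data.get. A minimal sketch; the dataset name, module structure and column order are assumptions, not an existing dataset:

```lua
-- A minimal sketch (not a finished design) of reading a shared
-- Commons tabular dataset from Lua. The dataset name and column
-- order are hypothetical.
local p = {}

function p.latestCaseCount(frame)
    local country = frame.args[1]
    -- mw.ext.data.get returns the parsed .tab page: a schema plus a
    -- `data` array of rows, each row being an array of field values.
    local tab = mw.ext.data.get('COVID-19 case counts.tab')
    local latest
    for _, row in ipairs(tab.data) do
        -- Assumed column order: date, country, confirmed cases.
        if row[2] == country then
            latest = row  -- rows are assumed to be in date order
        end
    end
    if latest == nil then
        return 'no data'
    end
    return string.format('%s: %d cases as of %s',
        country, latest[3], latest[1])
end

return p
```

A wiki could then invoke this from wikitext as something like {{#invoke:CovidData|latestCaseCount|Hungary}} (module name hypothetical). Tabular data does not by itself solve the per-cell sourcing and community-ownership questions above, but it already covers the machine-readable, cross-wiki and wikitext-integration parts of the criteria.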