
Support external tabular datasets in WDQS
Open, Medium, Public

Description

What would be the best way to integrate WDQS with Commons tabular data, as well as other CSV sources? For example, if a large dataset provider publishes CSV or TSV files and WDQS wants to federate with it, we could do something like this (the example should be adjusted as we flesh out the exact interface):

SELECT * WHERE {
  # This is a well-known public data source for stock quotes - Tesla daily, first lines:
  # Date	Open	High	Low	Close	Volume	Ex-Dividend	Split Ratio	Adj. Open	Adj. High	Adj. Low	Adj. Close	Adj. Volume
  # 2017-11-24	313.79	316.41	311	315.55	3242220	0	1	313.79	316.41	311	315.55	3242220
  SERVICE wikibase:tabular {
    # Data location
    bd:serviceParam wikibase:url <https://www.quandl.com/api/v3/datasets/WIKI/TSLA.csv> .

    # CSVFormat constant, e.g. EXCEL, MYSQL, RFC4180, TDF. Default = 'DEFAULT'
    bd:serviceParam wikibase:csvFormat 'DEFAULT' .
    # If true, treat the first row as header. Default - depends on csvFormat
    bd:serviceParam wikibase:firstRowIsHeader true .
    # If true, use tabular:<column_name>, otherwise use tabular:<column_number> (1-based)
    # By default, this value is the same as firstRowIsHeader
    bd:serviceParam wikibase:csvColumnByName true .

    # Parse columns into variables by their name
    ?date    tabular:Date  'date:yyyy-mm-dd' .  # parse as date
    ?dateStr tabular:Date  'string' .           # unparsed date value
    ?close   tabular:Close 'double' .           # parse as double
  }

  # Extract a single date/close value
  FILTER ( ?date = "2017-11-24T00:00:00Z"^^xsd:dateTime )
}

Expected result:

?date      | ?dateStr   | ?close
2017-11-24 | 2017-11-24 | 315.55

Event Timeline

I don’t think we should allow integration with datasets in arbitrary URLs, for the same reasons that we don’t allow federation to arbitrary SPARQL endpoints (DoSing a website from WMF’s servers, license issues, …). Let’s restrict this to tabular data on Commons. (That also resolves the need for supporting different CSV formats.)

@Lucas_Werkmeister_WMDE I agree - I am planning to implement this feature for both WDQS and Sophox QS. For WDQS, it should only support Commons tabular datasets, or possibly other trusted sources.

The first version of this feature has been implemented in Sophox -- see docs. At this point, it supports any GET request that returns CSV-style data (parsable by Java's CSVParser, with many parameters).

If @Smalyshev has any spare time to review the code at https://github.com/nyurik/wikidata-query-rdf/tree/tabular , I will try to port it to support .tab pages as well, with a slightly different set of input parameters.

Some thoughts on the implementation:

Parameter matching seems to be backwards:

?url tabular:url 'uri' .
?type tabular:type 'string' .

If I understand it right, 'uri', 'string', 'integer' etc. are types. As such, there's a limited set of them and they should be predicates, while the column names - which are user-defined - should be strings. I.e.:

?url wikibase:csvUri "url" .
?type wikibase:csvString "type" .

etc.

Not sure about using wikibase: prefix - this data is not exactly related to wikibase. Maybe some other prefix is better.

About the code:

  • Not sure why it uses MultiSearchIterator - there doesn't seem to be any usage of external bindings? Unless it is allowed to use a variable in the binding URL? Is that necessary? I would imagine most usage would be with a constant URL, am I wrong?
  • Same goes for all the query parameters - they all accept bindings. Do we really need to make them all variable? Is there a use case for it?
  • otQuery probably shouldn't be in generic WDQS code, so if we merge this into WDQS code, we need to see how it can be supported.
  • I think the URL should be checked against whitelist.txt. We could have a separate whitelist, but I think that would just complicate matters.
  • Would be nice to get it as a patch to WDQS in gerrit (sans OSM-specific code)

@Smalyshev the reason I made the type a string is to allow additional parsing parameters, e.g. ?start tabular:startDate 'date:yyyy-mm-dd'

  • MultiSearchIterator and binding params

Correct, most usages would be static, but in theory it might be possible to supply the URL or other parsing params depending on some dynamic calculation and other data, right? It's not a must-have requirement, but if there are no performance or other major disadvantages, I think it's better to support both?

  • I am not sure what would be the best way to split WDQS and OSM code, yet package it together. Any suggestions?
  • whitelist - agree; if a whitelist exists, I think it should be used.

It's not a must-have requirement, but if there are no performance or other major disadvantages,

Complexity is one. I'm a big fan of YAGNI :) I'm not 100% against supporting it, but doubtful whether it's actually needed.

I am not sure what would be the best way to split WDQS and OSM code

Since context controls the link between the service URI and the class, we could have two classes: one implementing the base API and one extending it. The WDQS part would contain only the base class, and the OSM part would contain the extended class and register it as the URI implementation instead of the base one. Not sure how easy this is to do, but it looks possible (without looking deeply into the code).

Smalyshev changed the task status from Open to Stalled. Dec 13 2018, 6:21 PM
Smalyshev triaged this task as Medium priority.

Just a bit of info on why I've nominated this for the 2020 hackathon:
There is currently a major issue with storing statistical data in Wikidata, which would be solved if we could upload the data to Commons as Tabular Data files.
It's proving unsustainable to use qualifiers to break down statistical data by date, let alone using any further breakdown (e.g. gender or age group). Items which need to store such data (e.g. countries) end up far too cluttered with statements, so we simply have to choose not to keep historical data or any further breakdown.
We can currently store the tabular data on Commons, but without being able to query it alongside data in Wikidata it has only a tiny fraction of its value. Using UNESCO and World Bank data, for example, we would be able to generate charts like "Percentage of primary-age girls out of school in Algeria since 1970". Obviously that is just one example in one domain; the possibilities are truly endless!!
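
To make that concrete, here is a rough sketch of what such a query could look like with the wikibase:tabular interface drafted in the task description. The Commons page name and the column names are hypothetical placeholders, and how a Commons .tab page would actually be addressed (raw CSV URL vs. page title) is still an open question:

SELECT ?year ?rate WHERE {
  SERVICE wikibase:tabular {
    # Hypothetical Commons tabular dataset with UNESCO out-of-school rates
    bd:serviceParam wikibase:url <https://commons.wikimedia.org/wiki/Data:UNESCO_out-of-school_girls.tab> .
    bd:serviceParam wikibase:firstRowIsHeader true .
    ?country tabular:Country 'string' .
    ?year    tabular:Year    'integer' .
    ?rate    tabular:Rate    'double' .
  }
  FILTER ( ?country = "Algeria" && ?year >= 1970 )
}
ORDER BY ?year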

@NavinoEvans I agree - feel free to take my implementation (which was already working for any CSV-style inputs) and extend/adapt it. Ideally, it should be merged upstream into Blazegraph, so it should support any kind of CSV. It may make sense to have either some sort of wrapper for the tabular datasets as an extension to Blazegraph, or alternatively to extend JsonConfig's API to serve CSV directly (which might be a better solution, as it would allow other, non-Blazegraph usages).

Aklapper changed the task status from Stalled to Open. Nov 8 2020, 2:10 PM

The previous comments don't explain who or what (task?) exactly this task is stalled on ("If a report is waiting for further input (e.g. from its reporter or a third party) and can currently not be acted on"). Hence resetting task status, as tasks should not be stalled (and then potentially forgotten) for years for unclear reasons.

(Smallprint, as general orientation for task management:
  • If you wanted to express that nobody is currently working on this task, then the assignee should be removed and/or the priority could be lowered instead.
  • If work on this task is blocked by another task, then that other task should be added via Edit Related Tasks... > Edit Subtasks.
  • If this task is stalled on an upstream project, then the Upstream tag should be added.
  • If this task requires info from the task reporter, then there should be instructions about which info is needed.
  • If this task needs retesting, then the TestMe tag should be added.
  • If this task is out of scope and nobody should ever work on it, or nobody else managed to reproduce the situation described here, then it should have the "Declined" status.
  • If the task is valid but should not appear on some team's workboard, then the team project tag should be removed while the task has another active project tag.)

I don’t think we should allow integration with datasets in arbitrary URLs

Since it is probably safer to have another service (one that individuals can deploy for their own needs) access the tabular data and then federate with WDQS, here is an example of just that:

https://github.com/justin2004/weblog/tree/master/blend_google_sheet_with_wikidata
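
For reference, the general shape of that approach is standard SPARQL 1.1 federation: the query runs on the self-hosted service that already exposes the spreadsheet/CSV as triples, and it federates out to WDQS for the Wikidata half of the join. A minimal sketch, with made-up predicate names and an assumed column holding Wikidata QIDs:

PREFIX wd:   <http://www.wikidata.org/entity/>
PREFIX wdt:  <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Runs on the self-hosted endpoint holding the tabular data; predicates are illustrative only.
SELECT ?country ?countryLabel ?value WHERE {
  ?row <https://example.org/col/country_qid> ?qid ;
       <https://example.org/col/value>       ?value .
  BIND ( IRI(CONCAT("http://www.wikidata.org/entity/", ?qid)) AS ?country )

  # Federate out to WDQS for the Wikidata side of the join
  SERVICE <https://query.wikidata.org/sparql> {
    ?country wdt:P31 wd:Q6256 ;          # instance of: country
             rdfs:label ?countryLabel .
    FILTER ( LANG(?countryLabel) = "en" )
  }
}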

More and more people are working on data visualizations, and lots of data scientists would benefit from this. Currently, we can upload a relatively large .tab data file to Commons, and we can query for the items that produce this data, but we can't merge the two.

This project by Wikipedia Canada is a good example: https://meta.wikimedia.org/wiki/Projet_ECCC,_100_ans_de_données_météorologiques_en_accès_libre/en. They have uploaded all the weather data from all the meteorological stations, and we can search for things like meteorological stations that are open and less than 10 km from an airport, but we can't add the data from the .tab files to such a query (see the sketch below).

Global warming, demographic changes or pandemic related issues would be better served with this feature.
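
To make the weather-station example above concrete, here is a rough sketch that combines an existing WDQS geospatial query with the tabular service proposed in the task description. The Commons page name, the column names and the join key are hypothetical placeholders:

SELECT ?station ?date ?tempMax WHERE {
  # Part WDQS can already answer: items with coordinates within 10 km of an airport
  ?airport wdt:P31 wd:Q1248784 ;           # airport
           wdt:P625 ?airportLoc .
  SERVICE wikibase:around {
    ?station wdt:P625 ?stationLoc .
    bd:serviceParam wikibase:center ?airportLoc .
    bd:serviceParam wikibase:radius "10" .
  }
  # (a class filter restricting ?station to weather stations would go here)

  # The missing piece: join the ECCC .tab data hosted on Commons
  SERVICE wikibase:tabular {
    bd:serviceParam wikibase:url <https://commons.wikimedia.org/wiki/Data:ECCC/Daily_observations.tab> .
    ?stationId tabular:StationID 'string' .
    ?date      tabular:Date      'date:yyyy-mm-dd' .
    ?tempMax   tabular:MaxTemp   'double' .
  }
  # Some agreed identifier statement would be needed as the join key;
  # P528 (catalog code) is used here purely for illustration.
  ?station wdt:P528 ?stationId .
}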

What is blocking now previously cited solutions (@Justin0x2004 has one) go into WDQS? Thanks!

Note from Wikidata Data Re-use Days: @Mike_Peel mentioned that some Items are huge (e.g. 4.3MB for Q87483673). This is problematic! @Mahir256 noted that this task might be a solution.

Q87483673 is a great example of COVID-19 case data that currently isn’t a good fit for Wikidata statements. By comparison, these nine tables have been updating daily since the start of the pandemic. It’s far easier to update these tables than the corresponding Wikidata items. Wikidata prefers keeping outdated statements, but backdated numbers routinely get revised (months into the past), so handling this kind of data properly requires more than adding a statement every day. Unfortunately, the lack of querying functionality has limited the usefulness of these tables – basically, they’re only used for visualizations on Wikipedia that focus on a single table at a time.

The wikibase:tabular service mentioned in https://phabricator.wikimedia.org/T181319#3860648 sounds like it would be a great starting point. It was live on Sophox for a while but got disabled in 2018 because of divergence from the main codebase. Having access to both CSV and tabular-data JSON in SPARQL queries would be exciting.

Just to comment that the example came from https://www.wikidata.org/wiki/Special:LongPages - you can see the first page of these is either COVID or journal article items. It may be more a case of changing the approach of an editor or two on-wiki to avoid these situations arising than of requiring software changes, though.

Just to comment that the example came from https://www.wikidata.org/wiki/Special:LongPages - you can see the first page of these is either COVID or journal article items. It may be more a case of changing the approach of an editor or two on-wiki to avoid these situations arising than of requiring software changes, though.

That feels intuitive, but it's not a single-digit number of editors that want to add large amounts of tabular-style data to items. This is a recurring issue coming from many editors in the community, all trying to solve some real use case (sometimes only to be able to query for it, sometimes to be able to display it in some fashion on Wikipedia).

For COVID items, a workaround is splitting them into yearly items.

If it is running as a separate service and the data is queried via federation, then SPARQL Anything could be used.
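
For context, SPARQL Anything exposes non-RDF sources (CSV, JSON, spreadsheets, ...) through a SERVICE clause with an x-sparql-anything: URI, mapped onto its Facade-X model. A minimal sketch, assuming a SPARQL Anything-enabled endpoint, a CSV with a header row, and a placeholder URL; option and property names should be checked against the SPARQL Anything documentation:

PREFIX xyz: <http://sparql.xyz/facade-x/data/>

SELECT ?close WHERE {
  SERVICE <x-sparql-anything:location=https://example.org/TSLA.csv,csv.headers=true> {
    # With headers enabled, each CSV row becomes a node whose cells are xyz:<header> properties
    ?row xyz:Close ?close .
  }
}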

The OSM Sophox service, which is based on WDQS, has or had a SERVICE wikibase:tabular which appears to have provided this capability;
see https://wiki.openstreetmap.org/wiki/Sophox#External_Data_Sources
so there may already be code in existence that could be merged, or at least used as a basis.

EDIT: I see Yurik talked about this above, Dec 26 2017

I have asked one of the authors of Module:NUMBEROF about writing something similar that can look up rows and columns: https://fr.wikipedia.org/wiki/Discussion_utilisateur:Verdy_p#Using_something_like_NUMBEROF_to_query_.tab_data_at_Commons

Is there a way to categorize tabular data in Commons' "Data:" namespace?

For now all of these use a JSON format, which does not support comments in standard JSON (unless an extension is added and supported in Commons' Data namespace; such extensions exist and have other features, such as not requiring quotation marks around property names or string values that have the form of a valid identifier, or adding C/C++-style block and inline comments, and those comments could be stripped when the data is requested in standard JSON format).

A side note: when requesting data in JSON format, why does the output use so many newlines and indentation whitespace? That form may be useful when editing data in raw JSON syntax, but the wiki interface on Commons should preferably allow editing the data in a visual table, and when requesting JSON data from other sites or queries we never need so much whitespace; embedded comments can easily be stripped, since they can be regenerated automatically. As well, object properties should be stored and retrieved sorted by their property key, to canonicalize the format (the order of named keys is not significant); objects or arrays that do not contain any object or array items should not use newlines; and properties whose value is not an array or object should be listed before properties with object- or array-typed values (this also facilitates editing). But the description part has "fields" and "source" items and could contain additional properties for categories, and possibly a relevant subject QID (pointing to Wikidata).

  • Is the JSON schema used extensible (while keeping the data usable and translatable/adaptable across wikis and sites, e.g. by using Wikidata QIDs or other standardized identifiers, e.g. from ISO 3166 or other relevant national or technical standards that are also found in Wikidata)?
  • Can that additional categorisation data be used to sort all these files and present them in the MediaWiki UI (just like "source" is presented at the bottom of the generated web page)?
  • An alternative has been used for some data about the United Kingdom, by categorizing the associated "Data talk" page instead: is that the way to go? I'm not sure it allows cross-site reuse; talk pages are not meant to be used this way and can contain many unrelated categories or pieces of information, including personal info, maintenance/quality-assessment tags, project-related discussions, or tests, links and template transclusions for various tools, which are not meant to be general-purpose, may be hosted on external sites, and are not necessarily up to date. In addition, these talk pages are likely to be archived and updated frequently, possibly breaking the existing categorization if it sits at the bottom or anywhere in the middle of the talk page, and there is no standard page header to avoid that mix.

The absence of categories also means that such data on Commons is hard to search for (and therefore hard to assess or update when needed). This problem does not occur with any other files/media or pages, including JS/CSS pages, where categories are recognized at least inside their embedded comments.

But maybe it's time to think about making that metadata editable separately: we already have page properties (notably including the content model) that we should be able to use more easily and in a "structured" way (exactly what Semantic MediaWiki does).

The usual approach is to categorize the talk pages themselves, instead of the data pages.

Yes, but this adds many constraints on talk pages: e.g. archiving of discussions, or people posting incorrect category links in those talk pages, which then become hard to clean up if everything is mixed (forcing edits to all those discussions, especially if they contain template calls). The alternative would be to use other discussion systems, but LiquidThreads, for example, is no longer maintained and has its own problems.

We should think about real metadata as part of "Data:*" pages containing JSON. It should be possible to use a page format that contains both the JSON and the categories, using MediaWiki syntax; the JSON part would itself be embedded in and queryable from the wiki page. Another way would be to integrate a real metadata store for all pages (which could also hold categories, though that is difficult because of the way categories are inserted conditionally by MediaWiki templates, so they are still stored in the wiki content itself). A further alternative is Semantic MediaWiki, which implements such a separate store; if you don't want to include all of Semantic MediaWiki, maybe a compatible subset of it (so that it can still host its own extensions).
The last alternative is to use Wikidata for this store... but all of this is a huge architectural problem for the integration.

Why wasn't a system like the pair of "File:"/"Media:" namespaces used for "Data:"? After all, JSON data files are like "Media:" entries with their own media type, and the wiki page should be used for their description and categorization. A "Data:"/"Media:" namespace pair would have worked as well, reusing most of what we already do for all other media file types, and it could have made it possible to also use different schemas for JSON files, or schemas for other data formats (like CSV/TSV, XML, maybe even spreadsheet formats, Turtle, RDF triples, or unstructured/flat key stores).