- Affected components: WikibaseQualityConstraints (MediaWiki extension for wikidata.org), Wikidata Query Service.
- Engineer for initial implementation: Wikidata team (WMDE).
- Code steward: Wikidata team (WMDE).
Motivation
This RFC is a result of T204024: Store WikibaseQualityConstraint check data in persistent storage instead of in the cache; specifically, it responds to the request in T204024#4891344 to create an RFC.
Vocabulary:
- WikibaseQualityConstraints (WBQC; also referred to as "MediaWiki" below), a MediaWiki extension deployed on www.wikidata.org.
- Wikidata Query Service (WDQS, or Query service), the service at https://query.wikidata.org.
Current situation:
Constraint checks on Wikidata entities are performed on demand when users request them, with the Query service used to run some of the checks. Results of these constraint checks are cached in MediaWiki (WBQC) using Memcached with a default TTL of one day (86400 seconds).
The constraint checks are accessible via three methods:
- RDF action: https://www.wikidata.org/wiki/Q123?action=constraintsrdf
- Special page: https://www.wikidata.org/wiki/Special:ConstraintReport/Q123
- API: https://www.wikidata.org/w/api.php?action=wbcheckconstraints&id=Q123
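For concreteness, the entry points above can be exercised from a script as follows. This is only a sketch: the URLs come from this document, but the response shapes are not specified here (the comment about the API returning JSON keyed by entity ID is an assumption).

```python
import requests

entity = "Q123"

# API: constraint check results as JSON (assumed to be keyed by entity ID).
api = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={"action": "wbcheckconstraints", "id": entity, "format": "json"},
)
print(api.json().keys())

# RDF page action: RDF description of the stored constraint check results.
rdf = requests.get(
    f"https://www.wikidata.org/wiki/{entity}",
    params={"action": "constraintsrdf"},
)
print(rdf.status_code, rdf.headers.get("content-type"))
```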
The special page and the API can be used by users directly. The API is also called by client-side JavaScript whenever a logged-in user visits an entity page on www.wikidata.org; the JavaScript then displays the results on the entity page.
The RDF page action exists for use by WDQS and does not run the constraint checks itself; it only exposes an RDF description of the currently stored constraint check results for the entity.
The special page currently always re-runs the constraint checks via WDQS; it neither reads from nor writes to any cache.
The API only makes an internal request to WDQS if the constraint check data for the current entity is absent, expired, or out of date. When the API retrieves data from the cache, the WBQC extension has built-in logic to determine whether the stored result needs to be updated (e.g. because something in the dependency graph has changed).
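To make that freshness rule concrete, here is a minimal sketch, assuming each stored result records the revision IDs of the entities it depended on at check time. All names here (StoredCheckResult, get_latest_revision_id) are hypothetical; the real logic lives in the PHP extension.

```python
from dataclasses import dataclass, field

@dataclass
class StoredCheckResult:
    entity_id: str
    results: dict  # constraint check results for the entity
    # Revision IDs of every entity the checks depended on, at check time,
    # e.g. {"Q42": 123456}. This is the assumed "dependency graph" record.
    dependency_revisions: dict = field(default_factory=dict)

def get_latest_revision_id(entity_id: str) -> int:
    """Placeholder for a lookup against the live wiki."""
    raise NotImplementedError

def is_stale(stored: StoredCheckResult) -> bool:
    # The stored result must be regenerated if anything in its
    # dependency graph has been edited since the checks ran.
    return any(
        get_latest_revision_id(dep) != rev
        for dep, rev in stored.dependency_revisions.items()
    )
```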
We are in the process of rolling out a JobQueue job that will re-run constraint checks for an entity post-edit, rather than only on demand by a user (T204031). This way, results are more likely to be in the cache when the Query service requests them shortly after an edit. We could make the job emit some kind of event that tells the Query service to poll the API and ingest the new data (T201147).
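A minimal sketch of that post-edit flow, under the assumptions above: the job type name, the cache key layout, and the helper functions are all hypothetical, and the real job is a PHP JobQueue job, not Python.

```python
ONE_DAY = 86400  # current default TTL, in seconds

def run_constraint_checks(entity_id: str) -> dict:
    """Placeholder: run all constraint checks for one entity."""
    raise NotImplementedError

def on_entity_edit(entity_id: str, job_queue) -> None:
    # Enqueue the check run post-edit instead of waiting for a user request.
    job_queue.push({"type": "constraintsRunCheck", "entity": entity_id})

def constraint_check_job(entity_id: str, cache) -> None:
    # Runs asynchronously, so results are likely already cached
    # by the time the Query service asks for them.
    results = run_constraint_checks(entity_id)
    cache.set(f"wbqc:{entity_id}", results, ttl=ONE_DAY)
```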
Initial loading and any re-loading of data into WDQS will also require the ability to dump all constraint check results.
5,644 out of 5,767 properties on Wikidata currently have constraints that require a (cacheable) check to be executed. Of the roughly 54 million Items, 1.85 million have no statements, leaving about 52 million Items that do have statements and need constraint checks run. Constraint checks also run on Properties and Lexemes, but their numbers are negligible compared with Items.
Constraint checks on an item can take widely varying amounts of time to execute, depending on the constraints used. Full constraint checks are logged if they take longer than 5 seconds (INFO) or 55 seconds (WARNING), and the performance of all constraint checks is monitored on Grafana.
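For illustration, the slow-check logging thresholds above amount to something like the following sketch (the actual logging is done in the PHP extension; only the 5-second and 55-second thresholds come from this document):

```python
import logging
import time

logger = logging.getLogger("wbqc")

def timed_full_constraint_check(entity_id: str, run_checks) -> dict:
    start = time.monotonic()
    results = run_checks(entity_id)
    elapsed = time.monotonic() - start
    if elapsed > 55:
        logger.warning("full check for %s took %.1fs", entity_id, elapsed)
    elif elapsed > 5:
        logger.info("full check for %s took %.1fs", entity_id, elapsed)
    return results
```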
Some full constraint checks reach the current interactive PHP time limit while being generated for special pages or the API.
Problem statement:
Primary problem statement:
- Constraint check results need to be loaded into WDQS, but we do not currently have the results of all constraint checks for all Wikidata items stored anywhere.
Secondary problem statements:
- Generating constraint reports only when the user requests them leads to a bad user experience, as the user must wait for a prolonged amount of time.
- Users can flood the API with requests that generate constraint checks for entities, putting unnecessary load on the app servers.
Requirements
- Data can be persistently stored for every Wikidata entity (after every edit).
- Only the current state (not historical state) needs to be stored.
- Data can be stored from MediaWiki / Wikibase.
- Data can be retrieved from storage from MediaWiki / Wikibase.
- Storage can be dumped (probably via a MediaWiki maintenance script) into a file or set of files for WDQS loading; see the sketch after this list.
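As an illustration of the dump requirement, here is a minimal sketch, assuming the store exposes an iterator over all entities. The NDJSON output format and all names are hypothetical; a real dump for WDQS would more likely be RDF (as produced by the constraintsrdf action) and be written as a PHP maintenance script.

```python
import gzip
import json

def dump_constraint_checks(store, out_path="constraint-checks.ndjson.gz"):
    """Write one JSON line per entity so the results can be bulk-loaded."""
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for entity_id, results in store.iterate_all():  # assumed iterator
            out.write(json.dumps({"entity": entity_id, "checks": results}) + "\n")
```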
Exploration
Proposal
- Rather than defaulting to running constraint checks on a user's request, primarily pre-generate constraint check results post-edit using the job queue (T204031).
- Rather than storing constraint check results in Memcached, store them in a more permanent storage solution.
- When new constraint check results are stored, fire an event for WDQS to listen to, so that it can load the new constraint check data (a sketch of this storage-and-event flow follows this list).
- Support dumping constraint check data from the persistent storage to a file or set of files, for loading into WDQS.
- Use the same logic that currently exists to determine whether the stored constraint check data needs updating when it is retrieved.
- Possibly alter the special page: load from the cache? Show the timestamp of when the checks were run? Provide a button on the page to manually purge the stored checks and re-run them (getting the latest results)?
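A rough sketch of the proposed write path: persist only the current results per entity (no history, per the requirements) and notify WDQS. The table name, event name, and all function names are hypothetical, and the real implementation would be PHP inside WBQC.

```python
import json
import time

class ConstraintCheckStore:
    """One row per entity: only the latest results are kept (no history)."""

    def __init__(self, db):
        self.db = db

    def upsert(self, entity_id: str, revision_id: int, results: dict) -> None:
        self.db.execute(
            "REPLACE INTO wbqc_check_results"
            " (entity_id, revision_id, checked_at, results)"
            " VALUES (?, ?, ?, ?)",
            (entity_id, revision_id, int(time.time()), json.dumps(results)),
        )

def store_and_notify(store, event_stream, entity_id, revision_id, results):
    store.upsert(entity_id, revision_id, results)
    # WDQS would listen for this event and re-pull the entity's check data.
    event_stream.emit("constraint-checks-updated",
                      {"entity": entity_id, "revision": revision_id})
```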
Note: Even when constraint checks are run after every entity edit, the persistently stored data will slowly become out of date (and therefore so will the data stored by WDQS). The issue of one edit needing to trigger constraint checks on multiple entities is considered a separate issue and is not in the scope of this RFC.