Wikidata:WikiProject Limits of Wikidata
About
This WikiProject aims to bring together various strands of conversation that touch upon the limits of Wikidata, in both technical and social terms. The aim is not to duplicate existing documentation but to collect pointers to places where the limits for a given section are described or discussed. For background, see here.
Timeframe
While fundamental limits exist in nature, the technical and social limits discussed here are likely to shift over time, so any discussion of such limits has to come with some indication of an applicable timeframe. Since the Wikimedia community has used the year 2030 as a reference point for its Movement Strategy, we use it here as the default for projections into the future and contrast it with current values (which may be available via Wikidata's Grafana dashboards). If other timeframes make more sense in specific contexts, please indicate that.
Design limits
"Design limits" are the limits which exist by intentional design of the infrastructure of our systems. As design choices, they have benefits and drawbacks. Such infrastructure limits are not necessarily problems to address and may instead be environmental conditions for using the Wikidata platform.
Software
Knowledge graphs in general
- List of deployments of large triple stores
- Largest entry in the list has 1.08 trillion triples
- Mentions Blazegraph but not Wikidata
- Too Much Information: Can AI Cope with Modern Knowledge Graphs? (Q64706055)
MediaWiki
maxlag
mw:Manual:Maxlag parameter, as explained here.
API requests specifying the maxlag parameter will see the wiki as read-only if any replica server is lagged by more than the parameter. (It’s customary for bots to specify maxlag=5 [seconds].)
Query service lag is factored into maxlag according to a scaling factor (wgWikidataOrgQueryServiceMaxLagFactor): in production the factor is 60 (as of 2024-08-08), i.e. a WDQS backend server lagged by five minutes counts like a database replica lagged by five seconds.
If a majority of database replicas are lagged by more than three seconds ($wgAPIMaxLagThreshold in production, as of 2024-08-08), all API requests will see the wiki as read-only (including all human edits to entities; Wikitext pages will remain editable).
If lag on a majority of replicas exceeds six seconds ('max lag' in the production database configuration, as of 2024-08-08), the wiki becomes fully read-only until replication catches up again.
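The thresholds above can be sketched as a small model. This is only an illustration of the rules as described on this page, not the MediaWiki implementation: the function names are hypothetical, the query service lag is taken as a single number, and combining it with replica lag by taking the larger value is an assumption.

```python
# Values quoted in the text above (production, as of 2024-08-08).
WDQS_LAG_FACTOR = 60         # wgWikidataOrgQueryServiceMaxLagFactor
API_READONLY_THRESHOLD = 3   # seconds, $wgAPIMaxLagThreshold
FULL_READONLY_THRESHOLD = 6  # seconds, 'max lag' in the database configuration


def effective_lag(replica_lags, wdqs_lag=0.0, factor=WDQS_LAG_FACTOR):
    """Lag in seconds that a maxlag request is compared against.

    Assumption: the worst replica lag and the scaled query service lag
    are combined by taking the larger of the two.
    """
    return max(max(replica_lags, default=0.0), wdqs_lag / factor)


def maxlag_rejects(maxlag, replica_lags, wdqs_lag=0.0):
    """True if a request sent with this maxlag value would be turned away."""
    return effective_lag(replica_lags, wdqs_lag) > maxlag


def wiki_state(replica_lags):
    """Classify the wiki by whether a majority of replicas exceed a threshold."""
    def majority_over(threshold):
        over = sum(1 for lag in replica_lags if lag > threshold)
        return over > len(replica_lags) / 2

    if majority_over(FULL_READONLY_THRESHOLD):
        return "read-only"
    if majority_over(API_READONLY_THRESHOLD):
        return "api-read-only"
    return "writable"
```

With the customary maxlag=5, a query service lag of exactly five minutes scales to five seconds and is not yet "more than" the parameter, so `maxlag_rejects(5, [0.5, 1.0], wdqs_lag=300)` is false, while one more second of lag tips it over.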
Page size
The maximum page size is controlled by $wgMaxArticleSize and maxSerializedEntitySize. It’s not clear which of them applies to entities. (They are set to 2 MiB and 3 MiB in production, respectively.)
Special:LongPages suggests the effective maximum page size is a bit above 4 MiB; the historical maximum page size can be seen on Grafana (though data before October 2021 has been lost).
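As a rough aid for anyone preparing a very large item, the following sketch compares the size of an entity's JSON serialization against the two configured values quoted above. The helper names are hypothetical, and it is an assumption that a compact `json.dumps` output approximates what Wikibase actually stores; since the effective maximum appears to be higher than either setting, the result says nothing definite about whether a save would succeed.

```python
import json

MIB = 1024 * 1024
MAX_ARTICLE_SIZE = 2 * MIB            # $wgMaxArticleSize, per the text above
MAX_SERIALIZED_ENTITY_SIZE = 3 * MIB  # maxSerializedEntitySize, per the text above


def serialized_size(entity):
    """Size in bytes of a compact UTF-8 JSON serialization of the entity."""
    text = json.dumps(entity, ensure_ascii=False, separators=(",", ":"))
    return len(text.encode("utf-8"))


def exceeded_limits(entity):
    """Names of the configured limits that the serialized entity exceeds."""
    size = serialized_size(entity)
    limits = {
        "wgMaxArticleSize": MAX_ARTICLE_SIZE,
        "maxSerializedEntitySize": MAX_SERIALIZED_ENTITY_SIZE,
    }
    return [name for name, limit in limits.items() if size > limit]
```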
Page load performance
mw:Wikimedia Performance Team/Page load performance, as explained here.
Wikibase
Generic Wikibase Repository
- By design, the repository stores statements that *could* be true. There is not yet a score that describes the validity or "common sense agreement" of a statement.
Data types
- Item
- Monolingual string
- Single value store, but no time-series for KPIs
Data formats
- JSON
- RDF
- etc.
Generic Wikibase Client
Wikidata's Wikibase Repository
Wikidata's Wikibase Client
Wikibase Repositories other than Wikidata
Wikibase Clients other than Wikidata
Wikidata bridge
Wikimedia wikis
Non-Wikimedia wikis
Wikidata Query Service
See also Future-proof WDQS.
- Wikidata Query Service — Workboard on Phabricator
- Limits of the Wikidata Query Service are being discussed in
- this email from 6 June 2019, which links to Wikidata query service/ScalingStrategy (a collection of notes from a meeting in February 2019) and to a Phabricator project.
- this email from 7 February 2020
- Wikidata:Query Service scaling update Aug 2021
Triple store
- See also en:Comparison of triplestores
Blazegraph
- List of open issues
- limited support for sharding
- GitHub repo
- not actively maintained
As of 2024-08-08, known major issues with Blazegraph include:
- Re-importing the Wikidata data from scratch (e.g. to a new server, or if the data file got corrupted) is extremely slow (on the order of weeks).
- Sometimes there are issues with Blazegraph allocators that require a restart of a server (and subsequently catching up with missed updates).
Virtuoso
JANUS
Apache Rya
Apache Rya (Q28915769), source code (no activity since dev 2020), manual
- it supports SPARQL queries and distributed storage using Apache Accumulo (Q63241426) which is built on Hadoop Distributed Filesystem (Q20072248).
- The Apache Software Foundation Announces Apache Rya as a Top-Level Project
- Would this be of interest to Wikidata?
Oxigraph
Frontend
Timeout limit
Queries to the Wikidata Query Service time out after a certain time, which is a parameter that can be set.
There are multiple related timeouts, e.g. a queryTimeout behind Blazegraph's SPARQL LOAD command or a timeout parameter for the WDQS GUI build job.
JavaScript
The default UI is heavy on JavaScript, and so are many customizations. As a result, pages that have lots of statements load more slowly or freeze the browser.
Python
Lua
SPARQL
Hardware
- See Wikidata Query Service/Implementation#Hardware
- Wikidata minimal hardware requirements for loading wikidata dump in Blazegraph, June 2019
- From Your own Wikidata Query Service, with no limits (part 1):
- "Firstly we need a machine to hold the data and do the needed processing. This blog post will use a “n1-highmem-16” (16 vCPUs, 104 GB memory) virtual machine on the Google Cloud Platform with 3 local SSDs held together with RAID 0."
- "This should provide us with enough fast storage to store the raw TTL data, munged TTL files (where extra triples are added) as well as the journal (JNL) file that the blazegraph query service uses to store its data."
- "This entire guide will work on any instance size with more than ~4GB memory and adequate disk space of any speed."
Functional limits
A "functional limit" exists when the system design encourages an activity, but engaging in that activity at a large scale exceeds the system's capacity to support it. For example, by design Wikidata encourages users to share data and make queries, but it cannot accommodate users doing a mass import of huge amounts of data or billions of quick queries.
A March 2019 report considered the extent to which various functions on Wikidata can scale with increased use - wikitech:WMDE/Wikidata/Scaling.
Wikidata editing
Edits by nature of account
Edits by human users
Manual edits
- ...
Tool-assisted edits
- ...
Edits by bots
- ...
Edits by nature of edit
Page creations
Page modifications
Page merges
Reverts
Page deletions
Edits by size
Edits by frequency
WDQS querying
A clear example of where we encounter problems is SPARQL queries against the WDQS that ask for things of some type (P31) and return a large number of hits, for example a query for all scholarly article titles. Queries that involve fewer items of that type do not typically cause these issues.
- Scaling Wikidata Query Service from Wikidata-l, 6 June 2019
Query timeout
This is a design limit discussed under #Timeout limit above. It manifests itself as an error when the query takes more time to run than the timeout limit allows for.
- Some queries time out and trigger error messages not useful to the end user. This one should give a diagram of citation counts over time, normalized by number of co-authors, as in this example.
Queries by usage
One-off or rarely used queries
Showcase queries
Maintenance queries
Constraint checks
- Some format constraint checks use the Wikidata Query Service, to the tune of tens to hundreds of thousands of regex tests per minute
Queries by user type
Manually run queries
Queries run through tools
- e.g. Scholia
- Some query embeds do not give any content when the query times out. This one should give a list of recent citations for a given organization, as in this example.
Queries run by bots
Queries by visualization
- Table
- Map
- Bubble chart
- Graph
- etc.
Multiple simultaneous queries
- If multiple SPARQL queries are issued from the same IP address (e.g. via a Scholia page) within a short period of time, then a "Rate limit exceeded" error occurs.
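A client that issues several queries in a row can reduce the chance of hitting this error by waiting before it retries. The following is a minimal sketch of such a wait calculation. It assumes that the rate limit error arrives as an HTTP 429 response, possibly with a `Retry-After` header given in seconds; the function name and the `base` and `cap` defaults are hypothetical.

```python
def retry_delay(status, headers, attempt, base=1.0, cap=60.0):
    """Seconds to wait before retrying, or None if no retry is needed.

    status  -- HTTP status code of the response
    headers -- mapping of response header names to values (exact-case lookup)
    attempt -- zero-based count of retries already made
    """
    if status != 429:  # assumption: rate limiting is signalled with 429
        return None
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            return min(float(retry_after), cap)
        except ValueError:
            pass  # e.g. an HTTP-date; fall back to exponential backoff
    # Exponential backoff: base, 2*base, 4*base, ... capped at `cap`.
    return min(base * 2 ** attempt, cap)
```

A tool that embeds many queries on one page could call this after each failed request and sleep for the returned number of seconds before trying again.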
Wikidata dumps
- See also wikitech:Dumps/WikidataDumps
Creating dumps
Using dumps
Ingesting dumps
Ingesting dumps into a Wikibase instance
Ingesting dumps into the Wikidata Toolkit
Updating Triple Store Content
- Lag of Wikidata Query Service servers on an hourly scale (see current data)
- Lag plot at a monthly scale
Creating large numbers of new items does not in itself seem to cause problems (except the aforementioned WDQS querying issue). However, there frequently is a lag between updates to the wiki pages of Wikidata and their propagation to the Wikidata Query Service servers.
Edits to large items
Performance issues
One bottleneck is the editing of existing Wikidata items with a lot of properties. The underlying issue here is that, for each edit, RDF for the full item is created and that the WDQS needs to update that full RDF. Therefore, independent of the size of the edit, edits on large items stress the system more than edits on small items. There is a Phabricator ticket to change how the WDQS triple store is updated.
Page size limits
Pages at the top of Special:LongPages are often at the size limit for a wiki page, which is set via $wgMaxArticleSize.
Merged QuickStatements edits
The current QuickStatements website is not always efficient in making edits: adding a statement with references can result in multiple edits. Because every one of those edits triggers a full update of the item, this behaviour makes the issue with edits to large items very visible.
Human engagement limits
"Human engagement limits" include everything to do with human ability and attention to engage in Wikidata. In general Wikidata is more successful when humans of diverse talent and ability enjoy putting more attention and time into their engagement with Wikidata.
Limits in this space include the number of contributors Wikidata has, how much time each one gives, and the capacity of Wikidata to invite more human participants to spend more time in the platform.
Wikidata users
Human users
Human Wikidata readers
Human Wikidata contributors
- Format is machine friendly but not human-friendly - hard for new editors to understand. Necessary to ensure that Wikidata brings in data that may not be already on the internet.
- Difficult for college classes/instructors to know how to organize mass contributions from their students, such as Wikidata_talk:WikiProject_Chemistry#Edits_from_University_of_Cambridge.
- Effective description of each type of entity requires guidance for the users who are entering a new item: What properties need to be used for each instance of tropical cyclone (Q8092)? How do we inform each user entering a new book item that they ought to create a version, edition or translation (Q3331189) and a written work (Q47461344) entity for that book (per Wikidata:WikiProject_Books). In other words, how do we make the interface self-documenting for unfamiliar users? And where we have failed to do so, how do we clean up well-intentioned but non-standard edits by hundreds or thousands of editors operating without a common framework?
Human curation
- Human curation of massive automated inputs of data - tool needed to ensure that data taken from large databases are reliable? Can we harness the power of human curators, who may identify different errors than machine-based checks?
Tools
Tools for reading Wikidata
Tools for contributing to Wikidata
Tools for curating Wikidata
- "Wikidata vandalism dashboard". Wikimedia Toolforge.
- "Author Disambiguator". Wikimedia Toolforge.
Bots
Bots that read Wikidata
Bots that contribute to Wikidata
Users of Wikidata client wikis
Users of Wikidata data dumps
Users of dynamic data from Wikidata
API
SPARQL
Linked Data Fragments
Other
Users of Wikibase repositories other than Wikidata
editContent limits
"Content limits" describe how much data Wikidata can meaningfully hold. Of special concern are limits on growth. Wikidata hosts a certain amount of content now, but limits on adding additional content impede the future development of the project.
A March 2019 report considered the rate of growth for Wikidata's content — wikitech:WMDE/Wikidata/Growth. A similar report was compiled in September 2024.
Generic
How many triples can we manage?
Wikidata Query Service (WDQS) is already experiencing stability issues related to graph size at the current (May 2024) number of triples in the graph (~16 billion). While there is no strict limit to the number of triples that Blazegraph can support, stability issues due to race conditions occur (see T263110). This is fundamentally a software issue that is unlikely to be fixed by more powerful hardware.
The failure mode we are experiencing is corruption of the Blazegraph journal, which causes the affected server to fail. This happens more often during data load, when the system is under more stress. When a failure occurs during data load, the process has to be restarted from scratch, leading to a reload time of more than 30 days.
Most of those limitations have been explained in past updates.
The WMF Search Platform team is currently working on splitting the WDQS Graph into multiple sub graphs to mitigate this risk.
How many languages should be supported?
How to link to individual statements?
Items
How many items should there be?
The Gaia project has so far released data on over 1.6 billion stars in our galaxy. It would be nice if Wikidata could handle that. OpenStreetMap has about 540 million "ways". The number of scientific papers and their authors is on the order of 100-200 million. The total number of books ever published is probably over 130 million. OpenCorporates lists over 170 million companies. en:CAS Registry Numbers have been assigned to over 200 million substances or sequences. There are over 100 large art museums in the world, each with hundreds of thousands of items in its collection, so there are likely at least tens of millions of art works or other artifacts that could be listed. According to en:Global biodiversity there may be as few as a few million or as many as a trillion species on Earth; on the low end we are already close, but if the real number is on the high end, could Wikidata handle it? Genealogical databases provide information on billions of deceased persons who have left some record of themselves; could we allow them all here?
From all these different sources, it seems likely there would be a demand for at least 1 billion items within the next decade or so; perhaps many times more than that.
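As a back-of-the-envelope check on that estimate, the lower-bound figures quoted above can simply be added up. This is only a tally under stated assumptions: species and genealogical records are left out because their ranges are too wide, 10 million is used as an assumed low end for "tens of millions" of art works, and the sum says nothing about whether all of these sources should be imported.

```python
# Lower-bound counts quoted in the paragraph above.
CANDIDATE_ITEMS = {
    "Gaia stars": 1_600_000_000,
    "OpenStreetMap ways": 540_000_000,
    "scientific papers and authors": 100_000_000,  # low end of 100-200 million
    "books": 130_000_000,
    "OpenCorporates companies": 170_000_000,
    "CAS Registry Numbers": 200_000_000,
    "art works and artifacts": 10_000_000,  # assumption: low end of "tens of millions"
}


def total_items(counts):
    """Sum of all candidate item counts."""
    return sum(counts.values())


def in_billions(n):
    """Express a count in billions."""
    return n / 1_000_000_000
```

Under these assumptions `in_billions(total_items(CANDIDATE_ITEMS))` comes to 2.75, which is consistent with the expectation of at least 1 billion items.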
How many statements should an item have?
- The top-listed items on Special:LongPages have over 5000 statements. This slows down operations like editing and display.
Properties
How many properties should there be?
How many statements should a property have?
Lexemes
How many lexemes should there be?
- English Wiktionary has about 6 million entries (see wikt:Wiktionary:Statistics); according to en:Wiktionary there are about 26 million entries across all the language variations. These numbers give a rough idea of potential scale; however they cannot be translated directly to expected lexeme counts due to the structural differences between Wikidata lexemes and Wiktionary entries. Lexemes have a single language, lexical category and (general) etymology, while Wiktionary entries depend only on spelling and include all languages, lexical categories and etymologies in a single page. On the other hand, each lexeme includes a variety of spellings due to the various forms associated with a single lexeme and spelling variations due to language varieties. Very roughly, then, one might expect the eventual number of lexemes in Wikidata to be on the order of 10 million, while the number of forms might be 10 times as large. The vast majority of lexemes will likely have only one sense, though common lexemes may have 10 or more senses, so the expected number of senses would be somewhere in between the number of lexemes and the number of forms, probably closer to the number of lexemes.
How many statements should a lexeme have?
So far there are only a handful of properties relevant for lexemes, in each case likely to have only one or a very small number of values for a given lexeme. So on the order of 1 to 10 statements per lexeme/form/sense seems to be expected. However, if we add more identifiers for dictionaries and link them, there's a possibility we may have a much larger number of external id links per lexeme in the long run - perhaps on the order of the number of dictionaries that have been published in each language?
References
How many references should there be?
How many references should a statement have?
Where should references be stored?
Subpages
editParticipants
The participants listed below can be notified using the following template in discussions: {{Ping project|Limits of Wikidata}}
- Daniel Mietchen (talk) 06:55, 13 May 2019 (UTC)
- Envlh (talk) 10:11, 13 May 2019 (UTC)
- Blue Rasberry (talk) 14:32, 13 May 2019 (UTC)
- ·addshore· talk to me! 20:24, 13 May 2019 (UTC)
- Nizil Shah (talk) 06:13, 14 May 2019 (UTC)
- Sannita - not just another it.wiki sysop 09:14, 14 May 2019 (UTC)
- Dhx1 (talk) 13:40, 30 May 2019 (UTC)
- Bbober
- Buccalon (talk) 16:05, 18 June 2019 (UTC)
- Egon Willighagen (talk) 06:36, 4 August 2019 (UTC)
- ArthurPSmith (talk) 13:06, 4 August 2019 (UTC)
- ChristianKl ❪✉❫ 14:20, 5 August 2019 (UTC)
- Jneubert (talk) 15:28, 19 August 2019 (UTC)
- --Tinker Bell ★ ♥ 05:25, 4 October 2019 (UTC)
- Mahir256 (talk) 04:21, 13 October 2019 (UTC)
- Pdehaye (talk) 09:52, 29 October 2019 (UTC)
- Peaceray (talk) 22:28, 9 November 2019 (UTC)
- Supertrinko (talk) 01:11, 23 June 2021 (UTC)
- Finn Årup Nielsen (fnielsen) (talk) 13:13, 9 August 2021 (UTC)
- So9q (talk) 08:39, 26 August 2021 (UTC)
- Mathieu Kappler (talk) 11:30, 6 September 2021 (UTC)
- Simon Cobb (User:Sic19 ; talk page) 17:24, 7 October 2021 (UTC)
- Mitar (talk) 10:09, 21 May 2022 (UTC)
- Simon Villeneuve (talk) 12:33, 4 August 2022 (UTC)
- Waldyrious (talk) 09:23, 21 November 2022 (UTC)
- Maxime
- Sj
- TiagoLubiana (talk) 22:02, 17 October 2023 (UTC)
- Luca.favorido
- Peter F. Patel-Schneider