Given that Wikidata currently grows by 3-10% a week, we need to make the Wikidata entity dumpers keep up with that growth.
The changes in batch size (4eedfb48e9fdc93eea13d9fd3bd341e66c1abfbc) and https://github.com/wmde/WikibaseDataModel/pull/762 will already ease some of the pain, but given the immense growth, they will probably barely offset four weeks of Wikidata growth.
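To put the growth figures in perspective, here is a rough calculation. It assumes the weekly rate stays constant and compounds, which is an assumption for illustration and not something measured in this task; the function name is made up.

```python
def compounded_growth(weekly_rate: float, weeks: int) -> float:
    """Return the total relative growth after `weeks` of compounding at `weekly_rate`."""
    return (1 + weekly_rate) ** weeks - 1


# Lower and upper bound of the 3-10% weekly growth quoted above.
for rate in (0.03, 0.10):
    total = compounded_growth(rate, 4)
    print(f"{rate:.0%} per week over 4 weeks -> {total:.1%} total growth")
# prints:
# 3% per week over 4 weeks -> 12.6% total growth
# 10% per week over 4 weeks -> 46.4% total growth
```

If dump time scaled roughly linearly with the number of entities (again an assumption), any speedup would need to be in that range just to stay level for four weeks.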
Possible things to do:
- Create a "master dump" (or some such) from which all other dumps can be derived (this will ease the load on the DBs, but will hardly reduce CPU time)
- Increase the number of runners further (from 5 currently) https://gerrit.wikimedia.org/r/383414
- Try to derive old dumps from new ones (not easy to do, and it is unclear how much there is to gain)
- Do more profiling and try to find more low-hanging fruit (like the examples above, or T157013)
- Switch away from PHP5 to PHP7 or HHVM (also see the related discussion at T172165)
- Find the right --batch-size (https://gerrit.wikimedia.org/r/384204)
- …
Patch-For-Review / TODOs:
- https://gerrit.wikimedia.org/r/380628
- https://github.com/wmde/WikibaseDataModel/pull/762
- https://github.com/wmde/WikibaseDataModel/pull/764
- Release https://github.com/wmde/WikibaseDataModel/pull/762 and https://github.com/wmde/WikibaseDataModel/pull/764. Note: Will be deployed in 1.31.0-wmf.5.
- https://gerrit.wikimedia.org/r/383414
- https://gerrit.wikimedia.org/r/384204
- Conclude and revert/alter https://gerrit.wikimedia.org/r/384204
- T178247: Use a retrieve only CachingEntityRevisionLookup for dumps
- Conclude T178247: Use a retrieve only CachingEntityRevisionLookup for dumps -> 63643dd1c751b5d671a7deae5b2d181275340a5b
- Deploy 56906993f95067ec156cf3412f2dabaefce282ad (will happen with 1.31.0-wmf.27 on 2018-03-28 / in early April) - Only dump a limited number of entities in a single dump script run and not all that match the current shard. T177550 T190513
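The idea behind the last item (each script run handles only a limited number of entities rather than everything in its shard) can be sketched roughly as follows. This is only an illustration: assigning entities to shards by numeric ID modulo the sharding factor is an assumption, and the names `shard_ids`, `limited_runs`, `sharding_factor`, `shard` and `limit` are hypothetical, not taken from the dump scripts.

```python
from typing import Iterable, Iterator, List


def shard_ids(ids: Iterable[int], sharding_factor: int, shard: int) -> Iterator[int]:
    """Yield the IDs that belong to `shard` (assumed rule: id modulo sharding factor)."""
    return (i for i in ids if i % sharding_factor == shard)


def limited_runs(
    ids: Iterable[int], sharding_factor: int, shard: int, limit: int
) -> Iterator[List[int]]:
    """Split one shard's IDs into consecutive runs of at most `limit` entities."""
    run: List[int] = []
    for entity_id in shard_ids(ids, sharding_factor, shard):
        run.append(entity_id)
        if len(run) == limit:
            yield run
            run = []
    if run:
        # Final, possibly shorter run.
        yield run


# Shard 1 of 4 over IDs 1..20, with at most 2 entities per script run.
print(list(limited_runs(range(1, 21), sharding_factor=4, shard=1, limit=2)))
# prints: [[1, 5], [9, 13], [17]]
```

Each inner list stands for one short-lived script invocation, so a failure or restart only costs one run and not the whole shard.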
I consider this task done when the dumps finish no later than mid-Thursday again and don't run well into the weekend.