
[Bug?] Can't find wikidatawiki on wmf.mediawiki_wikitext_history
Open, Needs Triage · Public · 5 Estimated Story Points

Description

Problem

The wikidatawiki project is not present in wmf.mediawiki_wikitext_history.

image.png (193×705 px, 18 KB)

The dump does seem to exist among the XML dumps, according to https://dumps.wikimedia.org/backup-index.html

We need this data to retrain/update the Wikidata Revert Risk model (T363718).

Proposed Solution

  • In https://phabricator.wikimedia.org/T357859 we skipped loading wikidata into the tables. We did this to speed up delivery of the other wikis (roughly a ~12-day improvement); this will stay the same.
  • The plan is to run a second loading job (same code) once the first job is complete, loading wikidata from a secondary loading list we will create (sketched below).
  • This will let us keep our delivery-time improvements in place for movement insights reporting while still loading wikidata into the tables.
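
For illustration, a rough sketch of what the two-pass loading could look like operationally; the file names and the loading-job invocation below are placeholders, not the actual refinery configuration:

# Hypothetical two-pass setup: the main run uses a wiki list without
# wikidatawiki, and a second run of the same loading code uses a
# secondary list containing only wikidatawiki once its dump is ready.
grep -v '^wikidatawiki$' all_wikis.txt > grouped_wikis_main.txt     # pass 1 input (name assumed)
echo 'wikidatawiki' > grouped_wikis_wikidata.txt                    # pass 2 input (name assumed)

./run_loading_job.sh --wiki-list grouped_wikis_main.txt             # existing behaviour, unchanged
./run_loading_job.sh --wiki-list grouped_wikis_wikidata.txt         # runs only once wikidata's dump has landed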

Event Timeline

@diego - we skipped loading wikidatawiki as part of this ticket:

https://phabricator.wikimedia.org/T357859

Let us know if there are any issues with using the alternative.

@lbowmaker if I understand correctly, there is no alternative for obtaining historical data for Wikidata edits? If that's the case, we can't keep the Wikidata Revert Risk model updated.

I believe @AndrewTavis_WMDE was running into similar issues for T362849 and there is T363451. Does that help any?

lbowmaker set the point value for this task to 5. (May 6 2024, 3:37 PM)

@lbowmaker the proposed solution sounds OK to me. I have two questions:

  • Approximately when do you think you would have a Wikidata dump that we can work with? We need this for deploying the Wikidata Revert Risk model, and it is our main blocker.
  • Similarly, could you give an estimate of the delay we should expect for each dump? Our goal is to retrain the model periodically, so it would be great to know the gap between when the data is produced and when it becomes available on the cluster.

Excellent! Don't forget to announce the plan first, just in case there is someone unexpectedly using the data; I recommend the working-with-data Slack channel and the analytics-announce mailing list.

FWIW, I think this step got skipped.

@diego - when we make the change it should be as timely as it was before we started skipping it.

I moved this to next up in our current work plans. I'll try to have a more exact date in the next few days.

@diego @nshahquinn-wmf - just chatted with @JAllemandou about this.

The job to load this data was originally scheduled for the 19th of the month. We were waiting ~10 days for wikidata, so we decided to skip it and run the job on the 7th instead.

However, Joseph noticed that the wikidata XML was delivered earlier last month (around the 8th) due to some recent changes on the dumps side.

It seems we can just add wikidata back in, and it would only cost us ~1 day of our recent time improvements.

Joseph will submit a patch soon to stop skipping wikidata and load the files. We will monitor the delivery of the files, and if wikidata starts to come late again, we will build the second loading job.

I'm sorry, what Luke described above can't be done; I messed up when looking at the wikidata dumps generation: I looked at a subset of files that were generated on the 8th, but forgot the rest of the files /facepalm/.
This means we need to build a parallel pipeline for the wikidata dump import, as originally planned.
Sorry for the false joy :S

@JAllemandou what's the impact of this as we can see right now? How long do we expect to have this built?

Hi @XiaoXiao-WMF,
the impact of this issue is that the Hive table wmf.mediawiki_wikitext_history currently does not contain the wikidatawiki project's data.
I sent a Slack message a while back asking if anyone was using this data, but made the mistake of only calling out Fabian, from the Research team.
I estimate this issue will take between one and two weeks to solve once someone starts working on it.

Just chiming in that it would be really really great to unblock @diego's work as the current model on Wikidata is not good enough to support the community in finding vandalism.

Change #1031897 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery@master] Add wikidatawiki grouped-wikis file

https://gerrit.wikimedia.org/r/1031897

Change #1032018 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery@master] Update script importing XML dumps onto HDFS

https://gerrit.wikimedia.org/r/1032018

Change #1032020 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery/source@master] Update MediawikiXMLDumpsConverter

https://gerrit.wikimedia.org/r/1032020

Change #1032018 merged by Joal:

[analytics/refinery@master] Update script importing XML dumps onto HDFS

https://gerrit.wikimedia.org/r/1032018

Change #1032020 merged by jenkins-bot:

[analytics/refinery/source@master] Update MediawikiXMLDumpsConverter

https://gerrit.wikimedia.org/r/1032020

Change #1031897 merged by Joal:

[analytics/refinery@master] Add wikidatawiki grouped-wikis file

https://gerrit.wikimedia.org/r/1031897

Change #1036614 had a related patch set uploaded (by Joal; author: Joal):

[operations/puppet@production] Add wikidata history-dumps import to hdfs job

https://gerrit.wikimedia.org/r/1036614

Change #1036614 merged by Brouberol:

[operations/puppet@production] Add wikidata history-dumps import to hdfs job

https://gerrit.wikimedia.org/r/1036614

Hi folks, I've been late in delivering this, but it's landing as I write.
The Spark job transforming the wikidata XML history for snapshot 2024-04 is currently running. I expect it to finish either today or tomorrow.
The Spark job is scheduled by a separate Airflow DAG and only computes the wikidata XML history, while the other DAG keeps running without wikidata so the other wikis' data arrives faster.
This month has seen errors in the dumps release process, so it's bad timing for tests, but hopefully things will settle and we'll have a proper run next month.
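
For anyone who wants to check on it, here is a rough sketch of confirming the wikidata-only run without touching the main DAG; the grep patterns are guesses, since I don't have the exact DAG or Spark application names in front of me:

# Illustrative checks only; DAG and Spark application names are assumed.
airflow dags list | grep -i wikitext                              # both the main and wikidata-only DAGs should show up
yarn application -list -appStates RUNNING | grep -i wikidata      # the running Spark conversion job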

Thank you for the efforts here, @JAllemandou! Really great to have this back, and glad that it worked out in a way where others are not adversely affected :)

The Spark job finished, and we have data (from Superset):

SELECT
    revision_id,
    revision_timestamp
FROM wmf.mediawiki_wikitext_history
WHERE snapshot = '2024-04'
    AND wiki_db = 'wikidatawiki'
LIMIT 10

revision_id	revision_timestamp
87789334	2013-11-19T14:32:48Z
402389	2012-11-11T15:07:37Z
384705035	2016-10-07T22:50:25Z
82084019	2013-10-28T00:39:33Z
142054144	2014-07-01T18:34:15Z
965664805	2019-06-20T15:09:19Z
239261205	2015-08-08T04:12:58Z
258998691	2015-10-17T00:18:48Z
790349650	2018-11-15T08:36:34Z
2059225311	2024-01-21T22:52:20Z

Hi! Apparently the data is missing again:

SELECT
    revision_id,
    revision_timestamp
FROM wmf.mediawiki_wikitext_history
WHERE  wiki_db = 'wikidatawiki'
LIMIT 10
OK
revision_id	revision_timestamp
Time taken: 2.753 seconds

The Airflow sensor timed out, but I never saw an alert for it (maybe it fired before this week). I cleared it and will report back here in a bit, after it has had a chance to run again.
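
For reference, clearing it looks roughly like this; the DAG and task ids below are placeholders rather than the real names:

# Clear the timed-out sensor so the scheduler picks it up again
# (dag_id, task_id, and date range are placeholders).
airflow tasks clear mediawiki_wikitext_history_load \
    -t 'wait_for_wikidata_dump' \
    -s 2024-06-01 -e 2024-07-01 --yes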

Ok, I dug into this a bit more. It looks like the job set up to import the dumps XML is running fine, but the status file says wikidatawiki is still in progress. Specifically, it says this:

"metahistorybz2dump": {"status": "in-progress", "updated": "2024-07-07 15:48:19"}

That updated timestamp is odd, but the job does look like it is indeed still running, and the timing isn't too far off: this is about how long it usually takes, plus a few days for this month's delays. So, now that the sensor is refreshed, once https://dumps.wikimedia.org/wikidatawiki/20240701/ shows DONE, check back, and if we still don't have data, ping us again.
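
A quick way to watch for that from the command line, sketched here; the top-level "jobs" key is assumed from the usual dumpstatus.json layout:

# Poll the public status file for the metahistorybz2dump job ("jobs" wrapper key assumed).
curl -s https://dumps.wikimedia.org/wikidatawiki/20240701/dumpstatus.json \
  | jq '.jobs.metahistorybz2dump'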

Oof, I just realized this is for the month BEFORE. I see that one is still in progress:

"metahistorybz2dump": {"status": "in-progress", "updated": "2024-06-18 15:16:15"

If two wikidata dumps are running at the same time, that's probably trouble, so I'm gonna do my best to check.

Ok, this ended up being very involved. I believe the root of all the confusion is that the dumps jobs assume the PREVIOUS dump finished and work only on the CURRENT dump. So we ran around the dumpsdata and snapshot hosts, hardcoding 20240701 wherever the jobs were looking for "latest", and we're not sure whether we broke anything. At the end of the day, we concluded that the snapshot1010 copy of the dump files looked good, and we just rsynced it over to the dumpsdata and clouddumps hosts.

The rsync service that runs ALSO assumes this "latest" thing, but only for the status files, not all files; as far as we could tell, everything was already rsynced except the status and html files. The monitor/html-generation service ALSO assumes "latest", so we weren't able to run it to generate the html even after trying to hack it, but the html files were already on the snapshot hosts, so we just moved those over with rsync too. The base rsync excludes json and html, so we just hacked it to include them instead.

So ultimately this is really all we did:
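# Mirror the finished wikidatawiki dump run from dumpsdata1006 to the other dumpsdata host and to both clouddumps hosts, skipping bad/save/temp dirs, in-progress files, and .txt files.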

dumpsgen@dumpsdata1006:$ /usr/bin/rsync -va --contimeout=600 --timeout=600 --exclude=**bad/ --exclude=**save/ --exclude=**not/ --exclude=**temp/ --exclude=**tmp/ --exclude=*.inprog --exclude=*.txt /data/xmldatadumps/public/wikidatawiki dumpsdata1007.eqiad.wmnet::data/xmldatadumps/public/
dumpsgen@dumpsdata1006:$ /usr/bin/rsync -va --contimeout=600 --timeout=600 --exclude=**bad/ --exclude=**save/ --exclude=**not/ --exclude=**temp/ --exclude=**tmp/ --exclude=*.inprog --exclude=*.txt /data/xmldatadumps/public/wikidatawiki clouddumps1001.wikimedia.org::data/xmldatadumps/public/
dumpsgen@dumpsdata1006:$ /usr/bin/rsync -va --contimeout=600 --timeout=600 --exclude=**bad/ --exclude=**save/ --exclude=**not/ --exclude=**temp/ --exclude=**tmp/ --exclude=*.inprog --exclude=*.txt /data/xmldatadumps/public/wikidatawiki clouddumps1002.wikimedia.org::data/xmldatadumps/public/

And now we have good status files and good html files. I'm not sure what else we're excluding there and if I mucked something else up in the process, but we didn't actually change any production stuff, just local copies. I'm gonna go try and kick some downstream jobs to see if I can get stuff to run now.

Manually running this in a screen on an-launcher1002:

sudo -u analytics kerberos-run-command analytics /usr/local/bin/refinery-import-wikidata-page-history-dumps

This will take a while to bring in about 3 TB of data, and then I'll kick off the Airflow jobs. All good though: things are flowing as I'd expect; it'll just take some time.
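
For anyone following along, watching the import and then kicking the downstream load looks roughly like this; the HDFS path and DAG id are assumptions, not confirmed values:

# Watch the raw import grow on HDFS (path assumed).
watch -n 600 'hdfs dfs -du -s -h /wmf/data/raw/mediawiki/dumps/wikidatawiki'

# Once the files have landed, re-trigger the downstream load (DAG id is a placeholder).
airflow dags trigger mediawiki_wikitext_history_load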

OK, as a final update here, the pipeline is:

  • The dumps run (this was slow the last couple of months and ran into the partial second run)
  • dumpstatus.json shows "done" (this happened on snapshot1010 but wasn't synced)
  • The XML is imported into HDFS (this never triggered because of the above, and will recur if wikidata finishes its dumps after the 20th of the month)
  • _SUCCESS_WIKIDATA is written
  • The Airflow job imports wikidata into wmf.mediawiki_wikitext_history

Right now we're at the last step: the job has been running for 9 hours, and it usually takes 12. Once it's done, you should be able to see the data in the table; a quick check like the one sketched below should return rows.
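
A minimal sanity check, assuming the relevant snapshot here is 2024-06:

# Should return a non-zero count once the load completes (snapshot value assumed).
hive -e "SELECT COUNT(*) FROM wmf.mediawiki_wikitext_history WHERE snapshot = '2024-06' AND wiki_db = 'wikidatawiki';"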

I'll let others resolve the task as needed, because ideally we'd make some follow-ups to guard against the hard-coded behavior I mentioned above.