
Wikidata entity dumpers stuck with 100% CPU on snapshot1007
Closed, Resolved · Public

Description

On snapshot1007:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME  COMMAND                                                                                                                                                                                                                                                                                                                                                                                                                                                     
60977 datasets  20   0 7195284 6.499g   3684 R 100.0 10.3   5777:51 php5 /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 1 --sharding-factor 6 --batch-size 1500 --format nt --flavor truthy-dump --no-cache                                                                                                                                                                                                                          
60983 datasets  20   0 7199524 6.502g   3676 R 100.0 10.4   5809:59 php5 /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 2 --sharding-factor 6 --batch-size 1500 --format nt --flavor truthy-dump --no-cache                                                                                                                                                                                                                          
60971 datasets  20   0 7199644 6.504g   3684 R  94.9 10.4   5821:09 php5 /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 0 --sharding-factor 6 --batch-size 1500 --format nt --flavor truthy-dump --no-cache                                                                                                                                                                                                                          
60980 datasets  20   0 7189028 6.491g   3684 R  94.9 10.3   5798:56 php5 /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 3 --sharding-factor 6 --batch-size 1500 --format nt --flavor truthy-dump --no-cache                                                                                                                                                                                                                          
60986 datasets  20   0 7198460 6.501g   3684 R  94.9 10.4   5784:49 php5 /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 4 --sharding-factor 6 --batch-size 1500 --format nt --flavor truthy-dump --no-cache                                                                                                                                                                                                                          
60989 datasets  20   0 7190612 6.493g   3684 R  94.9 10.3   5766:51 php5 /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 5 --sharding-factor 6 --batch-size 1500 --format nt --flavor truthy-dump --no-cache

They got pretty far before getting stuck:

$ tail -n1 dumpwikidatant-wikidata-20171122-truthy-BETA-*
==> dumpwikidatant-wikidata-20171122-truthy-BETA-0.log <==
Processed 6776250 entities.

==> dumpwikidatant-wikidata-20171122-truthy-BETA-1.log <==
Processed 6772414 entities.

==> dumpwikidatant-wikidata-20171122-truthy-BETA-2.log <==
Processed 6775991 entities.

==> dumpwikidatant-wikidata-20171122-truthy-BETA-3.log <==
Processed 6768160 entities.

==> dumpwikidatant-wikidata-20171122-truthy-BETA-4.log <==
Processed 6774085 entities.

==> dumpwikidatant-wikidata-20171122-truthy-BETA-5.log <==
Processed 6766093 entities.

Event Timeline

Here are three perf top screenshots in order, with a few minutes between the first and last one.

perf-top-1-done.png (250×770 px, 54 KB)

perf-top-2-done.png (447×812 px, 98 KB)

perf-top-3-done.png (754×892 px, 162 KB)

Mentioned in SAL (#wikimedia-operations) [2017-11-27T18:08:26Z] <hoo> Killed truthy nt dumpers stuck at 100% CPU (from last week) on snapshot1007 (T181385)

Not a single entity has been dumped since I created this ticket more than 7 hours ago, so I just killed all related processes.
Given that we will have a new dump this week anyway, I won't bother restarting it manually.

I just tested this briefly locally by dumping my wiki repeatedly (up to 100 times), but I couldn't see any kind of memory leak (although, again, I was dumping the same entities over and over).

Tomorrow I'll test this with actual data.

So I just ran one JSON dumper with --no-cache and one without on mwdebug1001 (with HHVM)… the results are baffling:

$ head -n10 dumpJson*.stderr 
==> dumpJson-no-cache.stderr <==
Dumping shard 0/4
memory_get_usage(): 18874368
Processed 23 entities.
memory_get_usage(): 25165824
Processed 41 entities.
memory_get_usage(): 33554432
Processed 60 entities.
memory_get_usage(): 35651584
Processed 81 entities.
memory_get_usage(): 37748736

==> dumpJson.stderr <==
Dumping shard 0/4
memory_get_usage(): 18874368
Processed 23 entities.
memory_get_usage(): 23068672
Processed 41 entities.
memory_get_usage(): 27262976
Processed 60 entities.
memory_get_usage(): 29360128
Processed 81 entities.
memory_get_usage(): 29360128

$ tail -n10 dumpJson*.stderr 
==> dumpJson-no-cache.stderr <==
memory_get_usage(): 553648128
Processed 19902 entities.
memory_get_usage(): 553648128
Processed 19924 entities.
memory_get_usage(): 553648128
Processed 19947 entities.
memory_get_usage(): 555745280
Processed 19977 entities.
memory_get_usage(): 555745280
Processed 20000 entities.

==> dumpJson.stderr <==
memory_get_usage(): 62914560
Processed 19902 entities.
memory_get_usage(): 62914560
Processed 19924 entities.
memory_get_usage(): 62914560
Processed 19947 entities.
memory_get_usage(): 62914560
Processed 19977 entities.
memory_get_usage(): 62914560
Processed 20000 entities.
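
(The per-batch memory lines can be produced with a single stderr write inside the dump loop; a minimal sketch, assuming the memory_get_usage( true ) real-usage variant, which matches the chunked values above - the actual local instrumentation may have differed.)

// Hypothetical local instrumentation: log allocated memory to stderr per batch.
fwrite( STDERR, 'memory_get_usage(): ' . memory_get_usage( true ) . "\n" );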

I'll look into this more tomorrow… I have no idea how this could be happening.

Mentioned in SAL (#wikimedia-operations) [2017-11-28T23:58:16Z] <hoo> Ran scap pull on mwdebug1001 after T181385 related testing

Mentioned in SAL (#wikimedia-operations) [2017-12-01T00:04:47Z] <hoo> Killed all remaining Wikidata JSON/RDF dumpers, due to T181385. This means no dumps this week!

hoo raised the priority of this task from High to Unbreak Now!. Dec 1 2017, 12:05 AM

When changing CacheRetrievingEntityRevisionLookup to always use its underlying EntityRevisionLookup ($this->lookup) on mwdebug1001, the call with --no-cache and the one without it show very similar memory usage behavior.

This suggests that removing --no-cache won't have any effect on Zend at all (as the cache hit rate on Zend is 0 anyway).
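
In other words, the diagnostic change was roughly the following (a simplified sketch, not the actual Wikibase class; the method name and class layout are assumptions):

// Simplified sketch of the diagnostic hack: make the cache-aware lookup always
// delegate to its underlying EntityRevisionLookup, as if --no-cache were set.
class CacheRetrievingEntityRevisionLookup {
	private $cache;   // cache layer, ignored for this test
	private $lookup;  // underlying EntityRevisionLookup

	public function getEntityRevision( $entityId ) {
		// Diagnostic: skip the cache entirely and go straight to the inner lookup.
		return $this->lookup->getEntityRevision( $entityId );
		// (Normally this would consult $this->cache first and only fall back
		// to $this->lookup on a miss.)
	}
}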

I tried digging into this some more, but couldn't find anything useful yet. I skimmed the whole wmf.6 → wmf.8 Wikidata and MediaWiki diff and couldn't see anything related; where something did look related, I manually reverted those bits on mwdebug1001, without success.

Mentioned in SAL (#wikimedia-operations) [2017-12-01T01:46:52Z] <hoo> Ran scap pull on mwdebug1001 after T181385 testing

hoo renamed this task from Wikidata truthy nt dumpers stuck with 100% CPU on snapshot1007 to Wikidata entity dumpers stuck with 100% CPU on snapshot1007. Dec 1 2017, 2:47 PM

So I finally managed to track this down:

In Revision::getRevisionText, MainWANObjectCache::getWithSetCallback is used to retrieve revisions.

As of 4432e898be078eea73fbf072f5a383ec1ded0d67, WANObjectCache uses a StatsdDataFactoryInterface for certain kinds of (often very high volume) logging. Given that MediaWiki uses a BufferingStatsdDataFactory by default, we basically accumulate all of these stat hits in memory until we eventually run out of it.

Just commenting out $this->buffer[] = $entity; in BufferingStatsdDataFactory::produceStatsdData (locally on snapshot1001) immediately fixed this.
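
To illustrate the failure mode (a self-contained toy sketch, not the actual MediaWiki classes): a buffering stats collector that only flushes at end-of-request keeps growing inside a long-running CLI loop that never reaches that point.

<?php
// Toy model of the problem: BufferingStatsdDataFactory appends every stat hit
// to an in-memory buffer that is normally flushed at the end of a web request.
// A dump script iterating over tens of millions of entities never flushes it.
class ToyBufferingStats {
	private $buffer = array();

	public function increment( $key ) {
		// The equivalent of "$this->buffer[] = $entity;" - one entry per call.
		$this->buffer[] = array( $key, 1 );
	}

	public function flush() {
		$this->buffer = array(); // never reached in a long-running maintenance script
	}
}

$stats = new ToyBufferingStats();
for ( $i = 0; $i < 50000000; $i++ ) {
	// Each entity load goes through WANObjectCache::getWithSetCallback, which
	// records stat hits like this one per lookup.
	$stats->increment( 'wanobjectcache.revisiontext.hit' );

	if ( $i % 1000000 === 0 ) {
		fwrite( STDERR, memory_get_usage( true ) . "\n" ); // grows without bound
	}
}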

hoo added subscribers: demon, Krinkle, Addshore, Legoktm.

I don't see a clear way to fix this, thus reassigning to Aaron.

  • We could set an (even excessively high) maximum buffer size in BufferingStatsdDataFactory.
  • We could remove the tracking from WANObjectCache :/
  • We could make sure the buffer in BufferingStatsdDataFactory is sent out more often… this would probably require dirty changes to BufferingStatsdDataFactory, or a decorator that takes care of this (see the sketch below)?
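
For the third option, a rough sketch of what such a decorator could look like (all names here are hypothetical, and getData()/clearData() merely stand in for whatever buffer access the real BufferingStatsdDataFactory exposes):

// Hypothetical decorator: forwards stat calls to the buffering factory, but sends
// and clears the buffer once it grows past a fixed threshold instead of waiting
// for the end of the request/script.
class FlushingStatsdDataFactory {
	private $inner;       // the real BufferingStatsdDataFactory
	private $sender;      // whatever actually ships the data to statsd
	private $maxBuffered;

	public function __construct( $inner, $sender, $maxBuffered = 10000 ) {
		$this->inner = $inner;
		$this->sender = $sender;
		$this->maxBuffered = $maxBuffered;
	}

	public function increment( $key ) {
		$this->inner->increment( $key );
		$this->maybeFlush();
	}

	private function maybeFlush() {
		if ( count( $this->inner->getData() ) >= $this->maxBuffered ) {
			$this->sender->send( $this->inner->getData() );
			$this->inner->clearData();
		}
	}
}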

Mentioned in SAL (#wikimedia-operations) [2017-12-02T03:51:21Z] <hoo> Ran "scap pull" on snapshot1001, after final T181385 tests

Change 394758 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] Wikidata weekly json and rdf dumps disabled temporarily

https://gerrit.wikimedia.org/r/394758

Change 394758 merged by ArielGlenn:
[operations/puppet@production] Wikidata weekly json and rdf dumps disabled temporarily

https://gerrit.wikimedia.org/r/394758

How long do these run? The sample rate in config is set to be extremely low. So perhaps:

  • The buffering class buffers things that won't even be saved
  • The buffering could be disabled in CLI mode (rough sketch below)
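
A rough sketch of that second idea (an assumption about the approach, not the actual patch; NullStatsdDataFactory is taken as MediaWiki's no-op implementation, and PHP_SAPI stands in for however CLI mode is detected):

// Sketch: hand the WAN cache a no-op stats collector when running from the
// command line, so high-volume cache stats are never buffered in the first place.
$wanCacheStats = ( PHP_SAPI === 'cli' )
	? new NullStatsdDataFactory()  // drops every data point
	: MediaWikiServices::getInstance()->getStatsdDataFactory();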

Change 394772 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Only send statsd data for WAN cache in non-CLI mode

https://gerrit.wikimedia.org/r/394772

Change 394779 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Try to opportunistically flush statsd data in maintenance scripts

https://gerrit.wikimedia.org/r/394779

This seems more general and preferable than the other patch, but leaving both here to look at.

How long do these run? The sample rate in config is set to be extremely low. So perhaps:

  • The buffering class buffers things that won't even be saved

That's definitely the case here, given that the sample rate is only applied later on.
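
To make that concrete (a toy illustration, not the real classes): the sample rate only thins the data when it is sent, so the buffer still grows by one entry per call no matter how low the rate is.

$buffer = array();
$sampleRate = 0.0001;

for ( $i = 0; $i < 1000000; $i++ ) {
	$buffer[] = 'wanobjectcache.hit'; // buffered unconditionally
}

// Sampling only happens here, long after the memory has already been spent:
$sent = array_filter( $buffer, function () use ( $sampleRate ) {
	return ( mt_rand() / mt_getrandmax() ) < $sampleRate;
} );
// count( $buffer ) === 1000000; count( $sent ) is tiny.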

Change 394772 merged by jenkins-bot:
[mediawiki/core@master] Only send statsd data for WAN cache in non-CLI mode

https://gerrit.wikimedia.org/r/394772

So I merged the patch that makes sure we only actually collect this data if not in CLI mode.

I also tried putting MediaWikiServices::getInstance()->getStatsdDataFactory()->setEnabled( false ); in DumpScript::__construct, but this only had limited effect. For that to properly work, BufferingStatsdDataFactory::produceStatsdData needs to normalize and set the entity's key before aborting. Any idea why that is, @aaron?
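
To illustrate the shape of what I mean (a simplified sketch with assumed helper names, not a verbatim copy of BufferingStatsdDataFactory): the key would still be normalized and set even when buffering is disabled, and only the append to the buffer would depend on the flag.

// Simplified sketch - not the actual MediaWiki code; helper names are assumptions.
public function produceStatsdData( $key, $count = 1, $metric = 'c' ) {
	$entity = $this->produceStatsdDataEntity();
	$entity->setKey( self::normalizeMetricKey( $key ) ); // happens regardless
	$entity->setValue( $count );
	$entity->setMetric( $metric );

	if ( $this->enabled ) {
		$this->buffer[] = $entity; // only this part depends on setEnabled()
	}

	return $entity;
}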

Change 395055 had a related patch set uploaded (by Hoo man; owner: Aaron Schulz):
[mediawiki/core@wmf/1.31.0-wmf.10] Only send statsd data for WAN cache in non-CLI mode

https://gerrit.wikimedia.org/r/395055

Change 395055 merged by jenkins-bot:
[mediawiki/core@wmf/1.31.0-wmf.10] Only send statsd data for WAN cache in non-CLI mode

https://gerrit.wikimedia.org/r/395055

Mentioned in SAL (#wikimedia-operations) [2017-12-04T18:35:16Z] <hoo@tin> Synchronized php-1.31.0-wmf.10/includes/objectcache/ObjectCache.php: Only send statsd data for WAN cache in non-CLI mode (T181385) (duration: 00m 44s)

Mentioned in SAL (#wikimedia-operations) [2017-12-04T18:40:40Z] <hoo> Ran scap pull on snapshot1001 (T181385)

Mentioned in SAL (#wikimedia-operations) [2017-12-04T18:46:40Z] <hoo> Started dumpwikidatajson.sh on snapshot1007 (T181385)

hoo lowered the priority of this task from Unbreak Now! to High. Dec 4 2017, 6:48 PM

Downgrading this, as this seems to be working again.

Keeping this open until we have re-enabled the cron and have seen that this actually works.

We had similar problems with long-running scripts and statsd here: T181716: forceSearchIndex.php hangs at the end of the process when running on large wikis. I think the combination of statsd and long-running maintenance scripts doesn't work well with some recent change there. Hopefully limiting it to non-CLI will fix it.

hoo claimed this task.
hoo removed a project: Patch-For-Review.
hoo added a subscriber: aaron.

Dump crons are back (af1f7dabee931dbdb7366b7be1f93698f2c56108), so I expect dumps to arrive on time next week!

Change 394779 merged by jenkins-bot:
[mediawiki/core@master] Try to opportunistically flush statsd data in maintenance scripts

https://gerrit.wikimedia.org/r/394779