
Building completion suggester indices seems to be much slower with the latest train rollout
Closed, Resolved · Public

Description

Spent a good part of the day on this in IRC, copying some relevant info here:

  • dcausse started a completion suggester build this morning; it ran all day (14 hours and still going)
  • Choosing a random day, 20170312, all completion suggester indices built in 8 hours
    • enwiki built 5.35M docs from 03:44:26 to 05:22:16
    • about 100 minutes, so 54,720 docs/minute
  • ebernhardson took a 5-minute tcpdump of all traffic between terbium and the codfw cluster hosts
    • Indexing bulks report an average of ~10ms
    • scrolls report an average of ~100ms, with peaks of ~200ms
    • 3088 /_bulk requests over 5 minutes, average content size of 41kB. This is for all 4 wikis currently indexing
    • Hard to filter packets to only one wiki, but assuming a 1/4 split that's 772 bulks per wiki, or 77,200 docs per 5 minutes, i.e. 15.4k docs per minute per wiki (see the worked numbers after this list)
    • enwiki is ~5.35M docs; at 15.4k docs/minute a build would take 5.5 hours or more.
    • 15.4k docs/minute, vs 54k docs/minute previously. Where did the time go?
    • 772 bulks in 5 minutes per wiki is 2.57 round trips per second
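
A quick back-of-the-envelope check of the tcpdump numbers above, as an illustrative Python sketch (the 3088 bulk requests and 100 docs per bulk are the figures from the capture; the even 1/4 split per wiki is the same assumption made above):

  # Rough throughput estimate from the 5-minute tcpdump capture.
  bulk_requests = 3088        # /_bulk requests seen in 5 minutes, across all 4 wikis
  wikis = 4                   # assume traffic splits evenly per wiki
  docs_per_bulk = 100         # current scroll/bulk batch size

  bulks_per_wiki = bulk_requests / wikis            # ~772
  docs_per_5min = bulks_per_wiki * docs_per_bulk    # ~77,200
  docs_per_minute = docs_per_5min / 5               # ~15,440

  enwiki_docs = 5_350_000
  hours = enwiki_docs / docs_per_minute / 60
  print(f"{docs_per_minute:.0f} docs/min, enwiki would take ~{hours:.1f} hours")
  # -> ~15440 docs/min and ~5.8 hours for enwiki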

Possible solutions:

  • Batch sizes are very small: the request/response cycle happens multiple times per second, and the current size is only 100 docs per scroll bulk (see the sketch after this list).
  • Figure out what happened to make this so much slower...
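
As a minimal sketch of the first idea, here is what a larger scroll batch looks like with the official Python Elasticsearch client (illustrative only: the real builder is CirrusSearch PHP code, the endpoint is a placeholder, and the call style follows the older 2.x/5.x-era Python client):

  from elasticsearch import Elasticsearch

  es = Elasticsearch(['http://localhost:9200'])  # placeholder endpoint

  # Each scroll round trip returns size docs, so for ~5.35M enwiki docs:
  #   size=100  -> ~53,500 cross-datacenter round trips
  #   size=5000 -> ~1,070 round trips
  resp = es.search(
      index='cawiki_content',
      scroll='15m',
      size=5000,                          # larger batch per scroll round trip
      body={'query': {'match_all': {}}},
  )
  scroll_id = resp['_scroll_id']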

Event Timeline

The main takeaway here seems to be that before we were doing 54k docs/minute, and now we are doing 15.4k docs/minute. 54k docs/minute at 100-doc batches seems surprising though: that's a scroll bulk round trip from eqiad to codfw 540 times per minute, or 9 times per second. That leaves 111ms per scroll bulk; with the two round trips taking up 60ms, that leaves ~50ms to process the 100 items.

The rate now is ~2.5 per second, or 400ms per scroll bulk, leaving ~360ms to process the 100 items.
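
The same time budget, written out (illustrative Python; the 60ms figure for the two eqiad/codfw round trips is the estimate used above, and the exact results differ slightly from the rounded figures in the text):

  # Time budget per scroll+bulk cycle at the old and new throughput.
  docs_per_batch = 100
  round_trip_overhead_ms = 60   # two eqiad<->codfw round trips, as estimated above

  for docs_per_minute in (54_000, 15_400):
      batches_per_sec = docs_per_minute / docs_per_batch / 60
      ms_per_batch = 1000 / batches_per_sec
      processing_ms = ms_per_batch - round_trip_overhead_ms
      print(f"{docs_per_minute} docs/min -> {batches_per_sec:.1f} batches/sec, "
            f"{ms_per_batch:.0f}ms per batch, ~{processing_ms:.0f}ms left for processing")
  # -> ~111ms/batch (~51ms processing) before vs ~390ms/batch (~330ms) now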

Ran some quick(ish) tests with cawiki trying different batch sizes. cawiki has 537533 docs.

docs per scroll | time taken | docs/sec | docs/min
5000            | 16m43s     | 535/sec  | 32k/min
100             | 23m31s     | 380/sec  | 22.8k/min

While unlikely to be the cause, it seemed worthwhile to look into how long it takes to process a batch of results from Elasticsearch into the format we index into the completion suggester. For this I took 5k docs and ran the same set through 1.29.0-wmf.15 and 1.29.0-wmf.16 on wasat.codfw.wmnet:

Script to collect docs: P5090
Script to benchmark: P5091

branch        | time to build 5k | per doc
1.29.0-wmf.16 | 1.997s           | 0.39ms
1.29.0-wmf.15 | 1.931s           | 0.38ms

Decided to bench the scroll as well, since the three basic things that happen are: scroll, build, bulk insert. The bulks are coming back with ~10ms of Elasticsearch time, so they don't seem likely to be the problem, and building was benchmarked above with no noticeable difference between wmf.15 and wmf.16. The took time reported below is the total time to iterate the entire scroll, with the only per-scroll work being echoing a . to the console. All scrolls were done against the cawiki_content index's page type, with the size set to 100 per call, the same as the default for building the suggester.
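
The actual benchmark scripts are the PHP pastes P5092/P5093; the sketch below is an illustrative Python equivalent of the same idea (scroll through the index doing nothing but printing a dot per batch, while recording wall time and the sum of the Elasticsearch took values). The endpoint is a placeholder and the call style follows the older 2.x/5.x-era Python client:

  import time
  from elasticsearch import Elasticsearch

  es = Elasticsearch(['http://localhost:9200'])  # placeholder endpoint

  start = time.time()
  es_took_ms = 0

  resp = es.search(index='cawiki_content', doc_type='page',
                   scroll='15m', size=100,
                   body={'query': {'match_all': {}}})
  while resp['hits']['hits']:
      es_took_ms += resp['took']        # server-side time for this batch
      print('.', end='', flush=True)    # the only per-batch "work"
      resp = es.scroll(scroll_id=resp['_scroll_id'], scroll='15m')

  print(f"\nwall time: {time.time() - start:.2f}s, es took: {es_took_ms / 1000:.2f}s")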

codfw docs: 537540
eqiad docs: 537540

from cluster | to cluster | test   | php code | took    | es took
codfw        | eqiad      | wmf.15 | P5092    | 138.12s | 40.72s
codfw        | eqiad      | wmf.15 | P5092    | 214.35s | 46.92s
codfw        | codfw      | wmf.16 | P5093    | 305.05s | 236.00s
codfw        | codfw      | wmf.16 | P5093    | 337.47s | 267.18s
eqiad        | codfw      | wmf.16 | P5093    | 506.26s | 244.68s

Seems this might be the explanation for our slowdown. Of interest, after seeing these results I ran again to count how many batches were issued, because the number of dots I saw differed between the two. In wmf.15, even though the scroll options specified a size of 100, only 1079 batches were issued, which works out to ~498 docs per scroll. wmf.16, on the other hand, respected the request for 100 docs and made 5376 requests.

Re-running wmf.16 with a batch size of 500:

from cluster | to cluster | test             | took    | es took
codfw        | codfw      | wmf.16-500batch  | 203.18s | 142.92s
codfw        | codfw      | wmf.16-5000batch | 168.8s  | 116.8s

The 500-doc batches used 1076 requests, about the same as wmf.15 against the 2.x cluster. That is still ~3x the processing time inside Elasticsearch compared to wmf.15 talking to Elasticsearch 2, but noticeably better than the 100-item batches. 5k-item batches shave off a bit more, although since I haven't re-run these tests it could just be jitter.

Poking around the Elasticsearch docs, I see that what we were doing previously with 'scan' was disabling any sorting on the Elasticsearch side. It looks like with the new scroll API we need to set the sort to _doc to get the same behaviour. Updating the wmf.16-500batch test to sort by _doc I get:

from cluster | to cluster | test                     | took    | es took
codfw        | codfw      | wmf.16-500batch-sort_doc | 110.56s | 58.56s

The time spent inside Elasticsearch is now only ~50% more than 2.x, rather than ~300% more.
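
The actual fix is in the CirrusSearch PHP code (patches below); as an illustrative sketch with the Python client, the scan-equivalent scroll just adds a sort on _doc (placeholder endpoint, older-client call style as in the earlier sketches):

  from elasticsearch import Elasticsearch

  es = Elasticsearch(['http://localhost:9200'])  # placeholder endpoint

  # Sorting on _doc tells Elasticsearch 5.x to return hits in index order,
  # skipping any scoring/sorting work -- the same behaviour the old 2.x
  # 'scan' scroll gave us for free.
  resp = es.search(
      index='cawiki_content',
      scroll='15m',
      size=500,
      body={
          'query': {'match_all': {}},
          'sort': ['_doc'],
      },
  )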

Change 343820 had a related patch set uploaded (by EBernhardson):
[mediawiki/extensions/CirrusSearch] Improve speed of scrolling results in comp suggest build

https://gerrit.wikimedia.org/r/343820

Change 343820 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch] Improve speed of scrolling results in comp suggest build

https://gerrit.wikimedia.org/r/343820

Change 343966 had a related patch set uploaded (by EBernhardson):
[mediawiki/extensions/CirrusSearch] Improve speed of scrolling results in comp suggest build

https://gerrit.wikimedia.org/r/343966

Change 343966 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch] Improve speed of scrolling results in comp suggest build

https://gerrit.wikimedia.org/r/343966

Deskana triaged this task as Medium priority.