
Building completion suggester indices seems to be much slower with the latest train rollout
Closed, Resolved · Public

Description

Spent a good part of the day on this in IRC, copying some relevant info here:

  • dcausse started a completion suggester build this morning; it ran all day (14 hours and still going)
  • Choosing a random day, 20170312, all completion suggester indices built in 8 hours
    • enwiki built 5.35M docs from 03:44:26 to 05:22:16
    • about 100 minutes, so 54,720 docs/minute
  • ebernhardson took a 5-minute tcpdump of all traffic between terbium and the codfw cluster hosts
    • Indexing bulks report an average of ~10ms
    • scrolls report an average of ~100ms, with peaks of ~200ms
    • 3088 /_bulk requests over 5 minutes, average content size of 41kB. This is for all 4 wikis currently indexing
    • Hard to filter packets to only one wiki, but assuming a 1/4 split that's 772 bulks per wiki, or 77,200 docs per 5 minutes, i.e. 15.4k docs per minute per wiki (see the worked numbers after this list)
    • enwiki is ~5.35M docs; at 15.4k docs/minute a build would take 5.5 hours or more.
    • 15.4k docs/minute, vs 54k docs/minute previously. Where did the time go?
    • 772 bulks in 5 minutes per wiki is 2.57 round trips per second
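
A quick back-of-the-envelope check of the tcpdump numbers above, as an illustrative Python sketch (the 3088 bulk requests and 100 docs per bulk are the figures from the capture; the even 1/4 split per wiki is the same assumption made above):

  # Rough throughput estimate from the 5-minute tcpdump capture.
  bulk_requests = 3088        # /_bulk requests seen in 5 minutes, across all 4 wikis
  wikis = 4                   # assume traffic splits evenly per wiki
  docs_per_bulk = 100         # current scroll/bulk batch size

  bulks_per_wiki = bulk_requests / wikis            # ~772
  docs_per_5min = bulks_per_wiki * docs_per_bulk    # ~77,200
  docs_per_minute = docs_per_5min / 5               # ~15,440

  enwiki_docs = 5_350_000
  hours = enwiki_docs / docs_per_minute / 60
  print(f"{docs_per_minute:.0f} docs/min, enwiki would take ~{hours:.1f} hours")
  # -> ~15440 docs/min and ~5.8 hours for enwiki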

Possible solutions:

  • Batch sizes are very small: the request/response cycle happens multiple times per second, and the current size is only 100 docs per scroll bulk (see the sketch after this list).
  • Figure out what happened to make this so much slower...
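
As a minimal sketch of the first idea, here is what a larger scroll batch looks like with the official Python Elasticsearch client (illustrative only: the real builder is CirrusSearch PHP code, the endpoint is a placeholder, and the call style follows the older 2.x/5.x-era Python client):

  from elasticsearch import Elasticsearch

  es = Elasticsearch(['http://localhost:9200'])  # placeholder endpoint

  # Each scroll round trip returns size docs, so for ~5.35M enwiki docs:
  #   size=100  -> ~53,500 cross-datacenter round trips
  #   size=5000 -> ~1,070 round trips
  resp = es.search(
      index='cawiki_content',
      scroll='15m',
      size=5000,                          # larger batch per scroll round trip
      body={'query': {'match_all': {}}},
  )
  scroll_id = resp['_scroll_id']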

Event Timeline

The main takeaway here seems to be that before we were doing 54k docs/minute, and now we are doing 15.4k docs/minute. 54k docs/minute at 100-doc batches seems surprising though: that's a scroll bulk round trip from eqiad to codfw 540 times per minute, or 9 times per second. That leaves 111ms per scroll bulk; with the two round trips taking up 60ms, that leaves ~50ms to process the 100 items.

The rate now is ~2.5 per second, or 400ms per scroll bulk, leaving ~360ms to process the 100 items.
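
The same time budget, written out (illustrative Python; the 60ms figure for the two eqiad/codfw round trips is the estimate used above, and the exact results differ slightly from the rounded figures in the text):

  # Time budget per scroll+bulk cycle at the old and new throughput.
  docs_per_batch = 100
  round_trip_overhead_ms = 60   # two eqiad<->codfw round trips, as estimated above

  for docs_per_minute in (54_000, 15_400):
      batches_per_sec = docs_per_minute / docs_per_batch / 60
      ms_per_batch = 1000 / batches_per_sec
      processing_ms = ms_per_batch - round_trip_overhead_ms
      print(f"{docs_per_minute} docs/min -> {batches_per_sec:.1f} batches/sec, "
            f"{ms_per_batch:.0f}ms per batch, ~{processing_ms:.0f}ms left for processing")
  # -> ~111ms/batch (~51ms processing) before vs ~390ms/batch (~330ms) now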

Ran some quick(ish) tests with cawiki trying different batch sizes. cawiki has 537533 docs.

docs per scroll | time taken | docs/sec | docs/min
5000            | 16m43s     | 535/sec  | 32k/min
100             | 23m31s     | 380/sec  | 22.8k/min

While unlikely to be the cause, it seemed worthwhile to look into how long it takes to process a batch of results from Elasticsearch into the format we index into the completion suggester. For this I took 5k docs and ran the same set through 1.29.0-wmf.15 and 1.29.0-wmf.16 on wasat.codfw.wmnet:

Script to collect docs: P5090
Script to benchmark: P5091

branch        | time to build 5k | per doc
1.29.0-wmf.16 | 1.997s           | 0.39ms
1.29.0-wmf.15 | 1.931s           | 0.38ms

Decided to bench the scroll as well, since the three basic things that happen are: scroll, build, bulk insert. The bulks are coming back with ~10ms of Elasticsearch time, so they don't seem likely to be the problem, and building was benchmarked above with no noticeable difference between wmf.15 and wmf.16. The took time reported below is the total time to iterate the entire scroll, with the only per-scroll work being echoing a . to the console. All scrolls were done against the cawiki_content index's page type, with the size set to 100 per call, the same as the default for building the suggester.
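
The actual benchmark scripts are the PHP pastes P5092/P5093; the sketch below is an illustrative Python equivalent of the same idea (scroll through the index doing nothing but printing a dot per batch, while recording wall time and the sum of the Elasticsearch took values). The endpoint is a placeholder and the call style follows the older 2.x/5.x-era Python client:

  import time
  from elasticsearch import Elasticsearch

  es = Elasticsearch(['http://localhost:9200'])  # placeholder endpoint

  start = time.time()
  es_took_ms = 0

  resp = es.search(index='cawiki_content', doc_type='page',
                   scroll='15m', size=100,
                   body={'query': {'match_all': {}}})
  while resp['hits']['hits']:
      es_took_ms += resp['took']        # server-side time for this batch
      print('.', end='', flush=True)    # the only per-batch "work"
      resp = es.scroll(scroll_id=resp['_scroll_id'], scroll='15m')

  print(f"\nwall time: {time.time() - start:.2f}s, es took: {es_took_ms / 1000:.2f}s")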

codfw docs: 537540
eqiad docs: 537540

from cluster | to cluster | test   | php code | took    | es took
codfw        | eqiad      | wmf.15 | P5092    | 138.12s | 40.72s
codfw        | eqiad      | wmf.15 | P5092    | 214.35s | 46.92s
codfw        | codfw      | wmf.16 | P5093    | 305.05s | 236.00s
codfw        | codfw      | wmf.16 | P5093    | 337.47s | 267.18s
eqiad        | codfw      | wmf.16 | P5093    | 506.26s | 244.68s

Seems this might be the explanation for our slowdown. Of interest, after seeing these results I ran again to count how many batches were issued, because the number of dots I saw differed between the two. In wmf.15, even though the scroll options specified a size of 100, only 1079 batches were issued, which works out to ~498 docs per scroll. wmf.16, on the other hand, respected the request for 100 docs and made 5376 requests.

Re-running wmf.16 with a batch size of 500:

from cluster | to cluster | test             | took    | es took
codfw        | codfw      | wmf.16-500batch  | 203.18s | 142.92s
codfw        | codfw      | wmf.16-5000batch | 168.8s  | 116.8s

The 500-doc batches used 1076 requests, about the same as wmf.15 against the 2.x cluster. That is still ~3x the processing time inside Elasticsearch compared to wmf.15 talking to Elasticsearch 2, but noticeably better than the 100-item batches. 5k-item batches shave off a bit more, although since I haven't re-run these tests it could just be jitter.

Poking around the Elasticsearch docs, I see that what we were doing previously with 'scan' was disabling any sorting on the Elasticsearch side. It looks like with the new scroll API we need to set the sort to _doc to get the same behaviour. Updating the wmf.16-500batch test to sort by _doc I get:

from cluster | to cluster | test                     | took    | es took
codfw        | codfw      | wmf.16-500batch-sort_doc | 110.56s | 58.56s

The time spent inside Elasticsearch is now only ~50% more than 2.x, rather than ~300% more.
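
The actual fix is in the CirrusSearch PHP code (patches below); as an illustrative sketch with the Python client, the scan-equivalent scroll just adds a sort on _doc (placeholder endpoint, older-client call style as in the earlier sketches):

  from elasticsearch import Elasticsearch

  es = Elasticsearch(['http://localhost:9200'])  # placeholder endpoint

  # Sorting on _doc tells Elasticsearch 5.x to return hits in index order,
  # skipping any scoring/sorting work -- the same behaviour the old 2.x
  # 'scan' scroll gave us for free.
  resp = es.search(
      index='cawiki_content',
      scroll='15m',
      size=500,
      body={
          'query': {'match_all': {}},
          'sort': ['_doc'],
      },
  )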

Change 343820 had a related patch set uploaded (by EBernhardson):
[mediawiki/extensions/CirrusSearch] Improve speed of scrolling results in comp suggest build

https://gerrit.wikimedia.org/r/343820

Change 343820 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch] Improve speed of scrolling results in comp suggest build

https://gerrit.wikimedia.org/r/343820

Change 343966 had a related patch set uploaded (by EBernhardson):
[mediawiki/extensions/CirrusSearch] Improve speed of scrolling results in comp suggest build

https://gerrit.wikimedia.org/r/343966

Change 343966 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch] Improve speed of scrolling results in comp suggest build

https://gerrit.wikimedia.org/r/343966

Deskana triaged this task as Medium priority.