
GlobalPreferences deploy caused a significant increase in reads on s3
Closed, Resolved · Public · 8 Estimated Story Points

Event Timeline

Niharika created this task.

Grabbing this because, unlike Sam, I have prod access for investigation.

We are logging SQL queries for testwiki (testwiki.log). I've done some slicing through these; however, the enabling of GP roughly coincides with the train deployment of a new MW version, making precise analysis hard. Specifically, there's a huge increase in external store queries (due to some MCR work? due to somehow increased traffic?):

maxsem@mwlog1001:~$ for file in *qc.log; do total=`wc -l $file`; count=`grep 'SELECT  blob_text' $file | wc -l`; echo "$file $count $total" ; done
21qc.log 81270 305302 21qc.log
22qc.log 3714588 4874350 22qc.log
23qc.log 1926239 2836952 23qc.log

And a corresponding increase in readonly checks:

maxsem@mwlog1001:~$ for file in *qc.log; do total=`wc -l $file`; count=`grep "SHOW GLOBAL VARIABLES LIKE 'read_only'" $file | wc -l`; echo "$file $count $total" ; done
21qc.log 14419 305302 21qc.log
22qc.log 658799 4874350 22qc.log
23qc.log 326875 2836952 23qc.log

Neither of these dropped after GP was disabled, so this is clearly a wrong lead. @Marostegui, is there a record somewhere in Tendril of which queries were causing the slowdown?

Hey @MaxSem - thanks for digging into this.
I am not sure I understand the relationship between the train and this issue.
What I saw is that this config change matches exactly the spike on the graphs below:

17:35 maxsem@tin: Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/420947/ (duration: 01m 15s)

https://grafana.wikimedia.org/dashboard/db/mysql?panelId=8&fullscreen&orgId=1&var-dc=eqiad prometheus/ops&var-server=db1077&var-port=9104&from=1521637034143&to=1521681962143
https://grafana.wikimedia.org/dashboard/db/mysql?panelId=3&fullscreen&orgId=1&var-dc=eqiad prometheus/ops&var-server=db1077&var-port=9104&from=1521637034143&to=1521681962143

Handler_read_next (https://dev.mysql.com/doc/refman/5.7/en/server-status-variables.html#statvar_Handler_read_next) matches the huge spike on the first graph (sorts).

As soon as that was reverted, the values came back to normal, so I'm not sure why the train is involved here.
Thanks for investigating!

@Marostegui (CC @jcrespo) can we get an idea of which queries were causing the slowdown? This is getting hard to debug without that data.
If it's not possible to get the data without the code being deployed, can we go ahead and re-enable this on testwiki for a couple of hours to observe?

@Niharika as you might know, our DBAs are out for the end of this week; do you think this can wait until Monday? If not, let me know and I'll try to have a look, although I'm missing some context on what analysis has already been done.

Quick answer: the queries were not slow (or not yet) but they are probably not correct as they were scanning lots and lots of rows.

Thankfully it was only enabled on testwiki (and it was already visible on the graphs); otherwise it would have caused a lot more issues.

@Volans Yes, this can wait until Monday. We need to take a look at the queries which were causing this and I suspect we'll have to re-enable this on test wiki for a short while to get that data.

@MaxSem: Would it be worth doing a temporary deployment to test wiki to take a closer look at the queries, or would it be possible to recreate the issue on Beta Labs?

@MaxSem: Want to schedule a temporary deployment for sometime next week so that we can get more data?

I'll discuss the action plan with Sam.

Max, can you also mention the action plan here? It's a good idea to have everyone on the same page.
Thanks.

We decided to add more logging as a starting point.
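
For context, here is a rough PHP sketch of what such logging might look like, assuming the 'preferences' channel added by the config patch below; the helper name and call site are illustrative, not the actual patches:

<?php
// Illustrative only, not the actual patch: emit a debug line on the
// 'preferences' channel each time preferences are loaded, so the logs show
// how often (and from where) that happens per request.
use MediaWiki\Logger\LoggerFactory;

/**
 * Hypothetical helper; the real patches below add logging to
 * DefaultPreferencesFactory and to the GlobalPreferences extension.
 */
function logPreferencesLoad( User $user ) {
	LoggerFactory::getInstance( 'preferences' )->debug(
		'Loading preferences for {user} (called from {caller})',
		[
			'user' => $user->getName(),
			'caller' => wfGetCaller( 3 ),
		]
	);
}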

Change 425199 had a related patch set uploaded (by MaxSem; owner: MaxSem):
[mediawiki/core@master] Log DefaultPreferencesFactory usage

https://gerrit.wikimedia.org/r/425199

Change 425201 had a related patch set uploaded (by MaxSem; owner: MaxSem):
[mediawiki/extensions/GlobalPreferences@master] Add logging

https://gerrit.wikimedia.org/r/425201

Change 425199 merged by jenkins-bot:
[mediawiki/core@master] Log DefaultPreferencesFactory usage

https://gerrit.wikimedia.org/r/425199

Change 425210 had a related patch set uploaded (by MaxSem; owner: MaxSem):
[operations/mediawiki-config@master] Add logging channel for preference stuff

https://gerrit.wikimedia.org/r/425210

Change 425201 merged by jenkins-bot:
[mediawiki/extensions/GlobalPreferences@master] Add logging

https://gerrit.wikimedia.org/r/425201

@Marostegui and @jcrespo, we've fixed the other problems and introduced logging to pinpoint this bug. I'd like to try deploying to group0 wikis again to see how it works now and collect some logs - is that fine by you?

Can you specify which wikis are in group0?

mediawiki.org; test.wikipedia.org; test2.wikipedia.org for this task.

I'm fine with that; we should be able to revert quickly if we see issues again.

@Marostegui Thank you. We'll keep a watch on Tendril and Logstash, and we'll revert if we see issues.

Check also the MySQL graphs for the s3 slaves.

Change 425210 merged by jenkins-bot:
[operations/mediawiki-config@master] Add logging channel for preference stuff

https://gerrit.wikimedia.org/r/425210

We tried running it for ~1 hour and got the same increase in DB load. The new logging is available in the preferences channel. @Samwilson says he has some ideas.

Change 425740 had a related patch set uploaded (by Samwilson; owner: Samwilson):
[mediawiki/extensions/GlobalPreferences@master] Add instance cache for global prefs storage

https://gerrit.wikimedia.org/r/425740

So it looks to me like (for a logged-in user's request) the user's options are loaded three times: once when loading the session, once when setting $wgUser, and once in the ResourceLoaderContext.

When GP is enabled, each of these adds an additional database query, plus one more on top of that, because GP has no equivalent of the User class's override cache.

(I'm no doubt oversimplifying things, but anyway, this is about that additional query.)

So, we could add a runtime cache to GP so that it only fetches things once? Something like this: https://gerrit.wikimedia.org/r/#/c/425740/
But I'm not really sure that one additional query is what's causing that huge increase in reads. It seems excessive, doesn't it?
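
For illustration, a minimal PHP sketch of the per-request cache idea, with hypothetical class and method names rather than the actual GlobalPreferences storage code: the first load queries the central database, and later loads within the same request return the memoized result.

<?php
// Hypothetical sketch of a per-request (instance) cache for global preference
// storage; the names are illustrative, not the real GlobalPreferences classes.
class CachedGlobalPrefsStorage {
	/** @var array|null Preferences memoized for the lifetime of this request */
	private $cache = null;

	/** @var callable Callback that actually queries the central database */
	private $fetch;

	public function __construct( callable $fetch ) {
		$this->fetch = $fetch;
	}

	public function load() {
		if ( $this->cache === null ) {
			// Only the first call in a request hits the database.
			$this->cache = call_user_func( $this->fetch );
		}
		return $this->cache;
	}

	public function invalidate() {
		// Call this after the user saves preferences so stale values
		// aren't served for the rest of the request.
		$this->cache = null;
	}
}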

So, sorry if I got your hopes up! I had been thinking earlier that there was some more obvious magic going on in User::loadOptions(), but there isn't.

This measure is definitely good; however, the overload happens on the local database, not on centralauth.

We are fans of caching! :-)

Our worry is that if some of the least used wikis create this increase, then scaling up to the large wikis could mean an outage (the load of group0 is less than 1% of the total). Hope the context of our fears is clear.

Change 426841 had a related patch set uploaded (by Samwilson; owner: Samwilson):
[mediawiki/extensions/GlobalPreferences@master] Add runtime cache for central IDs

https://gerrit.wikimedia.org/r/426841

Could this be related to looking up the central ID? I've attempted to reproduce the bug locally by running 1000 requests, and I do see a considerable increase in Handler_read_next:

Without GP:

handlerreadnext_withoutGP.png (400×531 px, 21 KB)

With GP:

handlerreadnext_withGP.png (400×531 px, 20 KB)

With GP (patched):

handlerreadnext_withGP_patched.png (399×530 px, 12 KB)

So the patch https://gerrit.wikimedia.org/r/#/c/426841 does improve things, but the numbers are still higher (sorry, the graphs above have different Y-axis scales). And of course this is all very rough (single DB server, one user).

I wouldn't have thought queries like this could be a possible culprit:

SELECT /* LocalIdLookup::lookupUserNames  */  user_id,user_name  FROM `user`    WHERE user_name = 'Admin'

Oh, but if both of the above patches are applied, things look a bit better (still higher):

handlerreadnext_withGP_2patches.png (398×531 px, 21 KB)
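
For reference, a hedged PHP sketch of the central-ID memoization idea behind patch 426841 (class and method names here are hypothetical; it assumes MediaWiki's CentralIdLookup::factory()->centralIdFromLocalUser()): cache the resolved central ID per local user for the duration of the request, so the user-table lookup runs at most once per user.

<?php
// Hypothetical sketch, not the actual change 426841: memoize central-ID
// lookups per request, keyed by local user ID, so the underlying user-table
// query (like the LocalIdLookup one above) runs at most once per user.
class CentralIdCache {
	/** @var int[] Map of local user ID => central user ID for this request */
	private static $ids = [];

	public static function getCentralId( User $user ) {
		$localId = $user->getId();
		if ( !isset( self::$ids[$localId] ) ) {
			// Assumes CentralIdLookup::factory()->centralIdFromLocalUser(),
			// the usual way MediaWiki resolves a central ID for a local user.
			self::$ids[$localId] = CentralIdLookup::factory()
				->centralIdFromLocalUser( $user );
		}
		return self::$ids[$localId];
	}
}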

Change 426841 merged by jenkins-bot:
[mediawiki/extensions/GlobalPreferences@master] Add runtime cache for central IDs

https://gerrit.wikimedia.org/r/426841

Change 425740 abandoned by Samwilson:
Add instance cache for global prefs storage

Reason:
Is being done in https://gerrit.wikimedia.org/r/#/c/427544/ in a better way.

https://gerrit.wikimedia.org/r/425740

With caching in place, we will try again at 15:00 SF time this Wednesday.

@MaxSem Noticeable, yes. But is it temporary?

For my analysis of last week's testing, I used a different strategy than before: instead of comparing logs for time periods of the same length, I compared logs with the same line count.

Both files in the comparison had 2149813 lines; however, the share of DB queries in them was wildly different:

maxsem@mwlog1001:~$ wc -l t2*
   461628 t2g.log     # With GP
   285585 t2ng.log    # Without GP

which by itself suggests a difference in DB activity per request. Request count:

maxsem@mwlog1001:~$ grep 'DEBUG: Start request' t1ng.log | wc -l
21104    # Without GP
maxsem@mwlog1001:~$ grep 'DEBUG: Start request' t1g.log | wc -l
13029    # With GP

The most popular queries without GP:

maxsem@mwlog1001:~$ sed -r "s/'.*?'/?/" t2ng.log | sed -r "s/\\b\\d \\b/?/" | sed -r "s/blobs_cluster\\d /blobs_cluster?/" | sort | uniq -c | sort -rn
  35334 SELECT  page_id,page_len,page_is_redirect,page_latest,page_content_model,page_lang  FROM `page`    WHERE page_namespace = ?  LIMIT ?
  22423 SELECT  blob_text  FROM `blobs_cluster?`    WHERE blob_id = ?  LIMIT ?
  20339 SELECT  page_touched,pp_propname  FROM `page` LEFT JOIN `page_props` ON ((page_id=pp_page) AND pp_propname = ?  LIMIT ?
  18387 BEGIN
  16585 SELECT ts FROM heartbeat.heartbeat WHERE shard = ? ORDER BY ts DESC LIMIT ?
  10306 SELECT  user_id,user_name,user_real_name,user_email,user_touched,user_token,user_email_authenticated,user_email_token,user_email_token_expires,user_registration,user_editcount  FROM `user`    WHERE user_id = ?  LIMIT ?
   9910 SELECT  page_namespace,page_title,page_id,page_len,page_is_redirect,page_latest,page_content_model,page_lang  FROM `page`    WHERE page_id = ?  LIMIT ?
   9399 SELECT  ug_user,ug_group,ug_expiry  FROM `user_groups`    WHERE ug_user = ?
   9301 SELECT  user_id  FROM `user`    WHERE user_name = ?  LIMIT ?
   8810 SELECT  up_property,up_value  FROM `user_properties`    WHERE up_user = ?
   8169 SELECT  page_id,page_namespace,page_title,page_restrictions,page_is_redirect,page_is_new,page_random,page_touched,page_links_updated,page_latest,page_len,page_content_model,page_lang  FROM `page`    WHERE page_namespace = ?  LIMIT ?
   7929 SELECT  pp_value  FROM `page_props`    WHERE pp_propname = ?  LIMIT ?
   5561 COMMIT
   4475 SELECT  rev_id,rev_page,rev_text_id,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,COALESCE( comment_rev_comment.comment_text, rev_comment ) AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,rev_user,rev_user_text,NULL AS `rev_actor`,rev_content_format,rev_content_model,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,user_name  FROM `revision` LEFT JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) LEFT JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id)) INNER JOIN `page` ON ((page_id = rev_page)) LEFT JOIN `user` ON ((rev_user != ?) AND (user_id = rev_user))   WHERE page_namespace = ? AND (rev_id=page_latest)  LIMIT 1
   4296 SHOW GLOBAL VARIABLES LIKE ?
   4064 SHOW SESSION VARIABLES LIKE ?
...

While with GP:

maxsem@mwlog1001:~$ sed -r "s/'.*?'/?/" t2g.log | sed -r "s/\\b[0-9] \\b/?/" | sed -r "s/blobs_cluster[0-9] /blobs_cluster?/" | sort | uniq -c | sort -rn | head -n 100
 213026 SELECT  blob_text  FROM `blobs_cluster?`    WHERE blob_id = ?  LIMIT ?
  24988 SELECT  page_id,page_len,page_is_redirect,page_latest,page_content_model,page_lang  FROM `page`    WHERE page_namespace = ?  LIMIT ?
  22024 SELECT  blob_text  FROM `blobs`    WHERE blob_id = ?  LIMIT ?
  15009 BEGIN
  14067 SELECT ts FROM heartbeat.heartbeat WHERE shard = ? ORDER BY ts DESC LIMIT ?
  12765 SELECT  page_touched,pp_propname  FROM `page` LEFT JOIN `page_props` ON ((page_id=pp_page) AND pp_propname = ?  LIMIT ?
  10010 SELECT  user_id,user_name,user_real_name,user_email,user_touched,user_token,user_email_authenticated,user_email_token,user_email_token_expires,user_registration,user_editcount  FROM `user`    WHERE user_id = ?  LIMIT ?
   9138 SELECT  page_namespace,page_title,page_id,page_len,page_is_redirect,page_latest,page_content_model,page_lang  FROM `page`    WHERE page_id = ?  LIMIT ?
   8676 SELECT  ug_user,ug_group,ug_expiry  FROM `user_groups`    WHERE ug_user = ?
   8556 SELECT  up_property,up_value  FROM `user_properties`    WHERE up_user = ?
   7490 SELECT  user_id  FROM `user`    WHERE user_name = ?  LIMIT ?
   6715 SELECT  page_id,page_namespace,page_title,page_restrictions,page_is_redirect,page_is_new,page_random,page_touched,page_links_updated,page_latest,page_len,page_content_model,page_lang  FROM `page`    WHERE page_namespace = ?  LIMIT ?
   6393 SELECT  pp_value  FROM `page_props`    WHERE pp_propname = ?  LIMIT ?
   4715 COMMIT
...

Conclusion: GlobalPreferences breaks page blob caching. Which brings us back to that One Line Of Doom that we can't remove due to T153256: Unable to overwrite services using MediaWikiServices hook.
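
To make the ordering problem concrete, here is an illustrative PHP sketch of the pattern the follow-up patches below move towards (not the actual code): don't resolve anything from MediaWikiServices while extensions are still registering; defer the lookup to a hook that runs after setup, so the service container is only built once all extensions are loaded.

<?php
// Illustrative only, not the actual patches. The problematic pattern is
// resolving a service at extension-registration time, which forces the whole
// MediaWikiServices container to be built before all extensions are loaded.
use MediaWiki\MediaWikiServices;

// Eager (problematic): runs while extensions are still being registered.
// $factory = MediaWikiServices::getInstance()->getPreferencesFactory();

// Lazy (preferred): resolve the service only when a request actually needs it,
// e.g. in a hook handler that runs well after setup has finished (as one
// would register in LocalSettings.php or an extension's setup code).
$wgHooks['GetPreferences'][] = function ( User $user, array &$preferences ) {
	// Assumes MediaWikiServices::getPreferencesFactory() (MediaWiki 1.31+).
	$factory = MediaWikiServices::getInstance()->getPreferencesFactory();
	// ... use $factory here instead of capturing it during registration ...
	return true;
};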

Change 430012 had a related patch set uploaded (by MaxSem; owner: MaxSem):
[mediawiki/core@master] Don't initialize MediaWikiServices before extensions have been loaded

https://gerrit.wikimedia.org/r/430012

TBolliger edited projects, added Community-Tech; removed Community-Tech-Sprint.

Taking this out of the sprint and assigning to @kaldari to find a DBA to code review and test this patch. If this patch does not alleviate these database spikes, we need to have a meeting to discuss the future of this project.

Change 402413 had a related patch set uploaded (by MaxSem; owner: MaxSem):
[mediawiki/extensions/GlobalPreferences@master] Don't force PreferencesFactory initialization

https://gerrit.wikimedia.org/r/402413

Above is the second part of the fix.

Change 430012 merged by jenkins-bot:
[mediawiki/core@master] Don't initialize MediaWikiServices before extensions have been loaded

https://gerrit.wikimedia.org/r/430012

Change 402413 merged by jenkins-bot:
[mediawiki/extensions/GlobalPreferences@master] Don't force PreferencesFactory initialization

https://gerrit.wikimedia.org/r/402413

Let's try another test deployment, say on Tuesday after the train?

Can you schedule that?

Change 433149 had a related patch set uploaded (by Samwilson; owner: Samwilson):
[operations/mediawiki-config@master] Deploy GlobalPreferences to test wikis and mw.org (fourth time)

https://gerrit.wikimedia.org/r/433149

Change 433149 merged by jenkins-bot:
[operations/mediawiki-config@master] Deploy GlobalPreferences to test wikis and mw.org (fourth time)

https://gerrit.wikimedia.org/r/433149

Mentioned in SAL (#wikimedia-operations) [2018-05-15T22:39:09Z] <samwilson@tin> Synchronized wmf-config/InitialiseSettings.php: Deploying GlobalPreferences T190425 (duration: 01m 21s)

During the deployment, we saw no increase in DB load, so we decided to leave it live overnight. In case something goes wrong, it can be reverted by deploying https://gerrit.wikimedia.org/r/#/c/433291/

@Marostegui or @jcrespo, does it look fine now?

It is looking a lot better now. I would suggest waiting for 24h and then deploying to some more wikis to see how it goes.

@MaxSem can you please update this ticket? Can we close as resolved?

MaxSem moved this task from Needs Review/Feedback to Q1 2018-19 on the Community-Tech-Sprint board.
MaxSem removed a project: Patch-For-Review.

\m/ 😂 \m/