Page MenuHomePhabricator

Ensure getLagTimes.php is working properly
Closed, ResolvedPublic

Description

During the ops-perf sync up meeting last Wednesday, we looked at the dashboard for the getLagTimes data, https://grafana.wikimedia.org/dashboard/db/mediawiki-mysql-loadbalancer and it wasn't sure whether or not it is working properly due to the graphs being almost entirely empty.

This task is to verify the script works, is enabled in production, and is making its way to statsd/Graphite/Grafana correctly.

=> Warnings: https://logstash.wikimedia.org/goto/dee7ad13581009537748b0e19bce2d09

Event Timeline

Krinkle closed this task as Resolved.EditedAug 4 2017, 8:06 PM
Krinkle claimed this task.
Krinkle triaged this task as Medium priority.
Krinkle moved this task from Inbox, needs triage to Doing (old) on the Performance-Team board.

The script is part of MediaWiki core:

It gets run from cron, as ensured by mediawiki::maintenance in Puppet:

The dashboard in Grafana, while mostly zeroes, there are a few non-zero values. In the last 30 days, I only found non-zero values for extension or external stores, though.

Screen Shot 2017-08-04 at 13.06.34.png (1×1 px, 179 KB)

As for the core DBs, the zeroes do appear to be "real", as in, they are reported fresh each minute from the cron job. I verified on graphite1001 that the metric files are routinely written to (not stale impressions from statsd). And when a server is depooled, there is a gap (implied null) for that metric, instead of a zero.

Also checked on terbium, and did spot the process getting spawned every minute:

[20:01 UTC] krinkle at terbium.eqiad.wmnet in ~
$ ps aux | grep getLagTimes
www-data  8843  0.0  0.0   4340   760 ?        Ss   20:01   0:00 /bin/sh -c /usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null
www-data  8845  0.0  0.0  13236  3012 ?        S    20:01   0:00 /bin/bash /usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report
www-data  8853 26.0  0.1 318948 53180 ?        R    20:01   0:00 php5 /srv/mediawiki-staging/multiversion/MWScript.php maintenance/getLagTimes.php --wiki aawiki --report
jcrespo reopened this task as Open.EditedAug 18 2017, 9:31 AM
jcrespo subscribed.

I want to reopen this because I belive this script is causing DBConnection log spam, because it is trying to connect to aawiki on hosts that don't have such a wiki. I know it is terbium, and I know it is a maintenance script, but I am not 100% it is this one, but this is my best guess. Could you have a look at it?

Probably it is not interfering with the functionality, but it is creating 50K "Errors connectiong" evenly distributed, every day; the main annoyance is that it makes DBConnection log useless to debug other issues:
https://logstash.wikimedia.org/goto/f5e45ddd44e6646b9a4c26856aaaff33

(BTW, unrelated, I have plans to make this obsolete in the future by polling replication lag every second, every host. That, combined with mediawiki etcd-controlled pooling would make for more accurate measures. Stay tuned- but don't hold your breath).

Hm.. seems plausible indeed.

/maintenance/getLagTimes.php:

  • GetLagTimes::execute() -> MediaWikiServices::getDBLoadBalancerFactory() -> $lb->getLagTimes()
  • LoadBalancer::getLagTimes() -> LoadMonitor::getLagTimes() -> LoadMonitor::getServerStates( .. $domain = false )
  • $this->parent->openConnection( $i, $domain );

https://logstash.wikimedia.org/goto/dee7ad13581009537748b0e19bce2d09

Too bad Logstash entries for command-line scripts don't report which script it is running in, so we can't tell for sure what script on terbium is causing it. For entries from web requests we automatically add url and http_method, we could so something similar for maintenance scripts and add something like cli_maint_class or cli_argv.

Change 372766 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Avoid selectDB() call in LoadMonitor new connections

https://gerrit.wikimedia.org/r/372766

Change 372766 merged by jenkins-bot:
[mediawiki/core@master] rdbms: Avoid selectDB() call in LoadMonitor new connections

https://gerrit.wikimedia.org/r/372766

This is rolling out in the wmf.15 branch this week. Keeping this task open until we've verified that the warnings are gone (which I expect will happen on Thursday when it reaches the last deployment group.)