Ensure getLagTimes.php is working properly
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	Krinkle
	Aug 4 2017, 7:27 PM

Description

During the ops-perf sync up meeting last Wednesday, we looked at the dashboard for the getLagTimes data, https://grafana.wikimedia.org/dashboard/db/mediawiki-mysql-loadbalancer and it wasn't sure whether or not it is working properly due to the graphs being almost entirely empty.

This task is to verify the script works, is enabled in production, and is making its way to statsd/Graphite/Grafana correctly.

=> Warnings: https://logstash.wikimedia.org/goto/dee7ad13581009537748b0e19bce2d09

Details

	Subject	Repo	Branch	Lines /-
	rdbms: Avoid selectDB() call in LoadMonitor new connections	mediawiki/core	master	1 -1

Customize query in gerrit

Related Objects

Mentioned Here: rMWa7f993c7f256: Improve getLagTimes.php output and add statsD flag
rMW5a668d5d107b: Send integer ms to DB lag time guage instead of seconds
T149210: Set cron script to dump MediaWiki DB lag times into statsd

Event Timeline

Krinkle created this task.Aug 4 2017, 7:27 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptAug 4 2017, 7:27 PM

The script is part of MediaWiki core:

It gets run from cron, as ensured by mediawiki::maintenance in Puppet:

The dashboard in Grafana, while mostly zeroes, there are a few non-zero values. In the last 30 days, I only found non-zero values for extension or external stores, though.

Screen Shot 2017-08-04 at 13.06.34.png (1×1 px, 179 KB)

As for the core DBs, the zeroes do appear to be "real", as in, they are reported fresh each minute from the cron job. I verified on graphite1001 that the metric files are routinely written to (not stale impressions from statsd). And when a server is depooled, there is a gap (implied null) for that metric, instead of a zero.

Also checked on terbium, and did spot the process getting spawned every minute:

[20:01 UTC] krinkle at terbium.eqiad.wmnet in ~
$ ps aux | grep getLagTimes
www-data  8843  0.0  0.0   4340   760 ?        Ss   20:01   0:00 /bin/sh -c /usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null
www-data  8845  0.0  0.0  13236  3012 ?        S    20:01   0:00 /bin/bash /usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report
www-data  8853 26.0  0.1 318948 53180 ?        R    20:01   0:00 php5 /srv/mediawiki-staging/multiversion/MWScript.php maintenance/getLagTimes.php --wiki aawiki --report

I want to reopen this because I belive this script is causing DBConnection log spam, because it is trying to connect to aawiki on hosts that don't have such a wiki. I know it is terbium, and I know it is a maintenance script, but I am not 100% it is this one, but this is my best guess. Could you have a look at it?

Probably it is not interfering with the functionality, but it is creating 50K "Errors connectiong" evenly distributed, every day; the main annoyance is that it makes DBConnection log useless to debug other issues:
https://logstash.wikimedia.org/goto/f5e45ddd44e6646b9a4c26856aaaff33

(BTW, unrelated, I have plans to make this obsolete in the future by polling replication lag every second, every host. That, combined with mediawiki etcd-controlled pooling would make for more accurate measures. Stay tuned- but don't hold your breath).

• Marostegui subscribed.Aug 18 2017, 9:37 AM

Hm.. seems plausible indeed.

/maintenance/getLagTimes.php:

GetLagTimes::execute() -> MediaWikiServices::getDBLoadBalancerFactory() -> $lb->getLagTimes()
LoadBalancer::getLagTimes() -> LoadMonitor::getLagTimes() -> LoadMonitor::getServerStates( .. $domain = false )
$this->parent->openConnection( $i, $domain );

https://logstash.wikimedia.org/goto/dee7ad13581009537748b0e19bce2d09

Too bad Logstash entries for command-line scripts don't report which script it is running in, so we can't tell for sure what script on terbium is causing it. For entries from web requests we automatically add url and http_method, we could so something similar for maintenance scripts and add something like cli_maint_class or cli_argv.

Change 372766 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Avoid selectDB() call in LoadMonitor new connections

https://gerrit.wikimedia.org/r/372766

gerritbot added a project: Patch-For-Review.Aug 19 2017, 5:49 PM

Change 372766 merged by jenkins-bot:
[mediawiki/core@master] rdbms: Avoid selectDB() call in LoadMonitor new connections

https://gerrit.wikimedia.org/r/372766

jcrespo awarded a token.Aug 22 2017, 4:56 PM

ReleaseTaggerBot added a project: MW-1.30-release-notes.Aug 22 2017, 5:00 PM

This is rolling out in the wmf.15 branch this week. Keeping this task open until we've verified that the warnings are gone (which I expect will happen on Thursday when it reaches the last deployment group.)

Krinkle updated the task description. (Show Details)Aug 23 2017, 7:22 PM

This certainly is working, I would close it as resolved: https://logstash.wikimedia.org/goto/3d73a71beb3d6fa2bd5f8740b0f47fab Thank you!

aaron closed this task as Resolved.Aug 25 2017, 8:03 PM

	F8948332: Screen Shot 2017-08-04 at 13.06.34.png
	Aug 4 2017, 8:06 PM

Ensure getLagTimes.php is working properlyClosed, ResolvedPublicActions

Description

Details

Related Objects

Event Timeline

Ensure getLagTimes.php is working properly
Closed, ResolvedPublic
Actions