Scripts that use config that switches based on which datacenter is authoritative* need to handle the case where they end up on the minority side of an etcd split-brain, which causes their view of the config to go stale.
If a script, swiftrepl for example, were running for minutes or hours on stale config, it might be syncing containers in the wrong direction. It might see new files that exist only in the "destination" as needing to be deleted, when in fact they should be *created* in the "source", because source and destination are reversed under the stale config. This would cause data loss. To some extent this is a problem with any config change when you have long-running scripts, and even for short scripts it remains a problem in the split-brain case.

We need to make sure scripts/services are acting on up-to-date (within X seconds) config, or stop running, in order to avoid corruption. In the worst split-brain, ops admins can't even stop the scripts or change the config for servers on the minority side, so the scripts need to already handle this case themselves (by stopping). If X is high, we'd want to be careful to wait it out before a DC switchover.
Assuming etcd is spread over the WAN across 3 datacenters, so that no single datacenter failure can break quorum, there are a few strategies.
[method I] One strategy (see the sketch after this list) is to make sure:
a) The client using etcd for config uses quorum reads on startup (so it gets a consistent value or fails)
b) The client, if long-running, likewise periodically rechecks the config
c) The client aborts when the above fail (rather than catching errors or falling back to process-cached config or something)
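A minimal sketch of method I in Python, assuming the python-etcd client; the key path, host name, recheck interval, and the work-loop helpers are made-up placeholders, not anything that exists today:

```python
import sys
import time

import etcd  # python-etcd client, assumed to be the library in use

CONFIG_KEY = '/conf/dc_master'   # hypothetical key holding the authoritative-DC config
RECHECK_SECONDS = 10             # "X": how often a long-running script must re-verify

client = etcd.Client(host='conf1001.example.org', port=2379)


def read_config_or_abort():
    """Quorum read of the config key; abort the whole script on any failure."""
    try:
        # quorum=True routes the read through the etcd leader, so a client
        # talking to a member on the minority side of a split-brain gets an
        # error instead of a stale value.
        return client.read(CONFIG_KEY, quorum=True).value
    except etcd.EtcdException as exc:
        sys.stderr.write('etcd quorum read failed, aborting: %s\n' % exc)
        sys.exit(1)


def do_one_unit_of_work(config):
    time.sleep(1)                                 # placeholder for one bounded chunk of syncing


def main():
    config = read_config_or_abort()               # a) consistent read on startup, or fail
    last_check = time.time()
    while True:                                   # stand-in for the script's real work loop
        if time.time() - last_check > RECHECK_SECONDS:
            config = read_config_or_abort()       # b)+c) periodic recheck, abort on failure
            last_check = time.time()
        do_one_unit_of_work(config)


if __name__ == '__main__':
    main()
```

The quorum reads have to cross the WAN to the leader on every check, which is the latency cost method II below tries to avoid.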
[method II] Use heartbeats into the etcd dataspace and non-quorum reads (see the sketch after this list). Clients would abort if the config is too stale (more than X seconds old). This could lower read latency.
a) The client using etcd for config checks it on startup and aborts if stale
b) The client, if long-running, likewise periodically rechecks the config
c) The client aborts when the above fail (rather than catching errors or falling back to process-cached config or something)
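A sketch of method II, again assuming python-etcd. A separate heartbeat writer (which only succeeds while it can reach quorum) keeps refreshing a timestamp key; clients do cheap non-quorum reads and abort when the heartbeat is older than X seconds. The key names, host, and MAX_STALENESS are placeholders:

```python
import sys
import time

import etcd  # python-etcd client, assumed to be the library in use

HEARTBEAT_KEY = '/conf/heartbeat'   # hypothetical key a writer refreshes every few seconds
CONFIG_KEY = '/conf/dc_master'      # hypothetical config key
MAX_STALENESS = 30                  # "X" seconds

client = etcd.Client(host='conf1001.example.org', port=2379)


def heartbeat_writer():
    """Run somewhere trusted; writes require quorum, so they fail on the minority side."""
    while True:
        client.write(HEARTBEAT_KEY, str(time.time()))
        time.sleep(5)


def read_config_if_fresh():
    """Non-quorum (local) reads keep latency low; freshness comes from the heartbeat."""
    try:
        heartbeat = float(client.read(HEARTBEAT_KEY).value)
        config = client.read(CONFIG_KEY).value
    except etcd.EtcdException as exc:
        sys.stderr.write('etcd read failed, aborting: %s\n' % exc)
        sys.exit(1)
    age = time.time() - heartbeat
    if age > MAX_STALENESS:
        # The local etcd member hasn't replicated a recent heartbeat; we may be
        # on the minority side of a split-brain, so stop rather than act on it.
        sys.stderr.write('config is too stale (heartbeat %ds old), aborting\n' % int(age))
        sys.exit(1)
    return config
```

Note that comparing the heartbeat timestamp against the local clock assumes the hosts are reasonably NTP-synced; X has to be large relative to any expected clock skew.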
Daemons, of course, need to restart once the etcd problems resolve.
The most error-prone part seems to be the periodic checks in long-running scripts (e.g. swiftrepl); it seems easy to fail to account for how long certain parts of a script might take. Maybe hacking up a monitor daemon that kills etcd-dependent scripts whenever the config is stale would be less error-prone... though that raises the question of what kill level to use (what if SIGINT isn't enough?) and what might break with higher levels (e.g. kill -9).
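If we went the watchdog route, one rough, purely illustrative sketch of the kill step is below; the pidfile path and grace period are invented, and the staleness check would be the same one as in the sketches above. It sends SIGINT first and escalates to SIGKILL after a grace period, which is exactly where the "what breaks under kill -9" question would have to be answered per script:

```python
import os
import signal
import time

GRACE_SECONDS = 30                    # how long to wait after SIGINT before escalating
PIDFILE = '/var/run/swiftrepl.pid'    # hypothetical pidfile of the etcd-dependent script


def kill_if_stale(config_is_stale):
    """Watchdog step, called periodically by the monitor daemon."""
    if not config_is_stale():
        return
    pid = int(open(PIDFILE).read().strip())
    os.kill(pid, signal.SIGINT)       # polite first: give the script a chance to clean up
    deadline = time.time() + GRACE_SECONDS
    while time.time() < deadline:
        try:
            os.kill(pid, 0)           # signal 0 only checks whether the pid is still alive
        except OSError:
            return                    # it exited on its own
        time.sleep(1)
    os.kill(pid, signal.SIGKILL)      # escalate; may leave partial state behind
```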
*"authoritative" could mean "handles writes" or where things like local Cassandra quorums are directed, or where reads-for-writes go.
See also the following tasks which are about adopting Etcd in the current active-inactive model: