Scripts that use config that switches based on which datacenter is authoritative* need to handle the case where they end up on the minority side of an etcd split-brain, which causes their view of the config to go stale.
If a script, swiftrepl for example, were running for minutes or hours on stale config, it might be syncing containers in the wrong direction. It might see new files that exist only in the "destination" as needing to be deleted, when in fact they should be *created* in the "source", because source and destination are reversed under the stale config. This would cause data loss. To some extent this is a problem with any config change when you have long-running scripts, and even for short scripts it remains a problem in the split-brain case.

We need to make sure scripts/services are acting on up-to-date (within X seconds) config, or stop running, in order to avoid corruption. In the worst split-brain, ops admins can't even stop the scripts or change the config for servers on the minority side, so the scripts need to already handle this case themselves (by stopping). If X is high, we'd want to be careful to wait it out before a DC switchover.
Assuming etcd is spread over the WAN across 3 datacenters, so that no single datacenter failure can break quorum, there are a few strategies.
[method I] One strategy (see the sketch after this list) is to make sure:
a) The client using etcd for config uses quorum reads on startup (so it gets a consistent value or fails)
b) The client, if long-running, likewise periodically rechecks the config
c) The client aborts when the above fail (rather than catching errors or falling back to process-cached config or something)
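A minimal sketch of method I in Python, assuming the python-etcd client; the key path, host name, recheck interval, and the work-loop helpers are made-up placeholders, not anything that exists today:

```python
import sys
import time

import etcd  # python-etcd client, assumed to be the library in use

CONFIG_KEY = '/conf/dc_master'   # hypothetical key holding the authoritative-DC config
RECHECK_SECONDS = 10             # "X": how often a long-running script must re-verify

client = etcd.Client(host='conf1001.example.org', port=2379)


def read_config_or_abort():
    """Quorum read of the config key; abort the whole script on any failure."""
    try:
        # quorum=True routes the read through the etcd leader, so a client
        # talking to a member on the minority side of a split-brain gets an
        # error instead of a stale value.
        return client.read(CONFIG_KEY, quorum=True).value
    except etcd.EtcdException as exc:
        sys.stderr.write('etcd quorum read failed, aborting: %s\n' % exc)
        sys.exit(1)


def do_one_unit_of_work(config):
    time.sleep(1)                                 # placeholder for one bounded chunk of syncing


def main():
    config = read_config_or_abort()               # a) consistent read on startup, or fail
    last_check = time.time()
    while True:                                   # stand-in for the script's real work loop
        if time.time() - last_check > RECHECK_SECONDS:
            config = read_config_or_abort()       # b)+c) periodic recheck, abort on failure
            last_check = time.time()
        do_one_unit_of_work(config)


if __name__ == '__main__':
    main()
```

The quorum reads have to cross the WAN to the leader on every check, which is the latency cost method II below tries to avoid.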
[method II] Use heartbeats into the etcd dataspace and non-quorum reads (see the sketch after this list). Clients would abort if the config is too stale (more than X seconds old). This could lower read latency.
a) The client using etcd for config checks it on startup and aborts if stale
b) The client, if long-running, likewise periodically rechecks the config
c) The client aborts when the above fail (rather than catching errors or falling back to process-cached config or something)
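A sketch of method II, again assuming python-etcd. A separate heartbeat writer (which only succeeds while it can reach quorum) keeps refreshing a timestamp key; clients do cheap non-quorum reads and abort when the heartbeat is older than X seconds. The key names, host, and MAX_STALENESS are placeholders:

```python
import sys
import time

import etcd  # python-etcd client, assumed to be the library in use

HEARTBEAT_KEY = '/conf/heartbeat'   # hypothetical key a writer refreshes every few seconds
CONFIG_KEY = '/conf/dc_master'      # hypothetical config key
MAX_STALENESS = 30                  # "X" seconds

client = etcd.Client(host='conf1001.example.org', port=2379)


def heartbeat_writer():
    """Run somewhere trusted; writes require quorum, so they fail on the minority side."""
    while True:
        client.write(HEARTBEAT_KEY, str(time.time()))
        time.sleep(5)


def read_config_if_fresh():
    """Non-quorum (local) reads keep latency low; freshness comes from the heartbeat."""
    try:
        heartbeat = float(client.read(HEARTBEAT_KEY).value)
        config = client.read(CONFIG_KEY).value
    except etcd.EtcdException as exc:
        sys.stderr.write('etcd read failed, aborting: %s\n' % exc)
        sys.exit(1)
    age = time.time() - heartbeat
    if age > MAX_STALENESS:
        # The local etcd member hasn't replicated a recent heartbeat; we may be
        # on the minority side of a split-brain, so stop rather than act on it.
        sys.stderr.write('config is too stale (heartbeat %ds old), aborting\n' % int(age))
        sys.exit(1)
    return config
```

Note that comparing the heartbeat timestamp against the local clock assumes the hosts are reasonably NTP-synced; X has to be large relative to any expected clock skew.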
Daemons, of course, need to restart once the etcd problems resolve.
The most error-prone part seems to be the periodic checks in long-running scripts (e.g. swiftrepl); it seems easy to fail to account for how long certain parts of a script might take. Maybe hacking up a monitor daemon that kills etcd-dependent scripts whenever the config is stale would be less error-prone... though that raises the question of what kill level to use (what if SIGINT isn't enough?) and what might break with higher levels (e.g. kill -9).
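If we went the watchdog route, one rough, purely illustrative sketch of the kill step is below; the pidfile path and grace period are invented, and the staleness check would be the same one as in the sketches above. It sends SIGINT first and escalates to SIGKILL after a grace period, which is exactly where the "what breaks under kill -9" question would have to be answered per script:

```python
import os
import signal
import time

GRACE_SECONDS = 30                    # how long to wait after SIGINT before escalating
PIDFILE = '/var/run/swiftrepl.pid'    # hypothetical pidfile of the etcd-dependent script


def kill_if_stale(config_is_stale):
    """Watchdog step, called periodically by the monitor daemon."""
    if not config_is_stale():
        return
    pid = int(open(PIDFILE).read().strip())
    os.kill(pid, signal.SIGINT)       # polite first: give the script a chance to clean up
    deadline = time.time() + GRACE_SECONDS
    while time.time() < deadline:
        try:
            os.kill(pid, 0)           # signal 0 only checks whether the pid is still alive
        except OSError:
            return                    # it exited on its own
        time.sleep(1)
    os.kill(pid, signal.SIGKILL)      # escalate; may leave partial state behind
```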
*"authoritative" could mean "handles writes" or where things like local Cassandra quorums are directed, or where reads-for-writes go.
See also the following tasks which are about adopting Etcd in the current active-inactive model: