Most of MediaWiki was written with the assumption that MediaWiki and its data stores are collocated and connected via reliable, low-latency network links. Until recently, MediaWiki had minimal facilities for maintaining consistency and partition tolerance across wide-area network links. As a result, although the Wikimedia Foundation operates data centers in multiple locations, we only run MediaWiki in one location at any one time.
This has several practical consequences: first, we are not as fault-tolerant as we'd like to be. We have a secondary data center with enough capacity to serve our traffic in case our primary datacenter goes down, but it is in cold standby, meaning it takes some time (and some manual effort) to get it running. Second, site performance is poor for logged-in users that are geographically remote from Ashburn, Virginia, due to the time it takes to transmit and receive data across long-distance links. Thirdly, in some basic cases, like parsing pages, the master database must be up, leading to a SPOF.
It's going to take a lot of work to fix this completely, but we are getting closer to being able to serve some traffic from secondary datacenter. Specifically, we would like to serve "reads" -- requests that don't require a master database connection -- from a secondary datacenter.
In order to serve reads from a different datacenter, we need to be able to predict which incoming requests will modify data, so that we can route them accordingly. We need to be able to make this determination at the edge -- i.e., the outermost layers of the infrastructure, so it cannot be complicated or slow.
The solution we have is to use the HTTP request method (T91820): GETs/HEADs are read-only, while POSTs are not. This was already true for most cases, but there is a long tail of actions with side-effects that are done via GET, such as purge, rollback, markpatrolled.
This task mostly involves fixing DBPerformance log warnings. Warnings can be dealt with be:
a) Changing DB master reads to use DB slaves
b) Moving the database updates to POST requests, the jobqueue, or at least to post-send updates via DeferredUpdates
c) Disabling warnings for a few exceptional cases like CentralAuth.
See channel:DBPerformance on logstash.wikimedia.org
Most of these warnings are writes or master queries on HTTP GET requests, which would be cross DC in active-active setup for some user. Ideally we could eventually get these to zero.