There were outages on:
- 26th @1629-1655
- 27th - no outages
- 28th @1422-1617 - repeated flapping here
- 29th @2213-2215 - pretty quick outage and recovery.
We observed on the 27th that the outage was caused by both MediaWiki pods being OOMKilled simultaneously.
We also saw a number of SQL errors, likely due to MediaWiki hanging up.
The root cause seems to have been higher levels of traffic, including traffic making more expensive requests (e.g. diffing), which led to the MediaWiki pods exceeding their memory reservations and being OOM killed by Kubernetes. As this happened on all replicas at the same time (once one went down, the other was overloaded instantly), there were time windows in which no pod was ready, which caused a service outage. At extreme peaks, we also saw pods go into CrashLoopBackOff as a result.
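
To illustrate the cascade described above, here is a rough sketch of how losing one replica can take out the rest. It is a toy model, not a measurement: the function name `surviving_pods`, the even split of memory demand across pods, and all numbers are assumptions made up for illustration.

```python
def surviving_pods(replicas: int, demand_mb: float, limit_mb: float,
                   initial_failures: int = 1) -> int:
    """Toy model of a cascading OOM kill (all inputs are hypothetical).

    Assumes total memory demand is split evenly across the pods that are
    still alive, and that a pod is OOM killed as soon as its share
    exceeds limit_mb.
    """
    # Start from the state after the first pod(s) have already been killed.
    alive = replicas - initial_failures
    # Each further kill pushes the same demand onto fewer pods.
    while alive > 0 and demand_mb / alive > limit_mb:
        alive -= 1
    return max(alive, 0)


# Made-up numbers: 1500 MB of total demand, 1024 MB per pod.
print(surviving_pods(replicas=2, demand_mb=1500, limit_mb=1024))
```

With two replicas and these made-up numbers, each pod is fine at 750 MB, but once one is killed the other has to absorb the full 1500 MB and is killed too, leaving no pod ready. That is the same pattern as the outage windows above.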