
Investigate mediawiki outages end October 2023
Closed, ResolvedPublic

Description

There were outages on:

  • 26th @ 16:29-16:55
  • (no outages on the 27th)
  • 28th @ 14:22-16:17 - repeated flapping here
  • 29th @ 22:13-22:15 - a quick outage and recovery

We observed on the 27th that the outage was caused by OOMKilled events happening on both mediawiki pods simultaneously.

We also saw a number of SQL errors likely due to mediawiki hanging up.

The root cause seems to have been higher levels of traffic (including traffic doing more expensive requests, e.g. diffing), which led to MediaWiki exceeding its memory reservations and being OOM-killed by Kubernetes. As this happened on all replicas at the same time (after one went down, the other one would instantly be overloaded), it created time windows in which no pod was ready, thus causing a service outage. At extreme peaks, we have also seen pods go into CrashLoopBackOff because of this.
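For illustration, here is a minimal sketch of the Kubernetes resource semantics at play (names, image and numbers are made up, not taken from wbaas-deploy): a container that exceeds its memory limit is OOM-killed, and with only two replicas a simultaneous kill on both leaves no ready pod behind the Service.

```yaml
# Illustrative sketch only; not the actual wbaas-deploy manifest.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mediawiki
spec:
  replicas: 2                      # both pods OOM-killed at once => no ready endpoints => outage
  selector:
    matchLabels:
      app: mediawiki
  template:
    metadata:
      labels:
        app: mediawiki
    spec:
      containers:
        - name: mediawiki
          image: mediawiki:example # placeholder image
          resources:
            requests:
              memory: "512Mi"      # what the scheduler reserves on the node
            limits:
              memory: "1Gi"        # exceeding this gets the container OOM-killed
          readinessProbe:          # a killed/restarting pod drops out of the Service
            httpGet:
              path: /wiki/Main_Page
              port: 80
```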

Event Timeline

Tarrow renamed this task from Investigate outage 2023-10-26 to Investigate mediawiki outages end October 2023.Oct 30 2023, 9:05 PM
Tarrow updated the task description. (Show Details)

Trying to correlate with when we saw OOM kills of apache by querying the logs with something like:

OOM
jsonPayload.reason="OOMKilling"
apache

This shows the following graph, which correlates reasonably well with the times we saw issues, but doesn't explain the events on the 25th / 27th unless those were resolved too quickly to trigger a failing uptime check.

image.png (256×1 px, 15 KB)

We've increased the replicas to 3 and also bumped the memory limit.
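In Helm-values terms, that change looks roughly like this (key names and numbers are illustrative, not the actual wbaas-deploy values):

```yaml
# Illustrative values-style snippet, not the real wbaas-deploy values file.
mediawiki:
  replicaCount: 3        # was 2
  resources:
    limits:
      memory: "1536Mi"   # bumped memory limit (example figure)
```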

Fring removed Tarrow as the assignee of this task.Nov 9 2023, 9:03 AM
Fring moved this task from Doing to Done on the Wikibase Cloud (Kanban board Q4 2023) board.

The system seems stable as of now (Thursday morning), so I am closing this ticket.

What happened on the way here:

  • after running 3 replicas with more memory, we saw the SQL instances start refusing connections, as too many clients were trying to open connections at once
  • we increased the number of open connections the SQL instances would allow
  • we would still see connections being refused, albeit fewer
  • to reduce the pressure on SQL, the number of mediawiki replicas was reduced to 2 again; memory "requests" were also increased to make sure memory pressure on the node would not kill the pods
  • the system ran somewhat more stably, but once CPU usage got higher we would still see memory usage skyrocket and pods get killed
  • we increased the CPU limits and requests further (https://github.com/wmde/wbaas-deploy/pull/1248)
  • now that CPU never comes close to saturation, memory usage stays in check and pods no longer crash, even under high load (see the sketch after this list)
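Taken together, the end state described above would look roughly like the following values sketch (key names and numbers are illustrative; the real values live in wmde/wbaas-deploy, e.g. the PR linked above):

```yaml
# Illustrative end-state sketch; real values live in wmde/wbaas-deploy.
mediawiki:
  replicaCount: 2          # back to 2 to ease connection pressure on SQL
  resources:
    requests:
      cpu: "1"             # raised so CPU never approaches saturation
      memory: "1Gi"        # raised so node memory pressure does not kill the pod
    limits:
      cpu: "2"
      memory: "1536Mi"

sql:
  maxConnections: 200      # hypothetical key: the SQL instances now allow more client connections
```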
Evelien_WMDE claimed this task.