
Investigate mediawiki outages end October 2023
Closed, ResolvedPublic

Description

There were outages on:

  • 26th @ 16:29-16:55
  • (no outages on the 27th)
  • 28th @ 14:22-16:17 - repeated flapping here
  • 29th @ 22:13-22:15 - a quick outage and recovery

We observed on the 27th that the outage was caused by OOMKilled events happening on both mediawiki pods simultaneously.

We also saw a number of SQL errors likely due to mediawiki hanging up.

The root cause seems to have been higher levels of traffic (including traffic doing more expensive requests, e.g. diffing), which led to MediaWiki exceeding its memory reservations and being OOM-killed by Kubernetes. As this happened on all replicas at the same time (after one went down, the other one would instantly be overloaded), it created time windows in which no pod was ready, thus causing a service outage. At extreme peaks, we have also seen pods go into CrashLoopBackOff because of this.
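For illustration, here is a minimal sketch of the Kubernetes resource semantics at play (names, image and numbers are made up, not taken from wbaas-deploy): a container that exceeds its memory limit is OOM-killed, and with only two replicas a simultaneous kill on both leaves no ready pod behind the Service.

```yaml
# Illustrative sketch only; not the actual wbaas-deploy manifest.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mediawiki
spec:
  replicas: 2                      # both pods OOM-killed at once => no ready endpoints => outage
  selector:
    matchLabels:
      app: mediawiki
  template:
    metadata:
      labels:
        app: mediawiki
    spec:
      containers:
        - name: mediawiki
          image: mediawiki:example # placeholder image
          resources:
            requests:
              memory: "512Mi"      # what the scheduler reserves on the node
            limits:
              memory: "1Gi"        # exceeding this gets the container OOM-killed
          readinessProbe:          # a killed/restarting pod drops out of the Service
            httpGet:
              path: /wiki/Main_Page
              port: 80
```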

Event Timeline

Tarrow renamed this task from Investigate outage 2023-10-26 to Investigate mediawiki outages end October 2023.Oct 30 2023, 9:05 PM
Tarrow updated the task description. (Show Details)

Trying to correlate with when we saw OOM kills of apache by querying the logs with something like:

OOM
jsonPayload.reason="OOMKilling"
apache

This shows the following graph, which correlates reasonably well with the times we saw issues, but doesn't explain the events on the 25th / 27th unless those were resolved too quickly to trigger a failing uptime check.

image.png (256×1 px, 15 KB)

We've increased the replicas to 3 and also bumped the memory limit.
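In Helm-values terms, that change looks roughly like this (key names and numbers are illustrative, not the actual wbaas-deploy values):

```yaml
# Illustrative values-style snippet, not the real wbaas-deploy values file.
mediawiki:
  replicaCount: 3        # was 2
  resources:
    limits:
      memory: "1536Mi"   # bumped memory limit (example figure)
```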

Fring removed Tarrow as the assignee of this task.Nov 9 2023, 9:03 AM
Fring moved this task from Doing to Done on the Wikibase Cloud (Kanban board Q4 2023) board.

The system seems stable as of now (Thursday morning), so I am closing this ticket.

What happened on the way here:

  • after running 3 replicas with more memory, we saw the SQL instances start refusing connections, as too many clients were trying to open connections at once
  • we increased the number of open connections the SQL instances would allow
  • we would still see connections being refused, albeit fewer
  • to reduce the pressure on SQL, the number of mediawiki replicas was reduced to 2 again; memory "requests" were also increased to make sure memory pressure on the node would not kill the pods
  • the system ran somewhat more stably, but once CPU usage got higher we would still see memory usage skyrocket and pods get killed
  • we increased the CPU limits and requests further (https://github.com/wmde/wbaas-deploy/pull/1248)
  • now that CPU never comes close to saturation, memory usage stays in check and pods no longer crash, even under high load (see the sketch after this list)
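Taken together, the end state described above would look roughly like the following values sketch (key names and numbers are illustrative; the real values live in wmde/wbaas-deploy, e.g. the PR linked above):

```yaml
# Illustrative end-state sketch; real values live in wmde/wbaas-deploy.
mediawiki:
  replicaCount: 2          # back to 2 to ease connection pressure on SQL
  resources:
    requests:
      cpu: "1"             # raised so CPU never approaches saturation
      memory: "1Gi"        # raised so node memory pressure does not kill the pod
    limits:
      cpu: "2"
      memory: "1536Mi"

sql:
  maxConnections: 200      # hypothetical key: the SQL instances now allow more client connections
```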
Evelien_WMDE claimed this task.