Investigate why 70% of WDQS EQIAD hosts became unresponsive within a few minutes of each other
Closed, Declined · Public

Assigned To

Authored By

	bking
	Aug 13 2024, 9:55 PM

Description

Between ~2118 and ~2141 UTC on 13 Aug 2024, WDQS hosts wdqs101[3-5],1018,1020 all alerted for SSH timeouts. All were unresponsive until I rebooted them from their management interface.
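For readers unfamiliar with the host shorthand above, here is a minimal sketch of how `wdqs101[3-5],1018,1020` expands into individual host names. The helper `expand_hosts` is hypothetical and written only for illustration. It assumes that a bare number such as `1018` inherits the `wdqs` prefix from the preceding item, which is how the shorthand reads in this ticket but is not necessarily how a real tool (for example Cumin or ClusterShell) would parse it.

```python
import re


def expand_hosts(expr: str) -> list[str]:
    """Expand informal shorthand like 'wdqs101[3-5],1018,1020' into host names.

    Assumption (not taken from any real tool): an item without a letter
    prefix reuses the prefix of the most recent item that had one.
    Limitation: only 'a-b' ranges are handled inside brackets, not 'a,b'.
    """
    hosts: list[str] = []
    prefix = ""
    for item in (part.strip() for part in expr.split(",")):
        # Case 1: optional name, optional digit stem, then a [lo-hi] range.
        ranged = re.fullmatch(r"([A-Za-z][A-Za-z-]*)?(\d*)\[(\d+)-(\d+)\]", item)
        if ranged:
            name, stem, lo, hi = ranged.groups()
            if name:
                prefix = name
            width = len(lo)  # keep zero padding, e.g. [08-10]
            for n in range(int(lo), int(hi) + 1):
                hosts.append(f"{prefix}{stem}{n:0{width}d}")
            continue
        # Case 2: optional name followed by a plain number.
        plain = re.fullmatch(r"([A-Za-z][A-Za-z-]*)?(\d+)", item)
        if not plain:
            raise ValueError(f"cannot parse host item: {item!r}")
        name, digits = plain.groups()
        if name:
            prefix = name
        hosts.append(f"{prefix}{digits}")
    return hosts


affected = expand_hosts("wdqs101[3-5],1018,1020")
print(affected)
```

Under that assumption, the expression covers five hosts: wdqs1013, wdqs1014, wdqs1015, wdqs1018 and wdqs1020.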

These hosts comprise the majority of public WDQS hosts in our active datacenter (EQIAD), as shown in the WDQS public load balancer config.

Because Wikipedia and its sister sites were in "a full editing outage" that started at around the same time (see T370304 for more details), we initially believed the WDQS outage was related. But Wikipedia suffered another incident yesterday (14 Aug) at around the same time (2100 UTC), and WDQS did not fall over. This suggests the two incidents were unrelated.

That being said, we still need to spend some time investigating this outage.

Creating this ticket to:

Investigate this outage
If possible, make changes to prevent it from happening again

Details

Other Assignee: RKemper

Related Objects

Mentioned Here: T370304: Bursts of occasional severe contention on s4 (commonswiki) primary mariadb causing recurrent user-facing outages on all wikis

Event Timeline

bking created this task. · Aug 13 2024, 9:55 PM

Restricted Application added a subscriber: Aklapper. · Aug 13 2024, 9:55 PM

Mentioned in SAL (#wikimedia-operations) [2024-08-13T21:56:20Z] <inflatador> bking@cumin2002 reboot wdqs101[3-5],1018,1020 from DRAC due to unresponsiveness T372442

bking renamed this task from Remediation for unresponsive WDQS hosts wdqs101[3-5],1018,1020 to Determine if WDQS was affected by wikipedia editing outage/consider protections in similiar future scenarios. · Aug 13 2024, 9:58 PM

bking updated the task description. · Aug 13 2024, 10:02 PM

Gehel triaged this task as Medium priority. · Aug 14 2024, 8:33 AM

Gehel raised the priority of this task from Medium to High.

Gehel moved this task from Incoming to Scratch on the Data-Platform-SRE board.

Gehel moved this task from Scratch to 2024.07.29 - 2024.08.16 on the Data-Platform-SRE board.

Gehel edited projects, added Data-Platform-SRE (2024.07.29 - 2024.08.16); removed Data-Platform-SRE.

Lydia_Pintscher added a project: Wikidata-Query-Service. · Aug 14 2024, 2:29 PM

Maintenance_bot added a project: Wikidata. · Aug 14 2024, 3:29 PM

dr0ptp4kt subscribed. · Aug 15 2024, 8:14 PM

Wikipedia suffered another incident yesterday at around the same time (2100 UTC), but WDQS did not fall over, so the two events are probably unrelated. We should still investigate what happened with WDQS, but it doesn't look like the larger incident was the cause. I'll update this task to reflect this.

bking renamed this task from Determine if WDQS was affected by wikipedia editing outage/consider protections in similiar future scenarios to Investigate why 70% of WDQS EQIAD hosts became unresponsive within a few minutes of each other. · Aug 15 2024, 10:01 PM

bking claimed this task.

bking updated the task description.

bking updated Other Assignee, added: RKemper.

bking updated the task description.

Gehel edited projects, added Data-Platform-SRE (2024.08.17 - 2024.09.06); removed Data-Platform-SRE (2024.07.29 - 2024.08.16). · Aug 16 2024, 9:45 AM

Gehel moved this task from Incoming to Operations/SRE on the Wikidata-Query-Service board. · Aug 19 2024, 3:31 PM

mdaniels5757 subscribed. · Aug 21 2024, 6:54 PM

Gehel moved this task from Backlog - project to Backlog - operations on the Data-Platform-SRE (2024.08.17 - 2024.09.06) board. · Sep 3 2024, 3:12 PM

Gehel edited projects, added Data-Platform-SRE (2024.09.06 - 2024.09.27); removed Data-Platform-SRE (2024.08.17 - 2024.09.06). · Sep 6 2024, 9:49 AM

Gehel moved this task from Backlog - project to Backlog - operations on the Data-Platform-SRE (2024.09.06 - 2024.09.27) board.