Investigate why 70% of WDQS EQIAD hosts became unresponsive within a few minutes of each other
Closed, Declined
Public

Description

Between ~2118 and ~2141 UTC on 13 Aug 2024, WDQS hosts wdqs101[3-5],1018,1020 all alerted for SSH timeouts. All were unresponsive until I rebooted them from their management interface.

These hosts comprise the majority of public WDQS hosts in our active datacenter (EQIAD), as shown in the WDQS public load balancer config.

Because Wikipedia and its sister sites were in "a full editing outage" that started at around the same time (ref T370304 for more details), we initially believed the WDQS outage was related. But Wikipedia suffered another incident yesterday (14 Aug) at around the same time (2100 UTC), and WDQS did not fall over. This suggests the two incidents were unrelated.

That being said, we still need to spend some time investigating this outage.

Creating this ticket to:

  • Investigate this outage
  • If possible, make changes to prevent it from happening again

Details

Other Assignee
RKemper

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2024-08-13T21:56:20Z] <inflatador> bking@cumin2002 reboot wdqs101[3-5],1018,1020 from DRAC due to unresponsiveness T372442

bking renamed this task from Remediation for unresponsive WDQS hosts wdqs101[3-5],1018,1020 to Determine if WDQS was affected by wikipedia editing outage/consider protections in similiar future scenarios. Aug 13 2024, 9:58 PM
Gehel triaged this task as Medium priority. Aug 14 2024, 8:33 AM
Gehel raised the priority of this task from Medium to High.
Gehel moved this task from Incoming to Scratch on the Data-Platform-SRE board.
Gehel moved this task from Scratch to 2024.07.29 - 2024.08.16 on the Data-Platform-SRE board.

As Wikipedia suffered another incident yesterday at around the same time (2100 UTC), but WDQS did not fall over, these events are probably unrelated. We should still investigate what happened with WDQS. I'll update this task to reflect this.

bking renamed this task from Determine if WDQS was affected by wikipedia editing outage/consider protections in similiar future scenarios to Investigate why 70% of WDQS EQIAD hosts became unresponsive within a few minutes of each other. Aug 15 2024, 10:01 PM
bking claimed this task.
bking updated the task description.
bking updated Other Assignee, added: RKemper.
bking updated the task description.
Gehel subscribed.

Let's reopen if the problem comes back.