Between ~2118 and ~2141 UTC on 13 Aug 2024, WDQS hosts wdqs101[3-5],1018,1020 all alerted for SSH timeouts. All were unresponsive until I rebooted them from their management interface.
These hosts make up the majority of the public WDQS hosts in our active datacenter (EQIAD), as shown in the WDQS public load balancer config.
Because Wikipedia and its sister sites were in "a full editing outage" that started at around the same time (see T370304 for more details), we initially believed the WDQS outage was related. But Wikipedia suffered another incident yesterday (14 Aug) at around the same time (2100 UTC), and WDQS did not fall over. This suggests the two incidents were unrelated.
That being said, we still need to spend some time investigating this outage.
Creating this ticket to:
- Investigate this outage
- If possible, make changes to prevent it from happening again