Between ~2118 and ~2141 UTC on 13 Aug 2024, WDQS hosts `wdqs101[3-5],1018,1020` all alerted for SSH timeouts. These hosts comprise the majority of public WDQS hosts in our active datacenter (EQIAD), as shown in the [[ https://config-master.wikimedia.org/pybal/eqiad/wdqs | WDQS public load balancer config ]].
Because Wikipedia and its sister sites were in "a full editing outage" which started at around the same time (ref T370304 for more details), we initially believed the WDQS outage was related. But Wikipedia suffered another incident yesterday (14 Aug) at around the same time (2100 UTC), and WDQS did not fall over. This suggests the two incidents were unrelated.
That being said, we still need to spend some time investigating this outage.
Creating this ticket to:
- Investigate this outage
- If possible, make changes to prevent it from happening again