Change Details

Between ~2118 and ~2141 UTC on 13 Aug 2024, WDQS hosts `wdqs101[3-5],1018,1020` all alerted for SSH timeouts. These hosts comprise the majority of public WDQS hosts in our active datacenter (EQIAD), as shown in the [[ https://config-master.wikimedia.org/pybal/eqiad/wdqs | WDQS public load balancer config ]].

Because Wikipedia and its sister sites were in "a full editing outage" which started at around the same time (ref T370304 for more details), we initially believed the WDQS outage was related. But Wikipedia suffered another incident yesterday (14 Aug) at around the same time (2100 UTC), and WDQS did not fall over. This suggests the two incidents were unrelated.

That being said, we still need to spend some time investigating this outage. Creating this ticket to:

- Investigate this outage
- If possible, make changes to prevent it from happening again