Following the latest update lag incident on WDQS (T336134), alerting has been identified as an area for improvement. In particular:
- low-priority alerts are triggered too often, leading to alert fatigue and to more serious alerts being ignored
- alert messages are not clear enough about the actions that need to be taken
- the priority of alerts is not always clear, leading to inappropriate responses
The alerts related to the WDQS update pipeline need to be improved so that we respond appropriately to future incidents.
In particular, we need:
- an alert when the update lag of pooled servers is > 10 minutes (ignoring depooled servers). This alert should not page, but should make it clear that action is needed from operators
- a review of existing alerts around the update pipeline and its ingestion into WDQS, tuning alerting levels as needed so that only alerts that require operator intervention are triggered
- a review of alert messages about update lag, so that it is clear which action is needed (depooling a server or depooling a cluster)
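
As a starting point for discussion, the pooled-server lag alert described above could be sketched as follows. This is only an illustration of the intended logic, not the actual alerting configuration: the `ServerStatus` type, the `build_alert` function, the host names, and the fields of the returned dictionary are all hypothetical. The only value taken from this ticket is the 10 minute threshold.

```python
from dataclasses import dataclass
from typing import Optional

# 10 minutes, as stated in the requirement above.
LAG_THRESHOLD_SECONDS = 10 * 60


@dataclass
class ServerStatus:
    """Hypothetical snapshot of one WDQS server (names are assumptions)."""
    host: str
    pooled: bool
    lag_seconds: float


def build_alert(servers: list[ServerStatus],
                threshold: float = LAG_THRESHOLD_SECONDS) -> Optional[dict]:
    """Return a non-paging alert if any pooled server exceeds the lag threshold.

    Depooled servers are ignored entirely, since their lag does not
    affect users. Returns None when no operator action is needed.
    """
    lagging = sorted(
        s.host for s in servers if s.pooled and s.lag_seconds > threshold
    )
    if not lagging:
        return None
    return {
        # The alert should not page, but must state the expected action.
        "page": False,
        "hosts": lagging,
        "summary": (
            f"WDQS update lag above {int(threshold // 60)} minutes on pooled "
            f"servers: {', '.join(lagging)}. Action needed: depool the "
            "lagging server(s), or the cluster if all servers are affected."
        ),
    }
```

The point of the sketch is that filtering on pooled state happens before the threshold check, so a depooled server that is catching up never triggers the alert, and that the message names the expected operator action directly.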
AC:
- alerts on the update pipeline are identified (as a list in this ticket) and reviewed
- alerts are triggered only when operator input is needed
- messaging is clear about the action to be taken
- an alert is raised when the pooled servers are lagging