
Review alerting around Wikidata Query Service update pipeline
Open, Medium, Public, 5 Estimated Story Points

Description

Following the latest update lag incident on WDQS (T336134), alerting has been identified as an area for improvement. In particular:

  • low-priority alerts are triggered too often, leading to alert fatigue and to more serious alerts being ignored
  • alert messages are not clear enough about the actions that need to be taken
  • the priority of alerts is not always clear, leading to inappropriate responses

The alerts related to the WDQS update pipeline need to be improved, so that we can respond appropriately to future incidents.

In particular, we need:

  • an alert when the update lag of pooled servers is > 10 minutes (ignoring depooled servers); this alert should not page, but should make it clear that an action is needed from operators (a sketch of such a rule is included after this list)
  • a review of existing alerts around the update pipeline and its ingestion into WDQS, tuning alerting levels as needed so that only alerts requiring operator intervention are triggered
  • a review of alert messages about update lag so that it is clear what action is needed (depooling a server or depooling the cluster)
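
A minimal sketch of what the pooled-server lag alert could look like as a Prometheus alerting rule. The metric name (wdqs_update_lag_seconds) and the pooled label are assumptions made for illustration only; the real rule needs to use whatever the WDQS exporters actually expose and however pooled/depooled state is surfaced.

```
# Sketch only: the metric name and the "pooled" label are hypothetical.
groups:
  - name: wdqs_update_lag
    rules:
      - alert: WdqsPooledServersUpdateLagHigh
        # Lag above 10 minutes on pooled servers; depooled servers are ignored.
        expr: wdqs_update_lag_seconds{pooled="true"} > 600
        for: 5m
        labels:
          severity: warning  # notify operators, do not page
        annotations:
          summary: "Update lag on pooled WDQS server {{ $labels.instance }} is above 10 minutes"
          description: "Operator action needed: depool the affected server, or depool the cluster if the lag is widespread."
```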

AC:

  • alerts on update pipeline are identified (as a list in this ticket) and reviewed
    • alerts are triggered only when operator input is needed
    • messaging is clear about the action to be taken
  • an alert is raised when the pooled servers are lagging

Event Timeline

Quick notes here before I forget. Checking my "alerts" email folder for the past year (not the most reliable source), I have:

  • 89 alerts with title RdfStreamingUpdaterHighConsumerUpdateLag
  • 72 alerts with title RdfStreamingUpdaterFlinkProcessingLatencyIsHigh
  • 113 alerts with title RdfStreamingUpdaterFlinkJobUnstable

We'll need to do more in-depth analysis (with the Logstash alerts dashboard) to get a better idea of how many of these alerts indicated a real issue. A cursory look suggests that RdfStreamingUpdaterHighConsumerUpdateLag is our most relevant alert.

Revised totals for alerts in the last year after looking at Logstash:

  • RdfStreamingUpdaterHighConsumerUpdateLag: 373
  • RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: 63
  • RdfStreamingUpdaterFlinkJobUnstable: 125

The majority of all three alert types fired during the last outage. As previously stated, RdfStreamingUpdaterHighConsumerUpdateLag seems to be our best indicator, as it only triggered once outside the incident window.

The other two alerts seem to fire fairly frequently, but we'll need further analysis to determine how many of these alerts indicated a real issue.
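
If Prometheus retention covers the window we care about, one way to cross-check these counts, assuming the standard ALERTS_FOR_STATE series is available, is a PromQL query like the sketch below; the Logstash alerts dashboard remains the more complete source for older data.

```
# Rough count of distinct activations of an alert over the last 90 days.
# ALERTS_FOR_STATE stores the activation timestamp, so each new firing changes its value.
changes(ALERTS_FOR_STATE{alertname="RdfStreamingUpdaterFlinkJobUnstable"}[90d])
```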

Per today's SRE meeting, the larger SRE org is working on a comprehensive alert review. We should work with the SREs to help out and use their methods to review our own alerts.

Gehel triaged this task as Medium priority. Jul 21 2023, 9:27 AM