
Review alerting around Wikidata Query Service update pipeline
Open, Medium, Public, 5 Estimated Story Points

Description

Following the latest update lag incident on WDQS (T336134), alerting has been identified as an area for improvement. In particular:

  • low-priority alerts are triggered too often, leading to alert fatigue and to more serious alerts being ignored
  • alert messages are not clear enough about the actions that need to be taken
  • the priority of alerts is not always clear, leading to inappropriate responses

The alerts related to the WDQS update pipeline need to be improved, so that we can respond appropriately to future incidents.

In particular, we need:

  • an alert when the update lag of pooled servers is > 10 minutes (ignoring depooled servers); this alert should not page, but should make it clear that an action is needed from operators (a sketch of such a rule is included after this list)
  • a review of existing alerts around the update pipeline and its ingestion into WDQS, tuning alerting levels as needed so that only alerts requiring operator intervention are triggered
  • a review of alert messages about update lag so that it is clear what action is needed (depooling a server or depooling the cluster)
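
A minimal sketch of what the pooled-server lag alert could look like as a Prometheus alerting rule. The metric name (wdqs_update_lag_seconds) and the pooled label are assumptions made for illustration only; the real rule needs to use whatever the WDQS exporters actually expose and however pooled/depooled state is surfaced.

```
# Sketch only: the metric name and the "pooled" label are hypothetical.
groups:
  - name: wdqs_update_lag
    rules:
      - alert: WdqsPooledServersUpdateLagHigh
        # Lag above 10 minutes on pooled servers; depooled servers are ignored.
        expr: wdqs_update_lag_seconds{pooled="true"} > 600
        for: 5m
        labels:
          severity: warning  # notify operators, do not page
        annotations:
          summary: "Update lag on pooled WDQS server {{ $labels.instance }} is above 10 minutes"
          description: "Operator action needed: depool the affected server, or depool the cluster if the lag is widespread."
```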

AC:

  • alerts on update pipeline are identified (as a list in this ticket) and reviewed
    • alerts are triggered only when operator input is needed
    • messaging is clear about the action to be taken
  • an alert is raised when the pooled servers are lagging

Event Timeline

Quick notes here before I forget. Checking my "alerts" email folder for the past year (not the most reliable source), I have:

  • 89 alerts with title RdfStreamingUpdaterHighConsumerUpdateLag
  • 72 alerts with title RdfStreamingUpdaterFlinkProcessingLatencyIsHigh
  • 113 alerts with title RdfStreamingUpdaterFlinkJobUnstable

We'll need to do more in-depth analysis (with the Logstash alerts dashboard) to get a better idea of how many of these alerts indicated a real issue. A cursory look suggests that RdfStreamingUpdaterHighConsumerUpdateLag is our most relevant alert.

Revised totals for alerts in the last year after looking at Logstash:

  • RdfStreamingUpdaterHighConsumerUpdateLag: 373
  • RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: 63
  • RdfStreamingUpdaterFlinkJobUnstable: 125

The majority of all three alert types fired during the last outage. As previously stated, RdfStreamingUpdaterHighConsumerUpdateLag seems to be our best indicator, as it only triggered once outside the incident window.

The other two alerts seem to fire fairly frequently, but we'll need further analysis to determine how many of these alerts indicated a real issue.
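
If Prometheus retention covers the window we care about, one way to cross-check these counts, assuming the standard ALERTS_FOR_STATE series is available, is a PromQL query like the sketch below; the Logstash alerts dashboard remains the more complete source for older data.

```
# Rough count of distinct activations of an alert over the last 90 days.
# ALERTS_FOR_STATE stores the activation timestamp, so each new firing changes its value.
changes(ALERTS_FOR_STATE{alertname="RdfStreamingUpdaterFlinkJobUnstable"}[90d])
```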

Per today's SRE meeting, the larger SRE org is working on a comprehensive alert review. We should work with the SREs to help out and use their methods to review our own alerts.

Gehel triaged this task as Medium priority. Jul 21 2023, 9:27 AM