
Sustainability (Tag)
Active · Public

Details

Description

Tasks relating to the stability and availability of Wikimedia Foundation production services. (Unrelated to environmental sustainability.)

Recent Activity

Oct 22 2024

Dzahn added a comment to T309162: Remove old scap repositories from deploy1002.

This was long forgotten. The problem is that when a Scap::Target is removed from Puppet, it is not necessarily cleaned up from the deployment server.

Oct 22 2024, 6:31 PM · Release-Engineering-Team (Radar), collaboration-services, SRE, SRE-OnFire, Sustainability
elukey closed T234234: Port architecture of irc-recentchanges to Kafka, a subtask of T128592: Add redundancy to IRC recent changes service, as Resolved.
Oct 22 2024, 10:39 AM · Sustainability, SRE, codfw-rollout

Oct 21 2024

Maintenance_bot moved T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication from In progress to Done on the DBA board.
Oct 21 2024, 2:29 PM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA
jcrespo closed T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication as Resolved.

Done: https://wikitech.wikimedia.org/wiki/Incidents/2024-09-18_replication

Oct 21 2024, 1:35 PM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA
jcrespo added a comment to T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication.

Deployment went well, I will update the incident doc with the long-term fix and then call this resolved.

Oct 21 2024, 12:52 PM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA
Maintenance_bot removed a project from T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication: Patch-For-Review.
Oct 21 2024, 10:31 AM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA
gerritbot added a comment to T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication.

Change #1081103 merged by Jcrespo:

[operations/puppet@production] mariadb: Default pt-heartbeat to STATEMENT-based replication

https://gerrit.wikimedia.org/r/1081103

Oct 21 2024, 10:12 AM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA

Oct 18 2024

gerritbot added a comment to T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication.

Change #1079537 merged by jenkins-bot:

[operations/cookbooks@master] sre.switchdc.databases: allow to select a section

https://gerrit.wikimedia.org/r/1079537

Oct 18 2024, 10:17 AM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA
gerritbot added a comment to T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication.

Change #1079536 merged by jenkins-bot:

[operations/cookbooks@master] sre.switchdc.databases.prepare: add binlog check

https://gerrit.wikimedia.org/r/1079536

Oct 18 2024, 10:16 AM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA
jcrespo updated subscribers of T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication.

Let's carefully merge https://gerrit.wikimedia.org/r/1081103 early next week (so we can monitor that it does not affect production hosts). CC @Ladsgroup @ABran-WMF

Oct 18 2024, 10:11 AM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA
gerritbot added a comment to T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication.

Change #1074127 merged by jenkins-bot:

[operations/cookbooks@master] sre.switchdc.databases.prepare: add check

https://gerrit.wikimedia.org/r/1074127

Oct 18 2024, 7:57 AM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA

Oct 17 2024

jcrespo added a comment to T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication.

The (potential) change that caused it was: https://gerrit.wikimedia.org/r/c/operations/puppet/+/693162

Oct 17 2024, 1:24 PM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA
gerritbot added a comment to T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication.

Change #1081103 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mariadb: Default pt-heartbeat STATEMENT-based replication

https://gerrit.wikimedia.org/r/1081103

Oct 17 2024, 10:55 AM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA

Oct 16 2024

jcrespo added a comment to T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication.

Thanks Riccardo, as I said on IRC it looked like a minor issue so I wasn't too worried, and it was. You are very fast at this, so big ❤ to you. Will continue testing, probably tomorrow morning.

Oct 16 2024, 6:30 PM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA
Volans added a comment to T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication.

The cause of that dry-run failure was the new check that replication on MASTER_FROM from MASTER_TO is working, added here

Oct 16 2024, 5:07 PM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA
jcrespo added a comment to T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication.

I need to research the change at line 255 further:

self._validate_slave_status(f"MASTER_TO {self.master_to.host}", status, expected)
Oct 16 2024, 2:32 PM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA
jcrespo added a comment to T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication.
$ test-cookbook -c 1079537 --dry-run sre.switchdc.databases.prepare --section test-s4 -t T375144 eqiad codfw
...
DRY-RUN: MASTER_TO db2230.codfw.wmnet Ignoring MASTER STATUS is not stable in DRY-RUN
DRY-RUN: [test-s4] Binlog format is STATEMENT. No heartbeat corrective action needed.
DRY-RUN: [test-s4] MASTER_FROM db1125.eqiad.wmnet CHANGE MASTER to ReplicationInfo(primary='db2230.codfw.wmnet', binlog='db2230-bin.000005', position=201650676, port=3306)
DRY-RUN: MASTER_FROM db1125.eqiad.wmnet CHANGE MASTER to ReplicationInfo(primary='db2230.codfw.wmnet', binlog='db2230-bin.000005', position=201650676, port=3306) and user repl2024
DRY-RUN: Executing commands ['/usr/local/bin/mysql --socket /run/mysqld/mysqld.sock --batch --execute "START SLAVE"'] on 1 hosts: db1125.eqiad.wmnet
DRY-RUN: MASTER_FROM db1125.eqiad.wmnet START SLAVE
DRY-RUN: MASTER_FROM db1125.eqiad.wmnet skipping replication from MASTER_TO db2230.codfw.wmnet verification
DRY-RUN: Executing commands ['/bin/systemctl start pt-heartbeat-wikimedia.service'] on 1 hosts: db2230.codfw.wmnet
DRY-RUN: MASTER_TO db2230.codfw.wmnet started pt-heartbeat.
DRY-RUN: Executing commands ['/usr/local/bin/mysql --socket /run/mysqld/mysqld.sock --batch --execute "START SLAVE"'] on 1 hosts: db2230.codfw.wmnet
DRY-RUN: MASTER_TO db2230.codfw.wmnet START SLAVE.
DRY-RUN: Executing commands ['/usr/local/bin/mysql --socket /run/mysqld/mysqld.sock --batch --execute "SHOW SLAVE STATUS\\G"'] on 1 hosts: db2230.codfw.wmnet
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Master_Host=db1125.eqiad.wmnet
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Master_User=repl2024
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Master_Port=3306
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Slave_IO_Running=Yes
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Slave_SQL_Running=Yes
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Last_IO_Errno=0
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Last_SQL_Errno=0
DRY-RUN: MASTER_TO db2230.codfw.wmnet replication from MASTER_FROM db1125.eqiad.wmnet verified
DRY-RUN: Executing commands ['/usr/local/bin/mysql --socket /run/mysqld/mysqld.sock --batch --execute "SHOW SLAVE STATUS\\G"'] on 1 hosts: db1125.eqiad.wmnet
DRY-RUN: Failed to run cookbooks.sre.switchdc.databases.prepare.PrepareSection.master_from_check_replication: SHOW SLAVE STATUS seems to have been executed on a master.
DRY-RUN: Traceback
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/wmflib/interactive.py", line 183, in confirm_on_failure
    ret = func(*args, **kwargs)
  File "/home/jynus/cookbooks_testing/cookbooks/cookbooks/sre/switchdc/databases/prepare.py", line 331, in master_from_check_replication
    status = self.master_from.show_slave_status()
  File "/usr/lib/python3/dist-packages/spicerack/mysql_legacy.py", line 201, in show_slave_status
    raise MysqlLegacyError(f"{sql} seems to have been executed on a master.")
spicerack.mysql_legacy.MysqlLegacyError: SHOW SLAVE STATUS seems to have been executed on a master.
==> What do you want to do? "retry" the last command, manually fix the issue and "skip" the last command to continue the execution or completely "abort" the execution.
>
Oct 16 2024, 2:08 PM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA
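
The field-by-field SLAVE STATUS checks in the dry-run output above, and the failure when the status comes back empty, can be illustrated with a minimal sketch. This is not the spicerack or cookbook implementation: `ReplicationCheckError`, `parse_slave_status` and `validate_slave_status` are hypothetical names, and the parser assumes the usual `Key: value` layout of `SHOW SLAVE STATUS\G` output.

```python
class ReplicationCheckError(Exception):
    """Hypothetical error type used only in this sketch."""


def parse_slave_status(output: str) -> dict:
    """Parse `Key: value` lines, as printed by SHOW SLAVE STATUS\\G, into a dict."""
    status = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        # Row separator lines ("*** 1. row ***") contain no colon and are skipped.
        if sep:
            status[key.strip()] = value.strip()
    if not status:
        # A host with no replication configured returns no rows at all.
        raise ReplicationCheckError("no slave status returned; host is not a replica")
    return status


def validate_slave_status(status: dict, expected: dict) -> None:
    """Raise if any expected field is missing or has a different value."""
    mismatches = [
        f"{key}: expected {want!r}, got {status.get(key)!r}"
        for key, want in expected.items()
        if status.get(key) != want
    ]
    if mismatches:
        raise ReplicationCheckError("; ".join(mismatches))


if __name__ == "__main__":
    sample = (
        "*************************** 1. row ***************************\n"
        "    Master_Host: db1125.eqiad.wmnet\n"
        "    Slave_IO_Running: Yes\n"
        "    Slave_SQL_Running: Yes\n"
    )
    status = parse_slave_status(sample)
    validate_slave_status(status, {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes"})
    print("replication verified")
```

In a dry run the CHANGE MASTER command is only logged, not executed, so a status query against the host that was never reconfigured plausibly returns nothing, which corresponds to the empty-status branch in the sketch.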

Oct 11 2024

Volans added a comment to T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication.

@jcrespo I had it almost finished yesterday but then I had to step out. I've sent the patches. If you test with test-cookbook using the last one (1079537) as the CR, you'll also be testing all the other pending improvements that were done but not yet merged.
The last one also allows testing on a custom section, so you can pass --section test-s4 and it should do the right thing.

Oct 11 2024, 3:00 PM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA
gerritbot added a comment to T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication.

Change #1079537 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.switchdc.databases: allow to select a section

https://gerrit.wikimedia.org/r/1079537

Oct 11 2024, 2:56 PM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA
gerritbot added a comment to T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication.

Change #1079536 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.switchdc.databases.prepare: fix heartbeat

https://gerrit.wikimedia.org/r/1079536

Oct 11 2024, 2:56 PM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA

Oct 10 2024

jcrespo added a comment to T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication.

Yes, REPLACE is not the issue. It is ROW that translates it to UPDATE or DELETE + INSERT, and those would cause the same issues if done in the wrong case (but will do the right thing if the row was to be inserted randomly after select). We want to use REPLACE; even if we did INSERT IGNORE, it wouldn't fix things for replicas. The issue is the ROW behavior, not the query itself.

Oct 10 2024, 2:07 PM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA
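
The distinction drawn in the comment above can be shown with a toy model. It is only a sketch under stated assumptions: tables are plain dictionaries rather than MariaDB tables, and `RowNotFound`, `replay_statement` and `replay_row_update` are hypothetical names. It assumes the primary already holds the heartbeat row, so under ROW format the REPLACE reaches the replica as a change to an existing row.

```python
class RowNotFound(Exception):
    """Hypothetical stand-in for a replica failing to find the row to change."""


def replay_statement(replica: dict, server_id: int, ts: str) -> None:
    # STATEMENT format: the replica re-executes the REPLACE itself, so it
    # inserts the row when it is missing and overwrites it otherwise.
    replica[server_id] = ts


def replay_row_update(replica: dict, server_id: int, ts: str) -> None:
    # ROW format (assumption of this toy model): the event describes a change
    # to an existing row, so the replica must find that row first.
    if server_id not in replica:
        raise RowNotFound(f"no heartbeat row for server_id={server_id}")
    replica[server_id] = ts


if __name__ == "__main__":
    cleaned_replica = {}  # heartbeat table was cleaned up on this replica
    replay_statement(cleaned_replica, 1, "t1")
    print("STATEMENT replay:", cleaned_replica)

    cleaned_replica = {}
    try:
        replay_row_update(cleaned_replica, 1, "t1")
    except RowNotFound as exc:
        print("ROW replay failed:", exc)
```

In this model the query is identical in both cases. Only the replay mode decides whether a cleaned-up table breaks replication, which matches the point that the issue is the ROW behavior and not the REPLACE.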
Volans added a comment to T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication.

Thanks @jcrespo for the detailed request. I'll get to it. Only one question, are you sure we want to use REPLACE and not INSERT? I thought that replace contributed to the issue.

Oct 10 2024, 2:00 PM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA
jcrespo added a comment to T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication.

So this is my request for you @Volans, this is the best thing I think we can do now:

Oct 10 2024, 1:26 PM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA

Oct 7 2024

ABran-WMF moved T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication from Doing to External dep/In review on the Data-Persistence-SRE board.
Oct 7 2024, 7:40 AM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA

Oct 2 2024

jcrespo moved T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication from Backlog to Pending Review & Scorecard on the SRE-OnFire board.
Oct 2 2024, 11:44 AM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA
jcrespo added a comment to T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication.

Preliminary incident report: https://wikitech.wikimedia.org/wiki/Incidents/2024-09-18_replication

Oct 2 2024, 11:43 AM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA
jcrespo edited projects for T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication, added: Sustainability, SRE-OnFire; removed Wikimedia-production-error.
Oct 2 2024, 11:21 AM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA

Sep 19 2024

Krinkle updated the task description for T88492: Devise caching (memcached) strategy for multi-DC mediawiki.
Sep 19 2024, 3:18 AM · MediaWiki-Platform-Team, MediaWiki-libs-BagOStuff, Performance-Team, Sustainability

May 28 2024

Aklapper added a comment to T143175: Configure phabricator clustering for daemons and repositories.

As T112776: Implement phabricator database clustering support got declined in 2019, should this ticket also be declined?

May 28 2024, 10:53 AM · Release-Engineering-Team (Seen), OKR-Work, Sustainability, Phabricator
Aklapper updated the task description for T143175: Configure phabricator clustering for daemons and repositories.
May 28 2024, 7:20 AM · Release-Engineering-Team (Seen), OKR-Work, Sustainability, Phabricator

Mar 7 2024

RLazarus added projects to T359583: Provide a way to get sampled POST body logs: Sustainability, MediaWiki-Engineering, Observability-Logging.
Mar 7 2024, 8:17 PM · MW-Interfaces-Team, Sustainability (Incident Followup), Observability-Logging

Jan 16 2024

Gehel moved T336574: Review alerting around Wikidata Query Service update pipeline from Current work to Operations/SRE on the Wikidata-Query-Service board.
Jan 16 2024, 3:22 PM · Data-Platform-SRE, SRE-OnFire, Sustainability, [DEPRECATED] wdwb-tech, Wikidata, Wikidata-Query-Service
Gehel removed a project from T336574: Review alerting around Wikidata Query Service update pipeline: Discovery-Search (Current work).
Jan 16 2024, 3:20 PM · Data-Platform-SRE, SRE-OnFire, Sustainability, [DEPRECATED] wdwb-tech, Wikidata, Wikidata-Query-Service

Jan 9 2024

bking closed T336577: Update WDQS Runbook following update lag incident as Resolved.
Jan 9 2024, 6:38 PM · Data-Platform-SRE (2024.01.01 - 2024.01.21), SRE-OnFire, Sustainability, Discovery-Search (Current work), Wikimedia-Incident, Wikidata, Wikidata-Query-Service
bking added a comment to T336577: Update WDQS Runbook following update lag incident.

Based on a quick read of the linked documentation and a small addition, I believe we have satisfied the requirements. Closing...

Jan 9 2024, 6:36 PM · Data-Platform-SRE (2024.01.01 - 2024.01.21), SRE-OnFire, Sustainability, Discovery-Search (Current work), Wikimedia-Incident, Wikidata, Wikidata-Query-Service

Jan 8 2024

joanna_borun triaged T293614: Enable bracketed-paste-mode for production shells (e.g. deployment, mwmaint) as Low priority.
Jan 8 2024, 3:49 PM · Patch-For-Review, Infrastructure-Foundations, Sustainability

Dec 20 2023

Gehel moved T336577: Update WDQS Runbook following update lag incident from Incidents / Follow up to 2024.01.01 - 2024.01.21 on the Data-Platform-SRE board.
Dec 20 2023, 10:53 AM · Data-Platform-SRE (2024.01.01 - 2024.01.21), SRE-OnFire, Sustainability, Discovery-Search (Current work), Wikimedia-Incident, Wikidata, Wikidata-Query-Service
Gehel placed T336577: Update WDQS Runbook following update lag incident up for grabs.
Dec 20 2023, 10:52 AM · Data-Platform-SRE (2024.01.01 - 2024.01.21), SRE-OnFire, Sustainability, Discovery-Search (Current work), Wikimedia-Incident, Wikidata, Wikidata-Query-Service

Dec 6 2023

Gehel moved T336574: Review alerting around Wikidata Query Service update pipeline from Misc to Observability on the Data-Platform-SRE board.
Dec 6 2023, 1:24 PM · Data-Platform-SRE, SRE-OnFire, Sustainability, [DEPRECATED] wdwb-tech, Wikidata, Wikidata-Query-Service
Gehel moved T336577: Update WDQS Runbook following update lag incident from Misc to Incidents / Follow up on the Data-Platform-SRE board.
Dec 6 2023, 1:22 PM · Data-Platform-SRE (2024.01.01 - 2024.01.21), SRE-OnFire, Sustainability, Discovery-Search (Current work), Wikimedia-Incident, Wikidata, Wikidata-Query-Service

Sep 15 2023

Gehel added a parent task for T336574: Review alerting around Wikidata Query Service update pipeline: T346438: [Epic] Review alerting strategy for Data Platform SRE.
Sep 15 2023, 12:54 PM · Data-Platform-SRE, SRE-OnFire, Sustainability, [DEPRECATED] wdwb-tech, Wikidata, Wikidata-Query-Service

Sep 13 2023

Gehel moved T336574: Review alerting around Wikidata Query Service update pipeline from Ready for Work to Misc on the Data-Platform-SRE board.
Sep 13 2023, 8:55 AM · Data-Platform-SRE, SRE-OnFire, Sustainability, [DEPRECATED] wdwb-tech, Wikidata, Wikidata-Query-Service
Gehel moved T336577: Update WDQS Runbook following update lag incident from Ready for Work to Misc on the Data-Platform-SRE board.
Sep 13 2023, 8:53 AM · Data-Platform-SRE (2024.01.01 - 2024.01.21), SRE-OnFire, Sustainability, Discovery-Search (Current work), Wikimedia-Incident, Wikidata, Wikidata-Query-Service

Aug 14 2023

bking closed T337801: WDQS: Document procedure for switching between Kubernetes and Yarn Streaming Updater as Resolved.

Looks good, thanks for writing this down.

Aug 14 2023, 4:16 PM · Discovery-Search (Current work), SRE-OnFire, Sustainability

Aug 11 2023

bking added a parent task for T337801: WDQS: Document procedure for switching between Kubernetes and Yarn Streaming Updater: T336134: wdqs2*** lagged for more than one day.
Aug 11 2023, 3:03 PM · Discovery-Search (Current work), SRE-OnFire, Sustainability

Aug 7 2023

bking claimed T337801: WDQS: Document procedure for switching between Kubernetes and Yarn Streaming Updater.
Aug 7 2023, 3:07 PM · Discovery-Search (Current work), SRE-OnFire, Sustainability

Aug 6 2023

Krinkle moved T253675: Remove mod_unique_id from app servers from Watching to Perf recommendation on the Performance-Team (Radar) board.
Aug 6 2023, 10:42 PM · Wikimedia-Performance-recommendation, Sustainability, serviceops
Krinkle removed a project from T119641: Split-brain strategy for services that use config managed by etcd: Performance-Team (Radar).
Aug 6 2023, 10:22 PM · Sustainability, Epic

Jul 31 2023

dcausse moved T337801: WDQS: Document procedure for switching between Kubernetes and Yarn Streaming Updater from In Progress to Needs review on the Discovery-Search (Current work) board.

Added a few notes at: https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater#Running_from_YARN

Jul 31 2023, 9:26 AM · Discovery-Search (Current work), SRE-OnFire, Sustainability

Jul 21 2023

Gehel triaged T336574: Review alerting around Wikidata Query Service update pipeline as Medium priority.
Jul 21 2023, 9:28 AM · Data-Platform-SRE, SRE-OnFire, Sustainability, [DEPRECATED] wdwb-tech, Wikidata, Wikidata-Query-Service