Tasks relating to the stability and availability of Wikimedia Foundation production services. (Unrelated to environmental sustainability.)
Details
Oct 22 2024
Oct 21 2024
Deployment went well, I will update the incident doc with the long-term fix and then call this resolved.
Change #1081103 merged by Jcrespo:
[operations/puppet@production] mariadb: Default pt-heartbeat to STATEMENT-based replication
Oct 18 2024
Change #1079537 merged by jenkins-bot:
[operations/cookbooks@master] sre.switchdc.databases: allow to select a section
Change #1079536 merged by jenkins-bot:
[operations/cookbooks@master] sre.switchdc.databases.prepare: add binlog check
Let's carefully merge https://gerrit.wikimedia.org/r/1081103 early next week (so we can monitor that it doesn't affect production hosts). CC @Ladsgroup @ABran-WMF
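For context on what that change does: the dry-run output elsewhere in this task logs "Binlog format is STATEMENT. No heartbeat corrective action needed.", i.e. the cookbook only needs to intervene when pt-heartbeat would run under a non-STATEMENT binlog format. A minimal sketch of that decision (function name and return values are illustrative, not the cookbook's actual code):

```python
def heartbeat_action(binlog_format: str) -> str:
    """Decide whether pt-heartbeat needs a corrective action for a given
    binlog format. Illustrative sketch only, not the cookbook's code."""
    if binlog_format.upper() == "STATEMENT":
        # Matches the dry-run log line:
        # "Binlog format is STATEMENT. No heartbeat corrective action needed."
        return "none"
    # Under ROW (or MIXED falling back to ROW), pt-heartbeat's REPLACE is
    # binlogged as row events, which is what caused the incident.
    return "restart pt-heartbeat with STATEMENT-based replication"
```

The merged puppet change makes STATEMENT the default for pt-heartbeat, so the corrective branch should normally not be needed.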
Change #1074127 merged by jenkins-bot:
[operations/cookbooks@master] sre.switchdc.databases.prepare: add check
Oct 17 2024
The change that potentially caused it was: https://gerrit.wikimedia.org/r/c/operations/puppet/+/693162
Change #1081103 had a related patch set uploaded (by Jcrespo; author: Jcrespo):
[operations/puppet@production] mariadb: Default pt-heartbeat STATEMENT-based replication
Oct 16 2024
Thanks Riccardo, as I said on IRC it looked like a minor issue so I wasn't too worried, and it was. You are very fast at this, so big ❤ to you. Will continue testing, probably tomorrow morning.
The cause of that dry-run failure was the check, added here, that MASTER_FROM is correctly replicating from MASTER_TO.
I need to research the line 255 change further:
self._validate_slave_status(f"MASTER_TO {self.master_to.host}", status, expected)
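That call compares the SHOW SLAVE STATUS output against a set of expected field values. A rough sketch of what such a validation might look like, keyed on the fields visible in the dry-run log below (the helper is hypothetical, not spicerack's or the cookbook's actual implementation):

```python
def validate_slave_status(prefix: str, status: dict, expected: dict) -> None:
    """Compare SHOW SLAVE STATUS fields against expected values and raise
    on any mismatch. Hypothetical sketch, not the real cookbook code."""
    for key, want in expected.items():
        got = status.get(key)
        if got != want:
            raise RuntimeError(f"{prefix}: {key}={got!r}, expected {want!r}")

# Field names and values taken from the dry-run output in this task.
status = {
    "Master_Host": "db1125.eqiad.wmnet",
    "Master_User": "repl2024",
    "Master_Port": 3306,
    "Slave_IO_Running": "Yes",
    "Slave_SQL_Running": "Yes",
    "Last_IO_Errno": 0,
    "Last_SQL_Errno": 0,
}
validate_slave_status(
    "MASTER_TO db2230.codfw.wmnet",
    status,
    {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes"},
)
```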
```
$ test-cookbook -c 1079537 --dry-run sre.switchdc.databases.prepare --section test-s4 -t T375144 eqiad codfw
...
DRY-RUN: MASTER_TO db2230.codfw.wmnet Ignoring MASTER STATUS is not stable in DRY-RUN
DRY-RUN: [test-s4] Binlog format is STATEMENT. No heartbeat corrective action needed.
DRY-RUN: [test-s4] MASTER_FROM db1125.eqiad.wmnet CHANGE MASTER to ReplicationInfo(primary='db2230.codfw.wmnet', binlog='db2230-bin.000005', position=201650676, port=3306)
DRY-RUN: MASTER_FROM db1125.eqiad.wmnet CHANGE MASTER to ReplicationInfo(primary='db2230.codfw.wmnet', binlog='db2230-bin.000005', position=201650676, port=3306) and user repl2024
DRY-RUN: Executing commands ['/usr/local/bin/mysql --socket /run/mysqld/mysqld.sock --batch --execute "START SLAVE"'] on 1 hosts: db1125.eqiad.wmnet
DRY-RUN: MASTER_FROM db1125.eqiad.wmnet START SLAVE
DRY-RUN: MASTER_FROM db1125.eqiad.wmnet skipping replication from MASTER_TO db2230.codfw.wmnet verification
DRY-RUN: Executing commands ['/bin/systemctl start pt-heartbeat-wikimedia.service'] on 1 hosts: db2230.codfw.wmnet
DRY-RUN: MASTER_TO db2230.codfw.wmnet started pt-heartbeat.
DRY-RUN: Executing commands ['/usr/local/bin/mysql --socket /run/mysqld/mysqld.sock --batch --execute "START SLAVE"'] on 1 hosts: db2230.codfw.wmnet
DRY-RUN: MASTER_TO db2230.codfw.wmnet START SLAVE.
DRY-RUN: Executing commands ['/usr/local/bin/mysql --socket /run/mysqld/mysqld.sock --batch --execute "SHOW SLAVE STATUS\\G"'] on 1 hosts: db2230.codfw.wmnet
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Master_Host=db1125.eqiad.wmnet
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Master_User=repl2024
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Master_Port=3306
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Slave_IO_Running=Yes
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Slave_SQL_Running=Yes
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Last_IO_Errno=0
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Last_SQL_Errno=0
DRY-RUN: MASTER_TO db2230.codfw.wmnet replication from MASTER_FROM db1125.eqiad.wmnet verified
DRY-RUN: Executing commands ['/usr/local/bin/mysql --socket /run/mysqld/mysqld.sock --batch --execute "SHOW SLAVE STATUS\\G"'] on 1 hosts: db1125.eqiad.wmnet
DRY-RUN: Failed to run cookbooks.sre.switchdc.databases.prepare.PrepareSection.master_from_check_replication: SHOW SLAVE STATUS seems to have been executed on a master.
DRY-RUN: Traceback
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/wmflib/interactive.py", line 183, in confirm_on_failure
    ret = func(*args, **kwargs)
  File "/home/jynus/cookbooks_testing/cookbooks/cookbooks/sre/switchdc/databases/prepare.py", line 331, in master_from_check_replication
    status = self.master_from.show_slave_status()
  File "/usr/lib/python3/dist-packages/spicerack/mysql_legacy.py", line 201, in show_slave_status
    raise MysqlLegacyError(f"{sql} seems to have been executed on a master.")
spicerack.mysql_legacy.MysqlLegacyError: SHOW SLAVE STATUS seems to have been executed on a master.
==> What do you want to do?
"retry" the last command, manually fix the issue and "skip" the last command to continue the execution or completely "abort" the execution.
>
```
Oct 11 2024
@jcrespo I had it almost finished yesterday but then I had to step out; I've sent the patches. If you test with test-cookbook, using the last one (1079537) as the CR, you'll also be testing all the other pending improvements that were done but not yet merged.
The last one also allows testing on a custom section, so you can pass --section test-s4 and it should do the right thing.
Change #1079537 had a related patch set uploaded (by Volans; author: Volans):
[operations/cookbooks@master] sre.switchdc.databases: allow to select a section
Change #1079536 had a related patch set uploaded (by Volans; author: Volans):
[operations/cookbooks@master] sre.switchdc.databases.prepare: fix heartbeat
Oct 10 2024
Yes, REPLACE is not the issue; the issue is that ROW translates it into an UPDATE or a DELETE + INSERT, and those cause the same problems when applied in the wrong case (but do the right thing if the row genuinely had to be inserted). We want to keep REPLACE: even if we used INSERT IGNORE, it wouldn't fix things for replicas. The problem is the ROW behavior, not the query itself.
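To illustrate the point about ROW translating REPLACE into row events, here is a toy model (assumptions: the heartbeat table is modeled as a dict keyed by server_id, and the event shapes are invented for illustration). Under STATEMENT the replica re-executes the REPLACE itself, which upserts correctly even if the replica's copy of the row has diverged; under ROW the replica replays the concrete row change the master computed, which goes wrong when the replica's state differs:

```python
# Toy model of a heartbeat table: {server_id: timestamp}.

def apply_statement(table: dict, server_id: int, ts: str) -> None:
    # STATEMENT replication: the replica re-runs the REPLACE, which
    # upserts regardless of the replica's current state.
    table[server_id] = ts

def apply_row_event(table: dict, event: tuple) -> None:
    # ROW replication: the replica replays the exact change the master
    # computed. If the master had the row, REPLACE becomes an UPDATE
    # event; a replica missing that row cannot apply it correctly.
    kind, server_id, ts = event
    if kind == "UPDATE":
        if server_id not in table:
            raise RuntimeError(f"row {server_id} not found on replica")
        table[server_id] = ts
    elif kind == "WRITE":
        table[server_id] = ts

replica = {}  # replica is missing the heartbeat row
apply_statement(replica, 171, "2024-09-18T10:00:00")  # statement upserts fine
try:
    apply_row_event({}, ("UPDATE", 171, "2024-09-18T10:00:00"))
except RuntimeError as e:
    print(e)  # the ROW replay fails where the statement succeeded
```

The same asymmetry is why defaulting pt-heartbeat to STATEMENT is the safer long-term fix here.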
Thanks @jcrespo for the detailed request. I'll get to it. Only one question: are you sure we want to use REPLACE and not INSERT? I thought REPLACE contributed to the issue.
So this is my request for you, @Volans; I think this is the best thing we can do now:
Oct 7 2024
Oct 2 2024
Preliminary incident report: https://wikitech.wikimedia.org/wiki/Incidents/2024-09-18_replication
Sep 19 2024
May 28 2024
As T112776: Implement phabricator database clustering support was declined in 2019, should this ticket also be declined?
Mar 7 2024
Jan 16 2024
Jan 9 2024
Based on a quick read of the linked documentation and a small addition, I believe we have satisfied the requirements. Closing...
Jan 8 2024
Dec 20 2023
Dec 6 2023
Sep 15 2023
Sep 13 2023
Aug 14 2023
Looks good, thanks for writing this down.