Tasks relating to the stability and availability of Wikimedia Foundation production services. (Unrelated to environmental sustainability.)
Details
Oct 22 2024
Oct 21 2024
Deployment went well, I will update the incident doc with the long-term fix and then call this resolved.
Change #1081103 merged by Jcrespo:
[operations/puppet@production] mariadb: Default pt-heartbeat to STATEMENT-based replication
Oct 18 2024
Change #1079537 merged by jenkins-bot:
[operations/cookbooks@master] sre.switchdc.databases: allow to select a section
Change #1079536 merged by jenkins-bot:
[operations/cookbooks@master] sre.switchdc.databases.prepare: add binlog check
Let's carefully merge https://gerrit.wikimedia.org/r/1081103 early next week (so we can monitor that it doesn't affect production hosts). CC @Ladsgroup @ABran-WMF
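For context on what that change does: the dry-run output elsewhere in this task logs "Binlog format is STATEMENT. No heartbeat corrective action needed.", i.e. the cookbook only needs to intervene when pt-heartbeat would run under a non-STATEMENT binlog format. A minimal sketch of that decision (function name and return values are illustrative, not the cookbook's actual code):

```python
def heartbeat_action(binlog_format: str) -> str:
    """Decide whether pt-heartbeat needs a corrective action for a given
    binlog format. Illustrative sketch only, not the cookbook's code."""
    if binlog_format.upper() == "STATEMENT":
        # Matches the dry-run log line:
        # "Binlog format is STATEMENT. No heartbeat corrective action needed."
        return "none"
    # Under ROW (or MIXED falling back to ROW), pt-heartbeat's REPLACE is
    # binlogged as row events, which is what caused the incident.
    return "restart pt-heartbeat with STATEMENT-based replication"
```

The merged puppet change makes STATEMENT the default for pt-heartbeat, so the corrective branch should normally not be needed.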
Change #1074127 merged by jenkins-bot:
[operations/cookbooks@master] sre.switchdc.databases.prepare: add check
Oct 17 2024
The change that potentially caused it was: https://gerrit.wikimedia.org/r/c/operations/puppet/+/693162
Change #1081103 had a related patch set uploaded (by Jcrespo; author: Jcrespo):
[operations/puppet@production] mariadb: Default pt-heartbeat STATEMENT-based replication
Oct 16 2024
Thanks Riccardo, as I said on IRC it looked like a minor issue so I wasn't too worried, and it was. You are very fast at this, so big ❤ to you. Will continue testing, probably tomorrow morning.
The cause of that dry-run failure was the check, added here, that MASTER_FROM is correctly replicating from MASTER_TO.
I need to research the line 255 change further:
self._validate_slave_status(f"MASTER_TO {self.master_to.host}", status, expected)
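That call compares the SHOW SLAVE STATUS output against a set of expected field values. A rough sketch of what such a validation might look like, keyed on the fields visible in the dry-run log below (the helper is hypothetical, not spicerack's or the cookbook's actual implementation):

```python
def validate_slave_status(prefix: str, status: dict, expected: dict) -> None:
    """Compare SHOW SLAVE STATUS fields against expected values and raise
    on any mismatch. Hypothetical sketch, not the real cookbook code."""
    for key, want in expected.items():
        got = status.get(key)
        if got != want:
            raise RuntimeError(f"{prefix}: {key}={got!r}, expected {want!r}")

# Field names and values taken from the dry-run output in this task.
status = {
    "Master_Host": "db1125.eqiad.wmnet",
    "Master_User": "repl2024",
    "Master_Port": 3306,
    "Slave_IO_Running": "Yes",
    "Slave_SQL_Running": "Yes",
    "Last_IO_Errno": 0,
    "Last_SQL_Errno": 0,
}
validate_slave_status(
    "MASTER_TO db2230.codfw.wmnet",
    status,
    {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes"},
)
```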
```
$ test-cookbook -c 1079537 --dry-run sre.switchdc.databases.prepare --section test-s4 -t T375144 eqiad codfw
...
DRY-RUN: MASTER_TO db2230.codfw.wmnet Ignoring MASTER STATUS is not stable in DRY-RUN
DRY-RUN: [test-s4] Binlog format is STATEMENT. No heartbeat corrective action needed.
DRY-RUN: [test-s4] MASTER_FROM db1125.eqiad.wmnet CHANGE MASTER to ReplicationInfo(primary='db2230.codfw.wmnet', binlog='db2230-bin.000005', position=201650676, port=3306)
DRY-RUN: MASTER_FROM db1125.eqiad.wmnet CHANGE MASTER to ReplicationInfo(primary='db2230.codfw.wmnet', binlog='db2230-bin.000005', position=201650676, port=3306) and user repl2024
DRY-RUN: Executing commands ['/usr/local/bin/mysql --socket /run/mysqld/mysqld.sock --batch --execute "START SLAVE"'] on 1 hosts: db1125.eqiad.wmnet
DRY-RUN: MASTER_FROM db1125.eqiad.wmnet START SLAVE
DRY-RUN: MASTER_FROM db1125.eqiad.wmnet skipping replication from MASTER_TO db2230.codfw.wmnet verification
DRY-RUN: Executing commands ['/bin/systemctl start pt-heartbeat-wikimedia.service'] on 1 hosts: db2230.codfw.wmnet
DRY-RUN: MASTER_TO db2230.codfw.wmnet started pt-heartbeat.
DRY-RUN: Executing commands ['/usr/local/bin/mysql --socket /run/mysqld/mysqld.sock --batch --execute "START SLAVE"'] on 1 hosts: db2230.codfw.wmnet
DRY-RUN: MASTER_TO db2230.codfw.wmnet START SLAVE.
DRY-RUN: Executing commands ['/usr/local/bin/mysql --socket /run/mysqld/mysqld.sock --batch --execute "SHOW SLAVE STATUS\\G"'] on 1 hosts: db2230.codfw.wmnet
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Master_Host=db1125.eqiad.wmnet
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Master_User=repl2024
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Master_Port=3306
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Slave_IO_Running=Yes
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Slave_SQL_Running=Yes
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Last_IO_Errno=0
DRY-RUN: [test-s4] MASTER_TO db2230.codfw.wmnet checking SLAVE STATUS Last_SQL_Errno=0
DRY-RUN: MASTER_TO db2230.codfw.wmnet replication from MASTER_FROM db1125.eqiad.wmnet verified
DRY-RUN: Executing commands ['/usr/local/bin/mysql --socket /run/mysqld/mysqld.sock --batch --execute "SHOW SLAVE STATUS\\G"'] on 1 hosts: db1125.eqiad.wmnet
DRY-RUN: Failed to run cookbooks.sre.switchdc.databases.prepare.PrepareSection.master_from_check_replication: SHOW SLAVE STATUS seems to have been executed on a master.
DRY-RUN: Traceback
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/wmflib/interactive.py", line 183, in confirm_on_failure
    ret = func(*args, **kwargs)
  File "/home/jynus/cookbooks_testing/cookbooks/cookbooks/sre/switchdc/databases/prepare.py", line 331, in master_from_check_replication
    status = self.master_from.show_slave_status()
  File "/usr/lib/python3/dist-packages/spicerack/mysql_legacy.py", line 201, in show_slave_status
    raise MysqlLegacyError(f"{sql} seems to have been executed on a master.")
spicerack.mysql_legacy.MysqlLegacyError: SHOW SLAVE STATUS seems to have been executed on a master.
==> What do you want to do?
"retry" the last command, manually fix the issue and "skip" the last command to continue the execution or completely "abort" the execution.
>
```
Oct 11 2024
@jcrespo I had it almost finished yesterday but then I had to step out; I've sent the patches. If you test with test-cookbook, using the last one (1079537) as the CR, you'll also be testing all the other pending improvements that were done but not yet merged.
The last one also allows testing on a custom section, so you can pass --section test-s4 and it should do the right thing.
Change #1079537 had a related patch set uploaded (by Volans; author: Volans):
[operations/cookbooks@master] sre.switchdc.databases: allow to select a section
Change #1079536 had a related patch set uploaded (by Volans; author: Volans):
[operations/cookbooks@master] sre.switchdc.databases.prepare: fix heartbeat
Oct 10 2024
Yes, REPLACE is not the issue; the issue is that ROW translates it into an UPDATE or a DELETE + INSERT, and those cause the same problems when applied in the wrong case (but do the right thing if the row genuinely had to be inserted). We want to keep REPLACE: even if we used INSERT IGNORE, it wouldn't fix things for replicas. The problem is the ROW behavior, not the query itself.
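To illustrate the point about ROW translating REPLACE into row events, here is a toy model (assumptions: the heartbeat table is modeled as a dict keyed by server_id, and the event shapes are invented for illustration). Under STATEMENT the replica re-executes the REPLACE itself, which upserts correctly even if the replica's copy of the row has diverged; under ROW the replica replays the concrete row change the master computed, which goes wrong when the replica's state differs:

```python
# Toy model of a heartbeat table: {server_id: timestamp}.

def apply_statement(table: dict, server_id: int, ts: str) -> None:
    # STATEMENT replication: the replica re-runs the REPLACE, which
    # upserts regardless of the replica's current state.
    table[server_id] = ts

def apply_row_event(table: dict, event: tuple) -> None:
    # ROW replication: the replica replays the exact change the master
    # computed. If the master had the row, REPLACE becomes an UPDATE
    # event; a replica missing that row cannot apply it correctly.
    kind, server_id, ts = event
    if kind == "UPDATE":
        if server_id not in table:
            raise RuntimeError(f"row {server_id} not found on replica")
        table[server_id] = ts
    elif kind == "WRITE":
        table[server_id] = ts

replica = {}  # replica is missing the heartbeat row
apply_statement(replica, 171, "2024-09-18T10:00:00")  # statement upserts fine
try:
    apply_row_event({}, ("UPDATE", 171, "2024-09-18T10:00:00"))
except RuntimeError as e:
    print(e)  # the ROW replay fails where the statement succeeded
```

The same asymmetry is why defaulting pt-heartbeat to STATEMENT is the safer long-term fix here.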
Thanks @jcrespo for the detailed request. I'll get to it. Only one question: are you sure we want to use REPLACE and not INSERT? I thought REPLACE contributed to the issue.
So this is my request for you, @Volans; I think this is the best thing we can do now:
Oct 7 2024
Oct 2 2024
Preliminary incident report: https://wikitech.wikimedia.org/wiki/Incidents/2024-09-18_replication
Sep 19 2024
May 28 2024
As T112776: Implement phabricator database clustering support was declined in 2019, should this ticket also be declined?
Mar 7 2024
Jan 16 2024
Jan 9 2024
Based on a quick read of the linked documentation and a small addition, I believe we have satisfied the requirements. Closing...
Jan 8 2024
Dec 20 2023
Dec 6 2023
Sep 15 2023
Sep 13 2023
Aug 14 2023
Looks good, thanks for writing this down.