
Switchover s7 master db1181 -> db1136
Closed, Resolved, Public

Description

When: Thursday 21 July 2022, 06:00 UTC

  • Team calendar invite

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s7.dblist

Checklist:

NEW primary: db1136
OLD primary: db1181

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db1181.eqiad.wmnet h=db1136.eqiad.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s7 T313383" 'A:db-section-s7'
  • Set NEW primary with weight 0 (and depool it from API or vslow/dump groups if it is present).
sudo dbctl instance db1136 set-weight 0
sudo dbctl config commit -m "Set db1136 with weight 0 T313383"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --only-slave-move db1181 db1136
  • Disable puppet on both nodes
sudo cumin 'db1181* or db1136*' 'disable-puppet "primary switchover T313383"'
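The prep steps above differ between switchovers only in the host names, the section and the task ID. A minimal sketch of a dry-run helper that prints the same commands for review before anything is run by hand; the function name print_prep_commands and its argument order are hypothetical and not part of any existing tooling:

```shell
#!/usr/bin/env bash
# Dry run only: prints the failover-prep commands, executes nothing.
# print_prep_commands is a hypothetical helper for this sketch.
# Arguments: OLD primary, NEW primary, section, task ID.
print_prep_commands() {
    local old="$1" new="$2" section="$3" task="$4"
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 0
sudo dbctl config commit -m "Set ${new} with weight 0 ${task}"
sudo db-switchover --timeout=25 --only-slave-move ${old} ${new}
sudo cumin '${old}* or ${new}*' 'disable-puppet "primary switchover ${task}"'
EOF
}

print_prep_commands db1181 db1136 s7 T313383
```

Printing instead of executing keeps each step a deliberate manual action, which matches how the checklist is worked through.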

Failover:

  • Log the failover:
!log Starting s7 eqiad failover from db1181 to db1136 - T313383
  • Set section read-only:
sudo dbctl --scope eqiad section s7 ro "Maintenance until 06:15 UTC - T313383"
sudo dbctl config commit -m "Set s7 eqiad as read-only for maintenance - T313383"
  • Check s7 is indeed read-only
  • Switch primaries:
sudo db-switchover --skip-slave-move db1181 db1136
echo "===== db1181 (OLD)"; sudo db-mysql db1181 -e 'show slave status\G'
echo "===== db1136 (NEW)"; sudo db-mysql db1136 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope eqiad section s7 set-master db1136
sudo dbctl --scope eqiad section s7 rw
sudo dbctl config commit -m "Promote db1136 to s7 primary and set section read-write T313383"
  • Restart puppet on both hosts:
sudo cumin 'db1181* or db1136*' 'run-puppet-agent -e "primary switchover T313383"'
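The two show slave status commands in the switch step are read by eye. As a rough sketch, the check on the old primary could be scripted, assuming db1181 is expected to replicate from db1136 after the switch (which the cleanup step for events_coredb_slave.sql on the new slave db1181 implies). The helper name replication_running is hypothetical; it only inspects the standard Slave_IO_Running and Slave_SQL_Running fields:

```shell
#!/usr/bin/env bash
# replication_running is a hypothetical helper for this sketch.
# It reads `show slave status\G` output on stdin and succeeds only if
# both replication threads report "Yes". Empty output (no replication
# configured) counts as failure.
replication_running() {
    local status
    status="$(cat)"
    printf '%s\n' "$status" | grep -Eq '^[[:space:]]*Slave_IO_Running:[[:space:]]*Yes[[:space:]]*$' &&
    printf '%s\n' "$status" | grep -Eq '^[[:space:]]*Slave_SQL_Running:[[:space:]]*Yes[[:space:]]*$'
}

# Assumed usage, mirroring the commands above:
#   sudo db-mysql db1181 -e 'show slave status\G' | replication_running \
#       && echo "db1181 is replicating"
```

This does not check which host db1181 replicates from or how far it lags, so it complements the manual inspection rather than replacing it.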

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db1136 heartbeat -e "delete from heartbeat where file like 'db1181%';"
  • Change events for query killer:
events_coredb_master.sql on the new primary db1136
events_coredb_slave.sql on the new slave db1181
  • Update the candidate primary flag in dbctl and orchestrator:
sudo dbctl instance db1181 set-candidate-master --section s7 true
sudo dbctl instance db1136 set-candidate-master --section s7 false
(dborch1001): sudo orchestrator-client -c untag -i db1136 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db1181 --tag name=candidate
  • Check that zarcillo lists the new primary:
sudo db-mysql db1115 zarcillo -e "select * from masters where section = 's7';"
  • (If needed): Depool db1181 for maintenance.
sudo dbctl instance db1181 depool
sudo dbctl config commit -m "Depool db1181 T313383"
  • Change db1181 weight to mimic the previous weight of db1136:
sudo dbctl instance db1181 edit
  • Update/resolve this ticket.

Event Timeline

Marostegui updated Other Assignee, added: Ladsgroup.
Marostegui added a project: User-notice.
Marostegui updated the task description.
Marostegui added a subscriber: Trizek-WMF.

@Trizek-WMF this needs to happen tomorrow as part of an emergency switch issue (see parent task)

Marostegui moved this task from Triage to In progress on the DBA board.

Change 815709 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Failover s7 master

https://gerrit.wikimedia.org/r/815709

Change 815710 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1136 to s7 master

https://gerrit.wikimedia.org/r/815710

> @Trizek-WMF this needs to happen tomorrow as part of an emergency switch issue (see parent task)

Sorry, I was off. I only got your message now.
We will announce it retroactively, to inform the communities about the switch, as emergencies happen too.

Mentioned in SAL (#wikimedia-operations) [2022-07-21T05:13:40Z] <root@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s7 T313383

Mentioned in SAL (#wikimedia-operations) [2022-07-21T05:13:59Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set db1136 with weight 0 T313383', diff saved to https://phabricator.wikimedia.org/P31559 and previous config saved to /var/cache/conftool/dbconfig/20220721-051358-root.json

Mentioned in SAL (#wikimedia-operations) [2022-07-21T05:14:09Z] <root@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s7 T313383

Change 815710 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1136 to s7 master

https://gerrit.wikimedia.org/r/815710

Mentioned in SAL (#wikimedia-operations) [2022-07-21T06:00:29Z] <marostegui> Starting s7 eqiad failover from db1181 to db1136 - T313383

Mentioned in SAL (#wikimedia-operations) [2022-07-21T06:00:37Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set s7 eqiad as read-only for maintenance - T313383', diff saved to https://phabricator.wikimedia.org/P31561 and previous config saved to /var/cache/conftool/dbconfig/20220721-060037-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2022-07-21T06:01:12Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote db1136 to s7 primary and set section read-write T313383', diff saved to https://phabricator.wikimedia.org/P31562 and previous config saved to /var/cache/conftool/dbconfig/20220721-060112-root.json

Change 815709 merged by Marostegui:

[operations/dns@master] wmnet: Failover s7 master

https://gerrit.wikimedia.org/r/815709

This was done, pending repooling db1181.
RO starts: 06:00:37
RO stops: 06:01:12

Total read only time: 35 seconds
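
The read-only window can be recomputed from the two dbctl commit timestamps logged in SAL (06:00:37 and 06:01:12 on 2022-07-21). A small sketch, assuming GNU date is available; ro_duration_seconds is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# ro_duration_seconds is a hypothetical helper for this sketch.
# It assumes GNU date, which parses timestamps via -d.
# Arguments: RO start and RO stop, both in UTC.
ro_duration_seconds() {
    local start end
    start="$(date -u -d "$1" +%s)"
    end="$(date -u -d "$2" +%s)"
    echo $(( end - start ))
}

ro_duration_seconds "2022-07-21 06:00:37 UTC" "2022-07-21 06:01:12 UTC"   # prints 35
```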

db1181 is being repooled