
db1138 (s4 master) crashed due to memory issues
Closed, Resolved, Public

Description

MySQL on db1138 crashed due to a hardware memory error:

[33134188.608450] mce: [Hardware Error]: Machine check events logged
[33134188.608477] mce: Uncorrected hardware memory error in user-access at 7d3c38f580
[33134188.615864] {1}Hardware error detected on CPU2
[33134188.615874] {1}event severity: recoverable
[33134188.615875] {1} Error 0, type: recoverable
[33134188.615876] {1} fru_text: B4
[33134188.615876] {1}  section_type: memory error
[33134188.615877] {1}  error_status: 0x0000000000000400
[33134188.615878] {1}  physical_address: 0x0000007d3c38f580
[33134188.615880] {1}  node: 3 card: 0 module: 0 rank: 0 bank: 1 row: 55982 column: 1016
[33134188.615882] {1}  DIMM location: not present. DMI handle: 0x0000
[33134188.617181] Memory failure: 0x7d3c38f: Killing mysqld:163407 due to hardware memory corruption
[33134188.626049] Memory failure: 0x7d3c38f: recovery action for dirty LRU page: Recovered
[33134263.297543] MCE: Killing mysqld:163468 due to hardware memory corruption fault at 7feced3dc580
05/27/2020 20:20:26 Critical:  "Multi-bit memory errors detected on a memory device at location(s) DIMM_B4." in SEL on db1138
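
For anyone cross-checking the log lines above: the value in the "Memory failure" line looks like the page frame number of the reported physical address. Below is a minimal sketch of that check, assuming 4 KiB pages (the `PAGE_SHIFT` constant and the `parse` helper are illustrative, not taken from any tooling used on this host). It also compares the in-page offset of the physical address with that of the later faulting virtual address.

```python
import re

# Three lines copied verbatim from the kernel log above.
LOG = """\
[33134188.615878] {1}  physical_address: 0x0000007d3c38f580
[33134188.617181] Memory failure: 0x7d3c38f: Killing mysqld:163407 due to hardware memory corruption
[33134263.297543] MCE: Killing mysqld:163468 due to hardware memory corruption fault at 7feced3dc580
"""

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def parse(log):
    """Return (physical address, 'Memory failure' value, faulting virtual address)."""
    phys = int(re.search(r"physical_address: (0x[0-9a-fA-F]+)", log).group(1), 16)
    pfn = int(re.search(r"Memory failure: (0x[0-9a-fA-F]+):", log).group(1), 16)
    virt = int(re.search(r"fault at ([0-9a-fA-F]+)", log).group(1), 16)
    return phys, pfn, virt


phys, pfn, virt = parse(LOG)
page_mask = (1 << PAGE_SHIFT) - 1

# Physical address shifted down by the page size vs. the value the kernel printed.
print(hex(phys >> PAGE_SHIFT), hex(pfn))
# Offset inside the page for the physical and the virtual address.
print(hex(phys & page_mask), hex(virt & page_mask))
```

Both pairs match, which is consistent with (though not proof of) the second kill hitting the same poisoned page.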

What I have done for now is:

  • Decreased buffer pool size to 300GB and restarted mysql.

Let's do a master failover on Friday to the candidate master.

@wiki_willy can we get a new DIMM for this host?

Event Timeline

CDanis triaged this task as High priority. May 27 2020, 8:36 PM
CDanis subscribed.

I am running a compare from this host to its candidate master (db1081) to make sure we are good for Friday.

@Marostegui - will do, Papaul and John are working on pulling the TSR right now for the RMA. Thanks, Willy

Sent TSR report to Dell
Confirmed: Service Request 1025886499 was successfully submitted.

Change 599155 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db1081 to s4 master

https://gerrit.wikimedia.org/r/599155

Change 599156 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Update s4-master alias

https://gerrit.wikimedia.org/r/599156

Data check between db1138 and db1081 (candidate master) finished successfully.

Blocked a maintenance window on the deployment's calendar for tomorrow.

Mentioned in SAL (#wikimedia-operations) [2020-05-28T06:30:38Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Remove db1081 from API and set its weight to 0 on main traffic - preparation for tomorrow's failover T253808', diff saved to https://phabricator.wikimedia.org/P11329 and previous config saved to /var/cache/conftool/dbconfig/20200528-063037-marostegui.json

FedEx tracking says the parts will arrive on Friday 5/29. @Marostegui would you want to do this tomorrow at 3-4pm EST? I would prefer not to take a host down on a Friday afternoon, so could we wait until next Monday instead?

@Jclark-ctr @Marostegui Does this mean you're not doing the 05:00 UTC window tomorrow? We've been informing the communities about this, and set banners, so we'd need to know ASAP if this is the case.

@Johan the plan continues as usual. @Jclark-ctr's information is unrelated to the user-impacting maintenance.

I talked to @Jclark-ctr on IRC, hw replacement will likely happen on Tuesday next week.

The emergency software maintenance (read-only) that is needed *before* the hardware maintenance is still scheduled for tomorrow, Friday, at 05:00 UTC.

Mentioned in SAL (#wikimedia-operations) [2020-05-29T04:25:19Z] <marostegui> Start topology changes in s4 - T253808

Change 599155 merged by Marostegui:
[operations/puppet@production] mariadb: Promote db1081 to s4 master

https://gerrit.wikimedia.org/r/599155

Mentioned in SAL (#wikimedia-operations) [2020-05-29T05:00:38Z] <marostegui> Starting s4 failover from db1138 to db1081 -T253808

Mentioned in SAL (#wikimedia-operations) [2020-05-29T05:01:53Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set s4 as read-only for maintenance T253808', diff saved to https://phabricator.wikimedia.org/P11333 and previous config saved to /var/cache/conftool/dbconfig/20200529-050153-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-05-29T05:02:25Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote db1081 to s4 master and remove read-only from s4 T253808', diff saved to https://phabricator.wikimedia.org/P11334 and previous config saved to /var/cache/conftool/dbconfig/20200529-050224-marostegui.json

Change 599156 merged by Marostegui:
[operations/dns@master] wmnet: Update s4-master alias

https://gerrit.wikimedia.org/r/599156

The master failover was done successfully.
RO started at 05:01:54
RO stopped at 05:02:25

Total RO time: 31 seconds
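
The read-only figure can be reproduced from the two timestamps above. This is only a sketch of the arithmetic; the variable names are illustrative.

```python
from datetime import datetime

FMT = "%H:%M:%S"

# Timestamps as reported above (UTC, 2020-05-29).
ro_start = datetime.strptime("05:01:54", FMT)
ro_stop = datetime.strptime("05:02:25", FMT)

# Difference between the two timestamps, in whole seconds.
ro_seconds = int((ro_stop - ro_start).total_seconds())
print(f"Total RO time: {ro_seconds} seconds")
```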

db1138 is no longer the s4 master, so we can do the on-site maintenance on Tuesday if @Jclark-ctr is available. Can you confirm?
I will leave db1138 depooled for the weekend.

Change 599583 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1138: Disable notifications

https://gerrit.wikimedia.org/r/599583

Change 599583 merged by Marostegui:
[operations/puppet@production] db1138: Disable notifications

https://gerrit.wikimedia.org/r/599583

@Jclark-ctr could you confirm whether you want to do this maintenance today, Monday 1st June, or tomorrow, Tuesday 2nd June?

John confirmed via IRC that the maintenance will be done on Tuesday - thank you!

Mentioned in SAL (#wikimedia-operations) [2020-06-02T07:06:35Z] <marostegui> Stop MySQL and poweroff on db1138 for on-site maintenance - T253808

@Jclark-ctr db1138 is now off and ready for you to change the memory whenever you get to the DC.
Once you are done, please power the host back on and I will take it from there.

Thank you!

@Marostegui Replaced the failed DIMM. The host is powered back on.

Thank you, I will take it from here

Change 601948 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1138: Enable notifications

https://gerrit.wikimedia.org/r/601948

Change 601948 merged by Marostegui:
[operations/puppet@production] db1138: Enable notifications

https://gerrit.wikimedia.org/r/601948

Mentioned in SAL (#wikimedia-operations) [2020-06-03T05:09:12Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool db1138 T253808', diff saved to https://phabricator.wikimedia.org/P11369 and previous config saved to /var/cache/conftool/dbconfig/20200603-050911-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-06-03T05:37:48Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool db1138 T253808', diff saved to https://phabricator.wikimedia.org/P11370 and previous config saved to /var/cache/conftool/dbconfig/20200603-053748-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-06-03T06:01:25Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool db1138 T253808', diff saved to https://phabricator.wikimedia.org/P11373 and previous config saved to /var/cache/conftool/dbconfig/20200603-060124-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-06-03T06:37:52Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Fully repool db1138 T253808', diff saved to https://phabricator.wikimedia.org/P11374 and previous config saved to /var/cache/conftool/dbconfig/20200603-063752-marostegui.json
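
For reference, the pacing of the repool can be read off the four SAL timestamps above. The sketch below only does the time arithmetic; it does not talk to dbctl, and the `parse_utc` helper is an illustrative name.

```python
from datetime import datetime

# (commit message, SAL timestamp) pairs copied from the entries above.
commits = [
    ("Slowly repool db1138", "2020-06-03T05:09:12Z"),
    ("Slowly repool db1138", "2020-06-03T05:37:48Z"),
    ("Slowly repool db1138", "2020-06-03T06:01:25Z"),
    ("Fully repool db1138", "2020-06-03T06:37:52Z"),
]


def parse_utc(ts):
    """Parse a SAL-style timestamp such as 2020-06-03T05:09:12Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


times = [parse_utc(ts) for _, ts in commits]

# Elapsed time between consecutive repool steps, and from first step to full repool.
gaps = [later - earlier for earlier, later in zip(times, times[1:])]
total = times[-1] - times[0]

for gap in gaps:
    print("step after", gap)
print("first step to full repool:", total)
```

The steps were roughly 24 to 36 minutes apart, and the whole repool took just under an hour and a half.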

Host repooled. All done.
Thanks John for replacing the memory!