
(Need By: 2021-04-30) rack/setup/install backup100[4-7]
Closed, Resolved · Public

Description

This task will track the racking, setup, and OS installation of backup100[4-7]

Hostname / Racking / Installation Details

hostname: backup1004-1007
Racking Proposal: Anywhere in the 10G racks with space for a full system; the hosts must not share a rack and should ideally not share a row.
Networking/Subnet/VLAN/IP: 10G, production-eqiad-network.
Partitioning/Raid: Software RAID1 for the (2) OS SSDs and HW RAID6 with writeback for the (24) HDs. The recipe is the same as for all other backup hosts: custom/backup-format.cfg. More details at: https://wikitech.wikimedia.org/wiki/Raid_setup#Dell_R740xd2 (see the layout check sketched after these details).
OS Distro: Buster
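Not part of the original details, but as a rough illustration of the intended layout, here is a minimal post-install check. It assumes the two SSDs come up as sda/sdb carrying the md software RAID1 (named md0) and that the H730P exposes the (24) HDs as a single RAID6 virtual disk appearing as sdc; all of these device names are assumptions, not something guaranteed by backup-format.cfg.

  # Hedged sketch: sda/sdb/sdc and md0 are assumed device names.

  # Block devices: expect two small SSDs plus one large RAID6 virtual disk.
  lsblk -o NAME,SIZE,TYPE,MODEL

  # The software RAID1 over the two SSDs should show up as an md array.
  cat /proc/mdstat
  sudo mdadm --detail /dev/md0

  # The PERC-backed RAID6 virtual disk should appear as one large block device.
  sudo parted /dev/sdc print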

Per host setup checklist

backup1004:

  • - receive in system on procurement task T264674 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
    • end on-site specific steps
  • - update firmware: idrac 5.00.00.00_A00, bios 2.11.2, h730p 25.5.8.0001_A16, network 2.1.80
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, plus site.pp role(insetup); cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host (see the command sketch after this checklist)
  • - host state in netbox set to staged
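Not part of the original checklist, but roughly the command sequence the last few steps map to when run from a cumin host; the cookbook message, homer device filter, task ID, and wmf-auto-reimage flags below are assumptions/placeholders, so check --help and wikitech before relying on them.

  # Hedged sketch for backup1004; arguments are assumptions, not the exact commands used on this task.

  # Propagate the mgmt and production DNS records added in Netbox.
  sudo cookbook sre.dns.netbox "Add backup1004 mgmt/production DNS records"

  # Push the switch port configuration generated from Netbox (device filter is assumed).
  homer "asw2*eqiad*" commit "Configure switch port for backup1004"

  # Reimage and do the initial puppet run; the -p (Phabricator task) flag is assumed and TXXXXXX is a placeholder.
  sudo -i wmf-auto-reimage-host -p TXXXXXX backup1004.eqiad.wmnet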

backup1005:

  • - receive in system on procurement task T264674 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
    • end on-site specific steps
  • - update firmware: idrac 5.00.00.00_A00, bios 2.11.2, h730p 25.5.8.0001_A16, network 2.1.80
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, plus site.pp role(insetup); cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

backup1006:

  • - receive in system on procurement task T264674 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
    • end on-site specific steps
  • - update firmware: idrac 5.00.00.00_A00, bios 2.11.2, h730p 25.5.8.0001_A16, network 2.1.80
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, plus site.pp role(insetup); cp systems use role(insetup::nofirm).
  • - Hardware Error: Please note that as of 2021-07-12, with all firmware updates applied, this host reports the error: "The System Board CP Right is absent."
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

backup1007:

  • - receive in system on procurement task T264674 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
    • end on-site specific steps
  • - update firmware: idrac 5.00.00.00_A00, bios 2.11.2, h730p 25.5.8.0001_A16, network 2.1.80
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, plus site.pp role(insetup); cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.

Event Timeline

RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH added a parent task: Unknown Object (Task).
RobH mentioned this in Unknown Object (Task).
RobH unsubscribed.

@wiki_willy we are short on 2U spaces in 10G racks while keeping these hosts on diverse racks/rows.

Hi @Jclark-ctr - are there specific racks that you need the space in? We also have some high-priority 740xd2 servers coming in Q1 that we should make room for at the same time. Thanks, Willy


I have added a link to https://wikitech.wikimedia.org/wiki/Raid_setup#Dell_R740xd2 on the setup. One issue we found with the RAID is that only one disk device can be set as bootable at a time. For this kind of hardware, we want the first SSD device to be the one set as bootable, as otherwise the automatic recipe will not work. This means that, after setting the HDs in RAID6, we need to set "Operations > make bootable > Go" on the first SSD manually.
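For reference (the change described above was made from the PERC BIOS menu), the same bootable flag can in principle be inspected and set from a running OS with perccli, Dell's build of storcli; the controller, enclosure, and slot numbers below are assumptions.

  # Hedged sketch; /c0 and the e32:s24 address are placeholders for the real controller/enclosure/slot.

  # Show the controller layout: the non-RAID SSDs plus the RAID6 virtual drive.
  sudo perccli64 /c0 show

  # Mark the first (non-RAID) SSD as the controller's boot device,
  # so the automatic recipe has a bootable target.
  sudo perccli64 /c0/e32/s24 set bootdrive=on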

backup1004: A4 U9, port 1, cable ID #5320
backup1005: B4 U27, port 11, cable ID #5351
backup1006: C2 U15, port 21, cable ID #6011
backup1007: D7 U13, port 12, cable ID #3970

All are finished with on-site tasks; the RAID configuration was also completed.

RobH updated the task description.

Change 704158 had a related patch set uploaded (by RobH; author: RobH):

[operations/puppet@production] backup100[4567] setup params

https://gerrit.wikimedia.org/r/704158

Change 704158 merged by RobH:

[operations/puppet@production] backup100[4567] setup params

https://gerrit.wikimedia.org/r/704158

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['backup1004.eqiad.wmnet', 'backup1005.eqiad.wmnet', 'backup1006.eqiad.wmnet', 'backup1007.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202107122027_robh_11220.log.

These failed because the installer did not like the specified partition recipe, which was set by someone else, so I need to investigate what's up.

These failed because the installer did not like the specified partition recipe, which was set by someone else, so I need to investigate what's up.

Did you see my comment on T277327#7110254? backup-format.cfg should work as long as the above is taken into account and they have the same hw spec as the backup200X hosts, as those worked automatically for it.

I checked whether backup1004 already had the raid0 SSD set to bootable; it did, and I rebooted into the installer, where it worked.... I have no idea what kind of race condition is going on there, but if it doesn't happen again then it doesn't matter. Reimaging.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

backup1004.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202107132007_robh_18959_backup1004_eqiad_wmnet.log.

Ok, so in the installer, the error is:

Unable to install GRUB in /dev/sdb
Executing 'grub-install /dev/sdb' failed.
This is a fatal error.

copies of the installer logs:



Completed auto-reimage of hosts:

['backup1004.eqiad.wmnet']

Of which those FAILED:

['backup1004.eqiad.wmnet']

The partitioning is working "as expected" (it is not a partman problem); the issue is with the disks: I can only see an sda of "SSD" size and an sdb of "HD" size, while I would expect to see 3 disks: 2 non-RAID SSDs and 1 virtual RAID disk. While it wouldn't be surprising for drive letters to move around between models, disks disappearing or being different from the codfw ones is a weird case. My first suspicion would be an undetected "bad" disk, but given that I can see the same issue on backup1005, my guess is a difference from the codfw setup at RAID configuration time.

I will take backup1005, reboot to BIOS and confirm.
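As an aside, a quick way to double-check this kind of disk mismatch is from the debian-installer shell (Ctrl+Alt+F2) before partman runs; a minimal sketch, with whatever device names the installer happens to assign:

  # Hedged sketch, run inside the d-i BusyBox shell.

  # Which block devices does the installer actually see?
  cat /proc/partitions

  # Disk sizes help tell the SSDs apart from the large RAID6 virtual disk.
  fdisk -l 2>/dev/null | grep '^Disk /dev/sd'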

I confirm the issue is that there is a difference in the setup of the eqiad and codfw hosts. The eqiad ones have the SSDs configured as a virtual RAID disk in hardware (PERC controller), while on codfw they are "non-RAID disks" and we set up the software RAID at OS installation time.

Admittedly, this is a "weird" decision, as it has several drawbacks: no hot-plugging, overhead/performance, and a mixed OS and HW RAID configuration. The decision was that, for backups, performance and unavailability were not huge concerns (unlike for the databases), but reliability was a main concern. So we preferred to still handle the SSDs at the OS level; even though technically they are physically connected to the same RAID controller, it is easier to manage them directly from a logical point of view.

While we are still in time to reverse the decision if you don't think it is a good idea, for now I would prefer to keep the exact same configuration on all backup* hosts, even for those that don't have internal disks.

For that, could you modify the existing SSD setup in the PERC configuration, remove the HW RAID1, and convert those disks to "non-RAID disks"? That would make the partitioning work, as requested on this ticket ("Software RAID1 for (2) OS SSDs") and clarified in the documentation also linked on this ticket ("Software RAID 1 will be set on reimage, so those SSDs should show as "not part of a RAID" on the bios, nothing to do there"): https://wikitech.wikimedia.org/wiki/Raid_setup#Dell_R740xd2 I tried to be super clear about this, as it wasn't as clear in the beginning, but I think I failed again.

PS: I left backup1005 on the Perc menu, and won't touch the host further unless you tell me to, in order to not do conflicting operations.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

backup1004.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202107132206_robh_2571_backup1004_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['backup1004.eqiad.wmnet']

and were ALL successful.

...
For that, could you modify the existing SSD setup in the PERC configuration, remove the HW RAID1, and convert those disks to "non-RAID disks"?
...

Yeah, I suppose I didn't parse that correctly, since mixing SW and HW RAID within a host is a bit non-standard. I went ahead and did this for backup1004 and it's now staged and ready to go. I'll fix the remainder shortly.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['backup1005.eqiad.wmnet', 'backup1006.eqiad.wmnet', 'backup1007.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202107132244_robh_8707.log.

...
Admittedly, this is a "weird" decision, as it has several drawbacks: no hot-plugging, overhead/performance, and a mixed OS and HW RAID configuration. The decision was that, for backups, performance and unavailability were not huge concerns (unlike for the databases), but reliability was a main concern. So we preferred to still handle the SSDs at the OS level; even though technically they are physically connected to the same RAID controller, it is easier to manage them directly from a logical point of view.
...

You can still hot plug the disk bays; you just have to manually remove the disk from the mdadm array and manually add it back afterwards. The ability to hot swap without powering down doesn't go away when you put a disk on the PERC controller into non-RAID mode. I just wanted to correct that so you aren't moving forward thinking you've lost a feature of the chassis.
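As a rough sketch of that procedure (not something run on this task; /dev/md0 and the sdb2 member partition are assumed names): drop the member from the array before pulling the bay, then re-add the replacement once it has been repartitioned.

  # Hedged sketch; array and partition names are assumptions.

  # Mark the outgoing SSD's member as failed and remove it from the RAID1.
  sudo mdadm /dev/md0 --fail /dev/sdb2 --remove /dev/sdb2

  # ...swap the drive in the bay and recreate the partition table on the new disk...

  # Add the new member back; mdadm will resync the mirror.
  sudo mdadm /dev/md0 --add /dev/sdb2
  cat /proc/mdstat   # watch the rebuild progress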

Completed auto-reimage of hosts:

['backup1007.eqiad.wmnet', 'backup1005.eqiad.wmnet', 'backup1006.eqiad.wmnet']

and were ALL successful.

RobH updated the task description.

backup1006 has a HW failure and has been placed into the failed state in Netbox. While this task is being resolved, HW failure task T286625 has been filed for eqiad on-site staff to investigate the system board CP connector failure.

You can still hot plug the disk bays; you just have to manually remove the disk from the mdadm array

Thanks for the correction. I think I was basing that on previous experiences where the disk was not accessible because of a regular direct disk connection, or where the host crashed on disk loss.

Thanks for your work and that of your team on this!

One last question: backup1006, despite T286625, was set up successfully (in terms of install and Puppet), so no extra work will be needed after that is solved?