Q2:(Need By: TBD) rack/setup/install ganeti102[5-8]
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of ganeti102[5-8]

Hostname / Racking / Installation Details

Please note that the racking details were not fully provided on the parent task T291974; Rob had to make assumptions about the naming, the partitioning, and whether the hosts can share a rack.

Hostnames: ganeti102[5-8]
Racking Proposal: Ideally two in row A and two in row C. If either of the two rows is too full, adding instead two to row D is also fine. Rack-wise, please add them to different racks each.
Networking/Subnet/VLAN/IP: 10G network connection, private1 vlan
Partitioning/Raid: standard ganeti-raid5.cfg
OS Distro: Buster

Per host setup checklist

ganeti1025:

  • - receive in system on procurement task T291974 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

ganeti1026:

  • - receive in system on procurement task T291974 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

ganeti1027:

  • - receive in system on procurement task T291974 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

ganeti1028:

  • - receive in system on procurement task T291974 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.

Related Objects

Status    Subtype    Assigned    Task
Resolved             Cmjohnson

Event Timeline

RobH created this task.
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH mentioned this in Unknown Object (Task).
RobH renamed this task from (Need By: TBD) rack/setup/install ganeti102[56] to (Need By: TBD) rack/setup/install ganeti102[5-8]. Oct 22 2021, 5:02 PM
RobH added a parent task: Unknown Object (Task).
RobH updated the task description.
wiki_willy renamed this task from (Need By: TBD) rack/setup/install ganeti102[5-8] to Q2:(Need By: TBD) rack/setup/install ganeti102[5-8]. Oct 22 2021, 9:49 PM
Jclark-ctr subscribed.

ganeti1025 a2 u41 cableid#1208202101 port36
ganeti1026 a7 u13 cableid#1208202102 port35
ganeti1027 c4 u13 cableid#1208202103 port27
ganeti1028 c7 u20 cableid#1208202104 port17
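
As a quick sanity check, the rack assignments above can be compared against the racking proposal in the task description (two hosts per row in rows A and C, row D as an acceptable fallback, and a different rack for every host). The following is a minimal sketch only: the `check_racking` helper and the dictionary layout are illustrative assumptions, not part of any Netbox or DC-Ops tooling.

```python
from collections import Counter

# Rack locations as reported in the comment above (row letter + rack number).
racking = {
    "ganeti1025": "a2",
    "ganeti1026": "a7",
    "ganeti1027": "c4",
    "ganeti1028": "c7",
}

def check_racking(assignments):
    """Return a list of deviations from the racking proposal (empty if none)."""
    problems = []
    # Every host should sit in its own rack.
    if len(set(assignments.values())) != len(assignments):
        problems.append("two or more hosts share a rack")
    # Rows A and C are preferred, row D is the fallback; two hosts per row.
    rows = Counter(location[0].upper() for location in assignments.values())
    if not set(rows) <= {"A", "C", "D"}:
        problems.append("a host is racked outside rows A, C and D")
    if any(count != 2 for count in rows.values()):
        problems.append("hosts are not split two per row")
    return problems

print(check_racking(racking))
```

With the four locations listed above, the helper reports no deviations.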

Change 747860 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] Adding new ganeti hosts to site.pp insetup role

https://gerrit.wikimedia.org/r/747860

Change 747860 merged by Cmjohnson:

[operations/puppet@production] Adding new ganeti hosts to site.pp insetup role

https://gerrit.wikimedia.org/r/747860

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1025.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1026.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1027.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1028.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1027.eqiad.wmnet with OS buster executed with errors:

  • ganeti1027 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1028.eqiad.wmnet with OS buster executed with errors:

  • ganeti1028 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1025.eqiad.wmnet with OS buster executed with errors:

  • ganeti1025 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1026.eqiad.wmnet with OS buster executed with errors:

  • ganeti1026 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1025.eqiad.wmnet with OS buster

@Volans These servers will not install correctly. I noticed that they have embedded 1G NICs as the primary NIC, and I suspect the cookbook picked up the wrong MAC address. What we need in this instance is the 10G port (NIC in Slot 3 Port 1: Broadcom Adv. Dual 10Gb Ethernet - E4:3D:1A:7A:CA:40), and we need the Legacy Boot Protocol set to PXE. This is a pretty common setup, but it's not always consistent.

Is there a way to fix the DHCP issue manually when this happens? This is the case for all 4 servers on the task.

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1025.eqiad.wmnet with OS buster executed with errors:

  • ganeti1025 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by volans@cumin1001 for host ganeti1025.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by volans@cumin1001 for host ganeti1025.eqiad.wmnet with OS buster executed with errors:

  • ganeti1025 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

@Cmjohnson as explained in detail in https://wikitech.wikimedia.org/wiki/Server_Lifecycle/Reimage#DHCP_Automation there is no MAC address involvement at all for the DHCP automation of the physical server's primary NIC. The automation is based on DHCP Option 82.
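
To illustrate the point, here is a generic sketch of a lookup keyed on relay information rather than on the client MAC address. DHCP Option 82 is the Relay Agent Information option, which the relaying switch adds to the request. The table layout, the `RelayInfo` fields and the switch name below are hypothetical and are not taken from the linked page or from the actual DHCP configuration.

```python
from typing import Dict, NamedTuple, Optional

class RelayInfo(NamedTuple):
    """Information a relaying switch adds via Option 82 (illustrative fields)."""
    switch: str
    port: str

# Hypothetical reservation table: (switch, port) -> hostname.
RESERVATIONS: Dict[RelayInfo, str] = {
    RelayInfo("example-switch-a2", "port36"): "ganeti1025",
}

def host_for_request(relay: RelayInfo, mac: str) -> Optional[str]:
    """Pick the host from where the request entered the network.

    The mac argument is accepted only to show that it plays no part
    in the lookup.
    """
    return RESERVATIONS.get(relay)

relay = RelayInfo("example-switch-a2", "port36")
print(host_for_request(relay, "00:00:00:00:00:01"))
print(host_for_request(relay, "00:00:00:00:00:02"))
```

In this sketch both calls resolve to the same host, because only the switch and port are consulted. A wrong MAC address therefore cannot be the cause of a failure in a scheme of this kind.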

Have you checked by any chance the output of the management console during the installation when it was failing?
I can't exclude a race condition, as there were 8 reimages running at the same time, started within a few minutes of each other, but I haven't found any evidence of one in the logs I've checked so far.

I've re-run it on ganeti1025 and DHCP works fine: the host boots into PXE and then gets stuck at the disk partitioning step.
As far as I can tell that's because it doesn't match any host in the netboot config:

ganeti[12]009|ganeti101[0-9]|ganeti102[0-4]|ganeti201[0-9]|ganeti202[0-8]|ganeti-test200[1-3]) echo partman/custom/ganeti-raid5.cfg ;; \

The above doesn't match 1025/6/7/8 and I don't see any other line that matches.
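
The mismatch can be reproduced with a small sketch. It assumes the netboot entry is a shell `case` pattern matched against the short hostname, and it uses Python's `fnmatch` as a stand-in for shell glob matching, which is equivalent for the simple bracket expressions used here. The `matches_recipe` helper is illustrative only.

```python
from fnmatch import fnmatchcase

# Alternatives copied from the netboot case line quoted above.
PATTERNS = (
    "ganeti[12]009|ganeti101[0-9]|ganeti102[0-4]|"
    "ganeti201[0-9]|ganeti202[0-8]|ganeti-test200[1-3]"
).split("|")

def matches_recipe(hostname):
    """True if any alternative of the case pattern matches the hostname."""
    return any(fnmatchcase(hostname, pattern) for pattern in PATTERNS)

new_hosts = [f"ganeti{number}" for number in range(1025, 1029)]
unmatched = [host for host in new_hosts if not matches_recipe(host)]
print(unmatched)
```

All four new hosts end up in `unmatched`, while an existing host such as ganeti1024 still matches `ganeti102[0-4]`.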

In addition, I noticed some potentially suboptimal settings in the BIOS/NIC config. They should not affect the above, but they might still be worth fixing:

  • The boot order seems to prefer the embedded NIC over the external 10G one: HardDisk.List.1-1, NIC.Embedded.1-1-1, NIC.Slot.3-1-1
  • PXE boot is enabled on both NIC.Embedded.1-1-1 (internal 1G port 1) and NIC.Slot.3-1-1 (external 10G port 1). My understanding is that it should be enabled only on one interface.
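
The two observations above can be expressed as a simple check. This is a sketch under the assumption that the boot sequence and the PXE-enabled NICs are already available as plain lists of device names. How they are read from the BIOS or iDRAC is out of scope, and `check_boot_config` is a hypothetical helper.

```python
def check_boot_config(boot_order, pxe_enabled, primary_nic):
    """Return findings about NIC boot order and PXE settings (empty if fine)."""
    findings = []
    # The first NIC in the boot sequence should be the intended primary NIC.
    nics = [device for device in boot_order if device.startswith("NIC.")]
    if nics and nics[0] != primary_nic:
        findings.append(f"boot order prefers {nics[0]} over {primary_nic}")
    # PXE should be enabled on the primary NIC and on no other interface.
    if primary_nic not in pxe_enabled:
        findings.append(f"PXE is not enabled on {primary_nic}")
    extra = sorted(set(pxe_enabled) - {primary_nic})
    if extra:
        findings.append("PXE is also enabled on: " + ", ".join(extra))
    return findings

# Values as observed in the list above.
boot_order = ["HardDisk.List.1-1", "NIC.Embedded.1-1-1", "NIC.Slot.3-1-1"]
pxe_enabled = ["NIC.Embedded.1-1-1", "NIC.Slot.3-1-1"]

for finding in check_boot_config(boot_order, pxe_enabled, "NIC.Slot.3-1-1"):
    print(finding)
```

For the observed values, the check reports both the boot order preference and the extra PXE-enabled interface.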

Change 749256 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] adding ganeti servers to netboot.cfg

https://gerrit.wikimedia.org/r/749256

Change 749256 merged by Cmjohnson:

[operations/puppet@production] adding ganeti servers to netboot.cfg

https://gerrit.wikimedia.org/r/749256

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1025.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1026.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1027.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1028.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1027.eqiad.wmnet with OS buster executed with errors:

  • ganeti1027 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1027.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1025.eqiad.wmnet with OS buster executed with errors:

  • ganeti1025 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1025.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1026.eqiad.wmnet with OS buster executed with errors:

  • ganeti1026 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1026.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1028.eqiad.wmnet with OS buster executed with errors:

  • ganeti1028 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1028.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1025.eqiad.wmnet with OS buster completed:

  • ganeti1025 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112212029_cmjohnson_15237_ganeti1025.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1027.eqiad.wmnet with OS buster executed with errors:

  • ganeti1027 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1028.eqiad.wmnet with OS buster executed with errors:

  • ganeti1028 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1026.eqiad.wmnet with OS buster executed with errors:

  • ganeti1026 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1027.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1026.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1028.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1027.eqiad.wmnet with OS buster completed:

  • ganeti1027 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112212113_cmjohnson_24829_ganeti1027.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1026.eqiad.wmnet with OS buster completed:

  • ganeti1026 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112212114_cmjohnson_24997_ganeti1026.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1028.eqiad.wmnet with OS buster completed:

  • ganeti1028 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112212117_cmjohnson_25301_ganeti1028.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
Cmjohnson updated the task description.

DC-Ops work is finished.

Mentioned in SAL (#wikimedia-operations) [2022-01-20T13:51:00Z] <moritzm> enabled hardware virtualisation in BIOS for ganeti1025 T293909

Mentioned in SAL (#wikimedia-operations) [2022-01-20T14:56:34Z] <moritzm> enabled hardware virtualisation in BIOS for ganeti1026 T293909

Mentioned in SAL (#wikimedia-operations) [2022-01-20T15:05:22Z] <moritzm> enabled hardware virtualisation in BIOS for ganeti1027 T293909

Mentioned in SAL (#wikimedia-operations) [2022-01-20T15:12:44Z] <moritzm> enabled hardware virtualisation in BIOS for ganeti1028 T293909

Mentioned in SAL (#wikimedia-operations) [2022-01-21T15:50:14Z] <moritzm> added ganeti1025 to Ganeti eqiad cluster T293909

Change 756950 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Make ganeti1027 a Ganeti node

https://gerrit.wikimedia.org/r/756950

Change 756950 merged by Muehlenhoff:

[operations/puppet@production] Make ganeti1027 a Ganeti node

https://gerrit.wikimedia.org/r/756950

Mentioned in SAL (#wikimedia-operations) [2022-01-27T09:53:11Z] <moritzm> added ganeti1027 to Ganeti eqiad cluster T293909

Mentioned in SAL (#wikimedia-operations) [2022-01-27T14:39:29Z] <moritzm> added ganeti1028 to Ganeti eqiad cluster T293909