Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34]
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of cloudcephosd10[25-34]

Hostname / Racking / Installation Details

Hostnames: cloudcephosd10[25-34], no distinction on which one is the one for experimentation
Racking Proposal: Wherever there is space under the cloudswitch switches (D5 or C8)
Networking/Subnet/VLAN/IP: Same as cloudcephosd1020.
Partitioning/Raid: Same as cloudcephosd1020
OS Distro: Buster (default unless otherwise specified)
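
For scripting against this batch, the bracket notation used for the hostnames can be expanded as in the sketch below. The helper name expand_range is hypothetical, and it only assumes a single numeric [start-end] range as written in this task; it is not how cumin or any other WMF tool implements it.

```python
import re

def expand_range(pattern: str) -> list:
    """Expand a single numeric [start-end] range, e.g. cloudcephosd10[25-34]."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        return [pattern]  # no range present: treat as a literal hostname
    prefix, start, end, suffix = match.groups()
    width = len(start)  # preserve zero padding such as [01-10]
    return [
        f"{prefix}{n:0{width}d}{suffix}"
        for n in range(int(start), int(end) + 1)
    ]

hosts = expand_range("cloudcephosd10[25-34]")
print(len(hosts), hosts[0], hosts[-1])
```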

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

cloudcephosd1025:

  • - receive in system on procurement task T291987 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage

cloudcephosd1026:

  • - receive in system on procurement task T291987 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage

cloudcephosd1027:

  • - receive in system on procurement task T291987 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage

cloudcephosd1028:

  • - receive in system on procurement task T291987 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage

cloudcephosd1029:

  • - receive in system on procurement task T291987 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage

cloudcephosd1030:

  • - receive in system on procurement task T291987 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage

cloudcephosd1031:

  • - receive in system on procurement task T291987 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage

cloudcephosd1032:

  • - receive in system on procurement task T291987 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage

cloudcephosd1033:

  • - receive in system on procurement task T291987 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage

cloudcephosd1034:

  • - receive in system on procurement task T291987 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage

Once the system(s) above have had all checkbox steps completed, this task can be resolved.

Event Timeline

RobH added a parent task: Unknown Object (Task).
RobH unsubscribed.

cloudcephosd1025 E4 U21
cloudcephosd1026 E4 U22
cloudcephosd1027 E4 U23
cloudcephosd1028 E4 U24
cloudcephosd1029 E4 U25
cloudcephosd1030 F4 U21
cloudcephosd1031 F4 U22
cloudcephosd1032 F4 U23
cloudcephosd1033 F4 U24
cloudcephosd1034 F4 U25

name rack Unit Port CableID Port CableID
cloudcephosd1025 e4 21u 21 20220102 ; 20 20220105
cloudcephosd1026 e4 22u 22 20220103 ; 21 20220107
cloudcephosd1027 e4 23u 23 20220100 ; 22 20220106
cloudcephosd1028 e4 24u 24 20220101 ; 23 20220110
cloudcephosd1029 e4 25u 25 20220104 ; 24 20220108
cloudcephosd1030 f4 21u 21 20220087 ; 20 20220081
cloudcephosd1031 f4 22u 22 20220075 ; 21 20220083
cloudcephosd1032 f4 23u 23 20220073 ; 22 20220074
cloudcephosd1033 f4 24u 24 20220084 ; 23 20220088
cloudcephosd1034 f4 25u 25 20220083 ; 24 20220095

This is blocked until vlans for these switches are ready

cmooney added a subscriber: Cmjohnson.

This requires the updated WMCS network design to be agreed / validated (T304989) after which we can quickly complete the actual device configuration (T304936). Once that is ready we can proceed with the server provisioning as normal.

cmooney added a subscriber: nskaggs.

@nskaggs I believe that to be the case yes. I've not been able to successfully reimage any of these though. I might be missing a step at this stage however.

@Cmjohnson can you confirm the current status of these servers? Are they powered on and ready for next steps? That should be do-able now.

Quick update - I've been trying to image cloudcephosd1025 to make sure all is ok, and completed some operations.

Not being completely au fait with the process I've stopped shy of attempting the reimage itself. Specifically there is one automated BIOS setting that failed, and all manual BIOS changes and firmware updates need to be performed. @Cmjohnson can you take care of those and then we can give the reimage a shot? Reimage should go ok but I've not tested it so might be some teething problems.

Detailed status:

  • - bios/drac/serial setup/testing

I ran "sudo cookbook sre.hosts.provision cloudcephosd1025" from cumin.

It managed to do most things, but setting the PXE boot order failed despite several 'retries':

Updated value for attribute BIOS.Setup.1-1 -> BiosBootSeq (marked Set On Import to True): NIC.PxeDevice.1-1, NIC.PxeDevice.2-1 => HardDisk.List.1-1,NIC.Slot.3-1-1

I 'skipped' this after which the cookbook completed.

Any additional / manual BIOS changes still need to be completed.

  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.

Correct as below I believe.

cmooney@bast1003:~$ dig +short -x 10.65.2.99 @10.3.0.1
wmf11412.mgmt.eqiad.wmnet.
cloudcephosd1025.mgmt.eqiad.wmnet.
cmooney@bast1003:~$ dig +short A wmf11412.mgmt.eqiad.wmnet @10.3.0.1
10.65.2.99
cmooney@bast1003:~$ dig +short A cloudcephosd1025.mgmt.eqiad.wmnet @10.3.0.1
10.65.2.99
  • - network port setup via netbox, run homer to commit

Added following adjustment of import script to allow for rack-specific cloud vlans.

Went as expected following that: https://netbox.wikimedia.org/dcim/devices/3980/interfaces/

  • - firmware update (idrac, bios, network, raid controller)

Will leave this to DC-Ops.

  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).

Existing netboot.cfg entry should cover the disk layout:

https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/install_server/files/autoinstall/netboot.cfg#193

As for puppet changes I'm not 100% sure what needs to be done at this stage. Are changes needed there before we can try the reimage?

  • - OS installation & initial puppet run via wmf-auto-reimage

Should go ok but do not want to proceed until I can confirm the BIOS is configured correctly and firmware is as it should be.

@Cmjohnson apologies I assigned this to you in error (blind as a bat), I see @Jclark-ctr actually did the previous work on these so re-assigning.

John I'm a little confused about the port allocations here, there seems to be some overlap?

For instance cloudcephosd1025 port 1 is listed as being connected to 0/0/21 on the switch, but cloudcephosd1026 port 2 is also listed as connected to that?

cloudcephosd1025 e4 21u 21 20220102 ; 20 20220105
cloudcephosd1026 e4 22u 22 20220103 ; 21 20220107

Most of the others have similar overlaps. I might just be reading it wrong though but either way if you could double check/clarify that'd be great.
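
A check like this can be scripted before the data goes into Netbox. The sketch below parses the E4 rows exactly as originally posted in this task and reports any switch port claimed by more than one host. The function name find_port_conflicts is hypothetical, and it assumes both ports of every host in a rack land on the same switch.

```python
from collections import defaultdict

# E4 rows as originally posted in this task:
# name rack unit port cable-id ; port cable-id
TABLE = """\
cloudcephosd1025 e4 21u 21 20220102 ; 20 20220105
cloudcephosd1026 e4 22u 22 20220103 ; 21 20220107
cloudcephosd1027 e4 23u 23 20220100 ; 22 20220106
cloudcephosd1028 e4 24u 24 20220101 ; 23 20220110
cloudcephosd1029 e4 25u 25 20220104 ; 24 20220108
"""

def find_port_conflicts(table: str) -> dict:
    """Map (rack, switch port) to its hosts, for ports claimed more than once."""
    claims = defaultdict(list)
    for line in table.strip().splitlines():
        first, second = line.split(";")
        host, rack, _unit, port1, _cable1 = first.split()
        port2, _cable2 = second.split()
        for port in (port1, port2):
            claims[(rack, int(port))].append(host)
    return {key: hosts for key, hosts in claims.items() if len(hosts) > 1}

for (rack, port), hosts in sorted(find_port_conflicts(TABLE).items()):
    print(f"{rack} port {port}: {', '.join(hosts)}")
```

An empty result would mean every port in the list is claimed by one host only.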

@cmooney Apologies for that, not sure how that changed when I copied it from Excel to here. I noticed a few other mistakes down the list. I am verifying F4 right now; it looks like it has some of the same mistakes.
name rack Unit Port CableID Port CableID
cloudcephosd1025 e4 20u 21 20220102 ; 21 20220105
cloudcephosd1026 e4 22u 22 20220103 ; 23 20220107
cloudcephosd1027 e4 23u 24 20220100 ; 25 20220106
cloudcephosd1028 e4 24u 26 20220101 ; 27 20220110
cloudcephosd1029 e4 25u 28 20220104 ; 29 20220108

@Jclark-ctr ok thanks for the clarification. I've only put the port details for 1025 and 1026 into Netbox so far, ports 21 and 22, so that's not changed which is good.

Thanks for clarifying. These should now be ready to go. I need to make sure re-image / DHCP works as expected (hopefully it does), but as I wasn't sure about the BIOS/firmware stuff I didn't try it on any of them.

cloudcephosd1030 f4 21u 20 20220087 ; 21 20220081
cloudcephosd1031 f4 22u 22 20220075 ; 23 20220083
cloudcephosd1032 f4 23u 24 20220073 ; 25 20220074
cloudcephosd1033 f4 24u 26 20220084 ; 27 20220088
cloudcephosd1034 f4 25u 28 20220083 ; 29 20220095

@Jclark-ctr I'm not really able to progress this. I was gonna try one reimage but given the disk / RAID config needs to be done, and I'm unsure of the other BIOS/firmware stuff I backed out in case I messed any of that up.

Should be no reason things can't proceed as normal from here out though. cloudcephosd1025 and cloudcephosd1026 have already had their ports added in Netbox, but need the other bits. Remainder of hosts should be able to add as normal in Netbox.

@Cmjohnson hey are you able to take care of the BIOS / RAID setup for these hosts? All should be ready for normal deploy anyway, John said you were the one who normally did those steps. Thanks.

@ayounsi lsw1-e4 and f4 do not show up as options in netbox in the provision network script.

Cmjohnson updated the task description.

Both ports have been updated in netbox, and the BIOS has been set up.

Change 808541 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] Add new cloudcephosd servers to site.pp

https://gerrit.wikimedia.org/r/808541

Change 808541 merged by Cmjohnson:

[operations/puppet@production] Add new cloudcephosd servers to site.pp

https://gerrit.wikimedia.org/r/808541

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1025.eqiad.wmnet with OS buster

@Andrew I am not sure which RAID configuration you need. I don't know what cloudcephosd1020 has going on, other than that I see a /dev/sda and /dev/sdb. Can you please let me know the configuration needed?

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1025.eqiad.wmnet with OS buster executed with errors:

  • cloudcephosd1025 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

@Andrew I am not sure which RAID configuration you need. I don't know what cloudcephosd1020 has going on, other than that I see a /dev/sda and /dev/sdb. Can you please let me know the configuration needed?

preseed says

cloudcephosd1*) echo partman/standard.cfg partman/raid1-2dev.cfg ;;

I would expect that to work for the new servers as well, as I thought these had the same hardware setup as previous hosts. The goal is to get a raid1 OS on the first two drives and leave everything else untouched for later ceph formatting.

I take it the wildcard rule above didn't work on these new hosts?
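
Whether the wildcard itself covers the new hostnames can be checked quickly with glob matching. This is only a sketch: it uses Python's fnmatch as a stand-in for the shell case pattern quoted above (the two agree for a simple trailing * like this one), and it assumes the rule is matched against the short hostname. It does not exercise the installer.

```python
from fnmatch import fnmatchcase

PATTERN = "cloudcephosd1*"  # pattern quoted from the preseed rule above
new_hosts = [f"cloudcephosd{n}" for n in range(1025, 1035)]

# Collect any new host that the wildcard would fail to cover.
unmatched = [host for host in new_hosts if not fnmatchcase(host, PATTERN)]
print("unmatched hosts:", unmatched)
```

If the list comes back empty, the pattern is not the reason a reimage fails and the cause has to be looked for elsewhere, for example in cabling or DHCP.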

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1025.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1025.eqiad.wmnet with OS buster executed with errors:

  • cloudcephosd1025 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1027.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1026.eqiad.wmnet with OS buster

@jclark cloudcephosd1025 states no cable, can you verify the cable and/or the port please

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1028.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1029.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1030.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1031.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1032.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1033.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1034.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1027.eqiad.wmnet with OS buster completed:

  • cloudcephosd1027 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206291307_cmjohnson_743095_cloudcephosd1027.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

@Jclark-ctr cloudcephosd1031 same thing, no cable, can you check this as well

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1031.eqiad.wmnet with OS buster executed with errors:

  • cloudcephosd1031 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1026.eqiad.wmnet with OS buster completed:

  • cloudcephosd1026 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206291320_cmjohnson_744652_cloudcephosd1026.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1028.eqiad.wmnet with OS buster completed:

  • cloudcephosd1028 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206291348_cmjohnson_751753_cloudcephosd1028.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1030.eqiad.wmnet with OS buster completed:

  • cloudcephosd1030 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206291351_cmjohnson_752151_cloudcephosd1030.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1034.eqiad.wmnet with OS buster completed:

  • cloudcephosd1034 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206291355_cmjohnson_754780_cloudcephosd1034.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1029.eqiad.wmnet with OS buster completed:

  • cloudcephosd1029 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206291350_cmjohnson_751960_cloudcephosd1029.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1033.eqiad.wmnet with OS buster completed:

  • cloudcephosd1033 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206291355_cmjohnson_754720_cloudcephosd1033.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1032.eqiad.wmnet with OS buster completed:

  • cloudcephosd1032 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206291354_cmjohnson_753471_cloudcephosd1032.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1031.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1025.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1025.eqiad.wmnet with OS buster completed:

  • cloudcephosd1025 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206292045_cmjohnson_840056_cloudcephosd1025.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1031.eqiad.wmnet with OS buster completed:

  • cloudcephosd1031 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206292043_cmjohnson_839855_cloudcephosd1031.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
Cmjohnson updated the task description.

Pinging @Andrew: this task has been resolved.

@Cmjohnson Hi!

While trying to set up the first of these hosts, we noticed that it has only 7 non-OS 1.8T drives, but the approved quote specifies 8 per host:

1.92TB SSD SATA Mix Use 6Gbps 512 2.5in Hot-plug AG Drive, 3 DWPD,400-AZTN 80

Looking at all the hosts, one has only 6 such drives detected by the OS, and the rest show 7.

Can you verify that each host actually has 8 drives, and that they are all plugged in?

Thanks!

The lsblk output from each of the hosts:

dcaro@cumin1001:~$ sudo cumin cloudcephosd10[25-34]\* lsblk
10 hosts will be targeted:
cloudcephosd[1025-1034].eqiad.wmnet
Ok to proceed on 10 hosts? Enter the number of affected hosts to confirm or "q" to quit 10
===== NODE GROUP =====
(1) cloudcephosd1030.eqiad.wmnet
----- OUTPUT of 'lsblk' -----
NAME           MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda              8:0    0 446.6G  0 disk
├─sda1           8:1    0   285M  0 part
└─sda2           8:2    0 446.4G  0 part
  └─md0          9:0    0 446.2G  0 raid1
    ├─vg0-root 253:0    0  74.5G  0 lvm   /
    ├─vg0-swap 253:1    0   976M  0 lvm   [SWAP]
    └─vg0-srv  253:2    0 281.5G  0 lvm   /srv
sdb              8:16   0   1.8T  0 disk
├─sdb1           8:17   0   285M  0 part
└─sdb2           8:18   0   1.8T  0 part
  └─md0          9:0    0 446.2G  0 raid1
    ├─vg0-root 253:0    0  74.5G  0 lvm   /
    ├─vg0-swap 253:1    0   976M  0 lvm   [SWAP]
    └─vg0-srv  253:2    0 281.5G  0 lvm   /srv
sdc              8:32   0   1.8T  0 disk
sdd              8:48   0   1.8T  0 disk
sde              8:64   0   1.8T  0 disk
sdf              8:80   0   1.8T  0 disk
sdg              8:96   0   1.8T  0 disk
sdh              8:112  0   1.8T  0 disk
===== NODE GROUP =====
(8) cloudcephosd[1026-1029,1031-1034].eqiad.wmnet
----- OUTPUT of 'lsblk' -----
NAME           MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda              8:0    0 446.6G  0 disk
├─sda1           8:1    0   285M  0 part
└─sda2           8:2    0 446.4G  0 part
  └─md0          9:0    0 446.2G  0 raid1
    ├─vg0-root 253:0    0  74.5G  0 lvm   /
    ├─vg0-swap 253:1    0   976M  0 lvm   [SWAP]
    └─vg0-srv  253:2    0 281.5G  0 lvm   /srv
sdb              8:16   0   1.8T  0 disk
├─sdb1           8:17   0   285M  0 part
└─sdb2           8:18   0   1.8T  0 part
  └─md0          9:0    0 446.2G  0 raid1
    ├─vg0-root 253:0    0  74.5G  0 lvm   /
    ├─vg0-swap 253:1    0   976M  0 lvm   [SWAP]
    └─vg0-srv  253:2    0 281.5G  0 lvm   /srv
sdc              8:32   0   1.8T  0 disk
sdd              8:48   0   1.8T  0 disk
sde              8:64   0   1.8T  0 disk
sdf              8:80   0   1.8T  0 disk
sdg              8:96   0   1.8T  0 disk
sdh              8:112  0   1.8T  0 disk
sdi              8:128  0   1.8T  0 disk
===== NODE GROUP =====
(1) cloudcephosd1025.eqiad.wmnet  <- this is the one we are trying to set up
----- OUTPUT of 'lsblk' -----
NAME                                                                                                  MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                                                                                                     8:0    0 446.6G  0 disk
├─sda1                                                                                                  8:1    0   285M  0 part
└─sda2                                                                                                  8:2    0 446.4G  0 part
  └─md0                                                                                                 9:0    0 446.2G  0 raid1
    ├─vg0-root                                                                                        253:0    0  74.5G  0 lvm   /
    ├─vg0-swap                                                                                        253:1    0   976M  0 lvm   [SWAP]
    └─vg0-srv                                                                                         253:2    0 281.5G  0 lvm   /srv
sdb                                                                                                     8:16   0   1.8T  0 disk
├─sdb1                                                                                                  8:17   0   285M  0 part
└─sdb2                                                                                                  8:18   0   1.8T  0 part
  └─md0                                                                                                 9:0    0 446.2G  0 raid1
    ├─vg0-root                                                                                        253:0    0  74.5G  0 lvm   /
    ├─vg0-swap                                                                                        253:1    0   976M  0 lvm   [SWAP]
    └─vg0-srv                                                                                         253:2    0 281.5G  0 lvm   /srv
sdc                                                                                                     8:32   0   1.8T  0 disk
└─ceph--a5dd49de--7ee1--4fe2--ad51--cb4138ca5b9f-osd--block--85d665ba--2088--40ee--b58f--ae6c8d0e2959 253:3    0   1.8T  0 lvm
sdd                                                                                                     8:48   0   1.8T  0 disk
└─ceph--c534c65d--a6f3--423b--bf17--2014915989d4-osd--block--592c5899--ccf8--45f0--810d--8b2800a4ebfb 253:4    0   1.8T  0 lvm
sde                                                                                                     8:64   0   1.8T  0 disk
└─ceph--3481f64f--ea97--405c--8ce6--7dec0895b097-osd--block--2f2888f5--9c26--4a98--8885--02680138bda9 253:5    0   1.8T  0 lvm
sdf                                                                                                     8:80   0   1.8T  0 disk
└─ceph--03a05dac--e853--4ebf--925e--3b6970d94a3c-osd--block--a5ea6a7e--ad7b--47cb--89a0--836cd7a9a7fc 253:6    0   1.8T  0 lvm
sdg                                                                                                     8:96   0   1.8T  0 disk
└─ceph--b1997071--e805--4623--910f--edbbd008f46c-osd--block--bbd092ca--faa4--42e6--9c8d--06aa27016e39 253:7    0   1.8T  0 lvm
sdh                                                                                                     8:112  0   1.8T  0 disk
└─ceph--aeea8a19--c131--4503--8a4f--a2609c394756-osd--block--8cf917e5--af82--4c67--a160--1993ccc271e4 253:8    0   1.8T  0 lvm
sdi                                                                                                     8:128  0   1.8T  0 disk
└─ceph--cf32777e--5b71--4001--b462--707ce5f181f9-osd--block--dfa9133d--a34c--4397--b048--35e9f21a7e3e 253:9    0   1.8T  0 lvm
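
The per-host drive counts quoted above (6 on cloudcephosd1030, 7 elsewhere) can be reproduced from the lsblk text with a small script. This is only a sketch: the function name count_non_os_disks and the SAMPLE excerpt (abbreviated from the cloudcephosd1030 output) are illustrative, and it assumes that a "non-OS" drive is a 1.8T disk with no partitions on it. That matches the layout shown here, where sda and sdb carry the OS partitions, but may not hold for other hosts.

```python
# Hypothetical helper: count non-OS data drives in the output of `lsblk`
# for a single host. Assumption: a disk is "non-OS" if it has the expected
# size and no "part" children (Ceph LVM volumes on a disk do not count as
# partitions, so drives already claimed by an OSD are still counted).

SAMPLE = """
NAME           MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda              8:0    0 446.6G  0 disk
├─sda1           8:1    0   285M  0 part
└─sda2           8:2    0 446.4G  0 part
  └─md0          9:0    0 446.2G  0 raid1
sdb              8:16   0   1.8T  0 disk
├─sdb1           8:17   0   285M  0 part
└─sdb2           8:18   0   1.8T  0 part
  └─md0          9:0    0 446.2G  0 raid1
sdc              8:32   0   1.8T  0 disk
sdd              8:48   0   1.8T  0 disk
sde              8:64   0   1.8T  0 disk
sdf              8:80   0   1.8T  0 disk
sdg              8:96   0   1.8T  0 disk
sdh              8:112  0   1.8T  0 disk
"""


def count_non_os_disks(lsblk_output, size="1.8T"):
    """Return how many disks of the given size have no partitions."""
    unpartitioned = {}  # disk name -> True until a partition is seen under it
    current = None      # the matching top-level disk we are currently inside
    for line in lsblk_output.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # blank lines and anything that is not an lsblk row
        name, dev_size, dev_type = fields[0], fields[3], fields[5]
        if line[0].isalnum():
            # Top-level rows start at column 0; child rows are indented
            # or begin with a tree-drawing character.
            if dev_type == "disk" and dev_size == size:
                current = name
                unpartitioned[current] = True
            else:
                current = None
        elif current is not None and dev_type == "part":
            unpartitioned[current] = False
    return sum(unpartitioned.values())


if __name__ == "__main__":
    print(count_non_os_disks(SAMPLE))
```

Run against the excerpt, this counts sdc through sdh and skips sdb, which is 1.8T but partitioned for the OS mirror. Comparing the result with the 8 drives per host in the quote gives the discrepancy described above.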