⚓ T294972 Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34]

	Subject	Repo	Branch	Lines +/-
	Add new cloudcephosd servers to site.pp	operations/puppet	production	+5 -0

Status	Subtype	Assigned	Task
Resolved		cmooney	T304989 Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F
Resolved		cmooney	T304936 Configure cloudsw1-e4-eqiad and cloudsw1-f4-eqiad
			Unknown Object (Task)
Resolved		• Cmjohnson	T294972 Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34]
Resolved		fnegri	T314870 Setup cloudcephosd10[25-34] into the ceph eqiad cluster
Resolved		• Cmjohnson	T315221 cloudcephosd10[25-34] Missing/unplugged hard drives
Resolved		cmooney	T315446 Allow jumbo frames between cloud hosts in production realm
Resolved	BUG REPORT	Jclark-ctr	T316673 hw troubleshooting: one disk not working properly in cloudcephosd1034.eqiad.wmnet
Resolved	Request	Jclark-ctr	T317127 hw troubleshooting: power supply alert for cloudcephosd1031.eqiad.wmnet
Resolved	BUG REPORT	fnegri	T317219 Ceph cookbook fails on node with empty children list
Declined		fnegri	T318680 Ceph cookbook fails when checking jumbo frames
Resolved		fnegri	T318723 Ceph cookbook fails waiting for OSDs to show up

RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board. Nov 3 2021, 7:49 PM

RobH added a parent task: Unknown Object (Task).

RobH unsubscribed.

Maintenance_bot added a project: SRE. Nov 3 2021, 8:45 PM

• nskaggs moved this task from Backlog to Racking / Decom on the cloud-services-team (Hardware) board. Jan 5 2022, 7:27 PM

Jclark-ctr updated the task description. Feb 9 2022, 7:37 PM

cloudcephosd1025 E4 U21
cloudcephosd1026 E4 U22
cloudcephosd1027 E4 U23
cloudcephosd1028 E4 U24
cloudcephosd1029 E4 U25
cloudcephosd1030 F4 U21
cloudcephosd1031 F4 U22
cloudcephosd1032 F4 U23
cloudcephosd1033 F4 U24
cloudcephosd1034 F4 U25

dcaro mentioned this in T297083: [ceph] Getting rack level HA. Mar 4 2022, 3:08 PM

dcaro subscribed. Mar 4 2022, 3:17 PM

name rack Unit Port CableID Port CableID
cloudcephosd1025 e4 21u 21 20220102 ; 20 20220105
cloudcephosd1026 e4 22u 22 20220103 ; 21 20220107
cloudcephosd1027 e4 23u 23 20220100 ; 22 20220106
cloudcephosd1028 e4 24u 24 20220101 ; 23 20220110
cloudcephosd1029 e4 25u 25 20220104 ; 24 20220108
cloudcephosd1030 f4 21u 21 20220087 ; 20 20220081
cloudcephosd1031 f4 22u 22 20220075 ; 21 20220083
cloudcephosd1032 f4 23u 23 20220073 ; 22 20220074
cloudcephosd1033 f4 24u 24 20220084 ; 23 20220088
cloudcephosd1034 f4 25u 25 20220083 ; 24 20220095

Jclark-ctr reassigned this task from Jclark-ctr to • Cmjohnson. Mar 10 2022, 4:54 PM

Jclark-ctr subscribed.

cmooney subscribed. Mar 29 2022, 7:27 PM

• Cmjohnson moved this task from Racking Tasks to Blocked on the ops-eqiad board. Apr 7 2022, 8:04 PM

This is blocked until vlans for these switches are ready

cmooney added a parent task: T304936: Configure cloudsw1-e4-eqiad and cloudsw1-f4-eqiad. Apr 28 2022, 6:10 PM

This requires the updated WMCS network design to be agreed / validated (T304989) after which we can quickly complete the actual device configuration (T304936). Once that is ready we can proceed with the server provisioning as normal.

@nskaggs I believe that to be the case yes. I've not been able to successfully reimage any of these though. I might be missing a step at this stage however.

@Cmjohnson can you confirm the current status of these servers? Are they powered on and ready for next steps? That should be do-able now.

cmooney updated the task description. May 26 2022, 2:37 PM

Quick update - I've been trying to image cloudcephosd1025 to make sure all is ok, and completed some operations.

Not being completely au fait with the process I've stopped shy of attempting the reimage itself. Specifically there is one automated BIOS setting that failed, and all manual BIOS changes and firmware updates need to be performed. @Cmjohnson can you take care of those and then we can give the reimage a shot? Reimage should go ok but I've not tested it so might be some teething problems.

Detailed status:

- bios/drac/serial setup/testing

I ran "sudo cookbook sre.hosts.provision cloudcephosd1025" from cumin.

It managed to do most things, but setting the PXE boot order failed despite several 'retries':

Updated value for attribute BIOS.Setup.1-1 -> BiosBootSeq (marked Set On Import to True): NIC.PxeDevice.1-1, NIC.PxeDevice.2-1 => HardDisk.List.1-1,NIC.Slot.3-1-1

I 'skipped' this after which the cookbook completed.

Any additional / manual BIOS changes still need to be completed.

- add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.

Correct as below I believe.

cmooney@bast1003:~$ dig +short -x 10.65.2.99 @10.3.0.1
wmf11412.mgmt.eqiad.wmnet.
cloudcephosd1025.mgmt.eqiad.wmnet.
cmooney@bast1003:~$ dig +short A wmf11412.mgmt.eqiad.wmnet @10.3.0.1
10.65.2.99
cmooney@bast1003:~$ dig +short A cloudcephosd1025.mgmt.eqiad.wmnet @10.3.0.1
10.65.2.99
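
The check being done by eye here is that every name returned by the reverse lookup resolves back to the same address. A minimal sketch of that comparison, using only the records shown in the dig output; the function name and data layout are illustrative assumptions, not part of any WMF tooling:

```python
# Records copied from the dig output above; the helper itself is hypothetical.
PTR_RECORDS = {
    "10.65.2.99": [
        "wmf11412.mgmt.eqiad.wmnet.",
        "cloudcephosd1025.mgmt.eqiad.wmnet.",
    ],
}
A_RECORDS = {
    "wmf11412.mgmt.eqiad.wmnet": "10.65.2.99",
    "cloudcephosd1025.mgmt.eqiad.wmnet": "10.65.2.99",
}


def find_dns_mismatches(ptr_records, a_records):
    """Return (ip, hostname, resolved) tuples where the PTR and A records disagree."""
    mismatches = []
    for ip, names in ptr_records.items():
        for name in names:
            hostname = name.rstrip(".")  # PTR answers carry a trailing dot
            resolved = a_records.get(hostname)
            if resolved != ip:
                mismatches.append((ip, hostname, resolved))
    return mismatches


print(find_dns_mismatches(PTR_RECORDS, A_RECORDS))  # prints []
```

An empty list means the asset-tag and hostname entries both point back at the management IP, which matches the "correct as below" reading.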

- network port setup via netbox, run homer to commit

Added following adjustment of import script to allow for rack-specific cloud vlans.

Went as expected following that: https://netbox.wikimedia.org/dcim/devices/3980/interfaces/

- firmware update (idrac, bios, network, raid controller)

Will leave this to DC-Ops.

- operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).

Existing netboot.cfg entry should cover the disk layout:

https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/install_server/files/autoinstall/netboot.cfg#193

As for puppet changes I'm not 100% sure what needs to be done at this stage. Are changes needed there before we can try the reimage?

- OS installation & initial puppet run via wmf-auto-reimage

Should go ok but do not want to proceed until I can confirm the BIOS is configured correctly and firmware is as it should be.

@Cmjohnson apologies I assigned this to you in error (blind as a bat), I see @Jclark-ctr actually did the previous work on these so re-assigning.

John I'm a little confused about the port allocations here, there seems to be some overlap?

For instance cloudcephosd1025 port 1 is listed as being connected to 0/0/21 on the switch, but cloudcephosd1026 port 2 is also listed as connected to that?

cloudcephosd1025 e4 21u 21 20220102 ; 20 20220105
cloudcephosd1026 e4 22u 22 20220103 ; 21 20220107

Most of the others have similar overlaps. I might just be reading it wrong though but either way if you could double check/clarify that'd be great.
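
The overlap can be confirmed mechanically. A rough sketch, assuming the row layout of the table posted earlier (name, rack, unit, port, cable ID, `;`, port, cable ID); the function name is made up for illustration and the rows are the E4 entries exactly as originally posted:

```python
from collections import defaultdict

# E4 rows as originally posted in this task (before the correction).
ORIGINAL_E4 = """\
cloudcephosd1025 e4 21u 21 20220102 ; 20 20220105
cloudcephosd1026 e4 22u 22 20220103 ; 21 20220107
cloudcephosd1027 e4 23u 23 20220100 ; 22 20220106
cloudcephosd1028 e4 24u 24 20220101 ; 23 20220110
cloudcephosd1029 e4 25u 25 20220104 ; 24 20220108
"""


def find_port_overlaps(table):
    """Map (rack, port) to the hosts claiming it, keeping only shared ports."""
    hosts_by_port = defaultdict(list)
    for line in table.strip().splitlines():
        name, rack, _unit, port_a, _cable_a, _sep, port_b, _cable_b = line.split()
        for port in (port_a, port_b):
            hosts_by_port[(rack, int(port))].append(name)
    return {key: hosts for key, hosts in hosts_by_port.items() if len(hosts) > 1}


for (rack, port), hosts in sorted(find_port_overlaps(ORIGINAL_E4).items()):
    print(rack, port, hosts)
```

On these rows, ports 21 through 24 each come back with two hosts, which is the overlap described above.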

@cmooney Apologies for that. I'm not sure how that changed when I copied it from Excel to here. I noticed a few other mistakes down the list. I am verifying F4 right now; it looks like it has some of the same mistakes.
name rack Unit Port CableID Port CableID
cloudcephosd1025 e4 20u 21 20220102 ; 21 20220105
cloudcephosd1026 e4 22u 22 20220103 ; 23 20220107
cloudcephosd1027 e4 23u 24 20220100 ; 25 20220106
cloudcephosd1028 e4 24u 26 20220101 ; 27 20220110
cloudcephosd1029 e4 25u 28 20220104 ; 29 20220108

@Jclark-ctr ok thanks for the clarification. I've only put the port details for 1025 and 1026 into Netbox so far, ports 21 and 22, so that's not changed which is good.

Thanks for clarifying. These should now be ready to go. I need to make sure re-image / DHCP works as expected, hopefully it does, but as I wasn't sure on the BIOS/firmware stuff I didn't try it on any of them.

cloudcephosd1030 f4 21u 20 20220087 ; 21 20220081
cloudcephosd1031 f4 22u 22 20220075 ; 23 20220083
cloudcephosd1032 f4 23u 24 20220073 ; 25 20220074
cloudcephosd1033 f4 24u 26 20220084 ; 27 20220088
cloudcephosd1034 f4 25u 28 20220083 ; 29 20220095

Jclark-ctr reassigned this task from Jclark-ctr to cmooney. May 26 2022, 9:12 PM

@Jclark-ctr I'm not really able to progress this. I was gonna try one reimage but given the disk / RAID config needs to be done, and I'm unsure of the other BIOS/firmware stuff I backed out in case I messed any of that up.

Should be no reason things can't proceed as normal from here out though. cloudcephosd1025 and cloudcephosd1026 have already had their ports added in Netbox, but need the other bits. Remainder of hosts should be able to add as normal in Netbox.

@Cmjohnson hey are you able to take care of the BIOS / RAID setup for these hosts? All should be ready for normal deploy anyway, John said you were the one who normally did those steps. Thanks.

wiki_willy reassigned this task from cmooney to • Cmjohnson. Jun 17 2022, 10:40 PM

wiki_willy moved this task from Blocked to Racking Tasks on the ops-eqiad board.

@ayounsi lsw1-e4 and f4 do not show up as options in netbox in the provision network script.

@Cmjohnson they're named "cloudsw1-e4/f4"

• Cmjohnson updated the task description. Jun 24 2022, 3:08 PM

Both ports have been updated in netbox, bios has been setup

Change 808541 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] Add new cloudcephosd servers to site.pp

https://gerrit.wikimedia.org/r/808541

Change 808541 merged by Cmjohnson:

[operations/puppet@production] Add new cloudcephosd servers to site.pp

https://gerrit.wikimedia.org/r/808541

Maintenance_bot removed a project: Patch-For-Review. Jun 26 2022, 10:30 PM

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1025.eqiad.wmnet with OS buster

@Andrew I am not sure which raid configuration you need. I don't know what cloudcephosd1020 has going other than that I see a /dev/sda and /dev/sdb. Can you please let me know the configuration needed?

• Cmjohnson updated the task description. Jun 26 2022, 11:00 PM

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1025.eqiad.wmnet with OS buster executed with errors:

cloudcephosd1025 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

• Cmjohnson moved this task from Racking Tasks to Blocked on the ops-eqiad board. Jun 27 2022, 12:55 AM

In T294972#8027892, @Cmjohnson wrote:

@Andrew I am not sure which raid configuration you need. I don't know what cloudcephosd1020 has going other than that I see a /dev/sda and /dev/sdb. Can you please let me know the configuration needed?

preseed says

cloudcephosd1*) echo partman/standard.cfg partman/raid1-2dev.cfg ;;

I would expect that to work for the new servers as well, as I thought these had the same hardware setup as previous hosts. The goal is to get a raid1 OS on the first two drives and leave everything else untouched for later ceph formatting.

I take it the wildcard rule above didn't work on these new hosts?
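
Whether the wildcard covers the new hostnames can be sanity-checked offline. A small sketch using Python's `fnmatch` as an approximation of the shell `case` glob in the preseed line quoted above; the rule list and helper name are illustrative, and only the one pattern from this comment is included:

```python
from fnmatch import fnmatchcase

# Single rule copied from the preseed line quoted above; the real netboot.cfg
# has many more entries that are not reproduced here.
NETBOOT_RULES = [
    ("cloudcephosd1*", "partman/standard.cfg partman/raid1-2dev.cfg"),
]


def recipe_for(hostname, rules=NETBOOT_RULES):
    """Return the partman recipe of the first matching glob, or None."""
    for pattern, recipe in rules:
        if fnmatchcase(hostname, pattern):
            return recipe
    return None


new_hosts = [f"cloudcephosd{n}" for n in range(1025, 1035)]
unmatched = [host for host in new_hosts if recipe_for(host) is None]
print(unmatched)  # prints []
```

Under that approximation every host from cloudcephosd1025 to cloudcephosd1034 matches the rule, so a reimage failure would more likely come from somewhere other than the pattern itself.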

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1025.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1025.eqiad.wmnet with OS buster executed with errors:

cloudcephosd1025 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1027.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1026.eqiad.wmnet with OS buster

@jclark cloudcephosd1025 states no cable, can you verify the cable and/or the port please

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1028.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1029.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1030.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1031.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1032.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1033.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1034.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1027.eqiad.wmnet with OS buster completed:

cloudcephosd1027 (PASS)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh buster OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206291307_cmjohnson_743095_cloudcephosd1027.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> staged
- Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

@Jclark-ctr cloudcephosd1031 same thing, no cable, can you check this as well

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1031.eqiad.wmnet with OS buster executed with errors:

cloudcephosd1031 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1026.eqiad.wmnet with OS buster completed:

cloudcephosd1026 (PASS)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh buster OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206291320_cmjohnson_744652_cloudcephosd1026.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> staged
- Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1028.eqiad.wmnet with OS buster completed:

cloudcephosd1028 (PASS)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh buster OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206291348_cmjohnson_751753_cloudcephosd1028.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> staged
- Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1030.eqiad.wmnet with OS buster completed:

cloudcephosd1030 (WARN)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh buster OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206291351_cmjohnson_752151_cloudcephosd1030.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> staged
- Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1034.eqiad.wmnet with OS buster completed:

cloudcephosd1034 (PASS)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh buster OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206291355_cmjohnson_754780_cloudcephosd1034.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> staged
- Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1029.eqiad.wmnet with OS buster completed:

cloudcephosd1029 (PASS)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh buster OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206291350_cmjohnson_751960_cloudcephosd1029.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> staged
- Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1033.eqiad.wmnet with OS buster completed:

cloudcephosd1033 (PASS)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh buster OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206291355_cmjohnson_754720_cloudcephosd1033.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> staged
- Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1032.eqiad.wmnet with OS buster completed:

cloudcephosd1032 (PASS)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh buster OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206291354_cmjohnson_753471_cloudcephosd1032.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> staged
- Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

• Cmjohnson updated the task description. Jun 29 2022, 5:54 PM

• Cmjohnson moved this task from Blocked to Racking Tasks on the ops-eqiad board.

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1031.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1025.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1025.eqiad.wmnet with OS buster completed:

cloudcephosd1025 (WARN)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh buster OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206292045_cmjohnson_840056_cloudcephosd1025.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> staged
- Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1031.eqiad.wmnet with OS buster completed:

cloudcephosd1031 (WARN)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh buster OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206292043_cmjohnson_839855_cloudcephosd1031.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> staged
- Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
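
With several attempts per host, the final state is easier to read once the status lines are collapsed to the latest result for each host. A minimal sketch, assuming status lines of the form `hostname (PASS|WARN|FAIL)` as they appear in the cookbook reports above; the sample covers only a few of those lines, in the order they were posted, and the helper is not part of the cookbook:

```python
import re

# A subset of the status lines posted above, in chronological order.
REPORT_LINES = """\
cloudcephosd1025 (FAIL)
cloudcephosd1027 (PASS)
cloudcephosd1031 (FAIL)
cloudcephosd1025 (WARN)
cloudcephosd1031 (WARN)
"""

STATUS_RE = re.compile(r"^(cloudcephosd\d+) \((PASS|WARN|FAIL)\)$")


def latest_status(report_text):
    """Keep the last reported status per host; later lines override earlier ones."""
    latest = {}
    for line in report_text.splitlines():
        match = STATUS_RE.match(line.strip())
        if match:
            host, status = match.groups()
            latest[host] = status
    return latest


for host, status in sorted(latest_status(REPORT_LINES).items()):
    print(host, status)
```

For the sample, cloudcephosd1025 and cloudcephosd1031 end at WARN after their earlier failures, and cloudcephosd1027 stays at PASS.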

pinging @Andrew to notify the task has been resolved

dcaro added a subtask: T314870: Setup cloudcephosd10[25-34] into the ceph eqiad cluster. Aug 9 2022, 3:29 PM

fnegri changed the status of subtask T314870: Setup cloudcephosd10[25-34] into the ceph eqiad cluster from Open to In Progress. Aug 9 2022, 3:31 PM

@Cmjohnson Hi!

While trying to setup the first of the hosts here, we noticed that it had only 7 1.8T non-os hard drives, but in the approved quote it's supposed to be 8 per host:

1.92TB SSD SATA Mix Use 6Gbps 512 2.5in Hot-plug AG Drive, 3 DWPD,400-AZTN 80

Looking then at all the hosts, there's one with 6 drives detected by the OS, and the rest show 7.

Can you verify that they have actually 8 drives each, and that they are plugged in?

Thanks!

The lsblk output from each of the hosts:

dcaro@cumin1001:~$ sudo cumin cloudcephosd10[25-34]\* lsblk
10 hosts will be targeted:
cloudcephosd[1025-1034].eqiad.wmnet
Ok to proceed on 10 hosts? Enter the number of affected hosts to confirm or "q" to quit 10
===== NODE GROUP =====
(1) cloudcephosd1030.eqiad.wmnet
----- OUTPUT of 'lsblk' -----
NAME           MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda              8:0    0 446.6G  0 disk
├─sda1           8:1    0   285M  0 part
└─sda2           8:2    0 446.4G  0 part
  └─md0          9:0    0 446.2G  0 raid1
    ├─vg0-root 253:0    0  74.5G  0 lvm   /
    ├─vg0-swap 253:1    0   976M  0 lvm   [SWAP]
    └─vg0-srv  253:2    0 281.5G  0 lvm   /srv
sdb              8:16   0   1.8T  0 disk
├─sdb1           8:17   0   285M  0 part
└─sdb2           8:18   0   1.8T  0 part
  └─md0          9:0    0 446.2G  0 raid1
    ├─vg0-root 253:0    0  74.5G  0 lvm   /
    ├─vg0-swap 253:1    0   976M  0 lvm   [SWAP]
    └─vg0-srv  253:2    0 281.5G  0 lvm   /srv
sdc              8:32   0   1.8T  0 disk
sdd              8:48   0   1.8T  0 disk
sde              8:64   0   1.8T  0 disk
sdf              8:80   0   1.8T  0 disk
sdg              8:96   0   1.8T  0 disk
sdh              8:112  0   1.8T  0 disk
===== NODE GROUP =====
(8) cloudcephosd[1026-1029,1031-1034].eqiad.wmnet
----- OUTPUT of 'lsblk' -----
NAME           MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda              8:0    0 446.6G  0 disk
├─sda1           8:1    0   285M  0 part
└─sda2           8:2    0 446.4G  0 part
  └─md0          9:0    0 446.2G  0 raid1
    ├─vg0-root 253:0    0  74.5G  0 lvm   /
    ├─vg0-swap 253:1    0   976M  0 lvm   [SWAP]
    └─vg0-srv  253:2    0 281.5G  0 lvm   /srv
sdb              8:16   0   1.8T  0 disk
├─sdb1           8:17   0   285M  0 part
└─sdb2           8:18   0   1.8T  0 part
  └─md0          9:0    0 446.2G  0 raid1
    ├─vg0-root 253:0    0  74.5G  0 lvm   /
    ├─vg0-swap 253:1    0   976M  0 lvm   [SWAP]
    └─vg0-srv  253:2    0 281.5G  0 lvm   /srv
sdc              8:32   0   1.8T  0 disk
sdd              8:48   0   1.8T  0 disk
sde              8:64   0   1.8T  0 disk
sdf              8:80   0   1.8T  0 disk
sdg              8:96   0   1.8T  0 disk
sdh              8:112  0   1.8T  0 disk
sdi              8:128  0   1.8T  0 disk
===== NODE GROUP =====
(1) cloudcephosd1025.eqiad.wmnet  <- this is the one we are trying to set up
----- OUTPUT of 'lsblk' -----
NAME                                                                                                  MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                                                                                                     8:0    0 446.6G  0 disk
├─sda1                                                                                                  8:1    0   285M  0 part
└─sda2                                                                                                  8:2    0 446.4G  0 part
  └─md0                                                                                                 9:0    0 446.2G  0 raid1
    ├─vg0-root                                                                                        253:0    0  74.5G  0 lvm   /
    ├─vg0-swap                                                                                        253:1    0   976M  0 lvm   [SWAP]
    └─vg0-srv                                                                                         253:2    0 281.5G  0 lvm   /srv
sdb                                                                                                     8:16   0   1.8T  0 disk
├─sdb1                                                                                                  8:17   0   285M  0 part
└─sdb2                                                                                                  8:18   0   1.8T  0 part
  └─md0                                                                                                 9:0    0 446.2G  0 raid1
    ├─vg0-root                                                                                        253:0    0  74.5G  0 lvm   /
    ├─vg0-swap                                                                                        253:1    0   976M  0 lvm   [SWAP]
    └─vg0-srv                                                                                         253:2    0 281.5G  0 lvm   /srv
sdc                                                                                                     8:32   0   1.8T  0 disk
└─ceph--a5dd49de--7ee1--4fe2--ad51--cb4138ca5b9f-osd--block--85d665ba--2088--40ee--b58f--ae6c8d0e2959 253:3    0   1.8T  0 lvm
sdd                                                                                                     8:48   0   1.8T  0 disk
└─ceph--c534c65d--a6f3--423b--bf17--2014915989d4-osd--block--592c5899--ccf8--45f0--810d--8b2800a4ebfb 253:4    0   1.8T  0 lvm
sde                                                                                                     8:64   0   1.8T  0 disk
└─ceph--3481f64f--ea97--405c--8ce6--7dec0895b097-osd--block--2f2888f5--9c26--4a98--8885--02680138bda9 253:5    0   1.8T  0 lvm
sdf                                                                                                     8:80   0   1.8T  0 disk
└─ceph--03a05dac--e853--4ebf--925e--3b6970d94a3c-osd--block--a5ea6a7e--ad7b--47cb--89a0--836cd7a9a7fc 253:6    0   1.8T  0 lvm
sdg                                                                                                     8:96   0   1.8T  0 disk
└─ceph--b1997071--e805--4623--910f--edbbd008f46c-osd--block--bbd092ca--faa4--42e6--9c8d--06aa27016e39 253:7    0   1.8T  0 lvm
sdh                                                                                                     8:112  0   1.8T  0 disk
└─ceph--aeea8a19--c131--4503--8a4f--a2609c394756-osd--block--8cf917e5--af82--4c67--a160--1993ccc271e4 253:8    0   1.8T  0 lvm
sdi                                                                                                     8:128  0   1.8T  0 disk
└─ceph--cf32777e--5b71--4001--b462--707ce5f181f9-osd--block--dfa9133d--a34c--4397--b048--35e9f21a7e3e 253:9    0   1.8T  0 lvm
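
The per-host drive counts quoted above can be reproduced from this output by counting whole disks that carry no partitions, on the assumption that partitioned disks are the OS drives. A rough sketch that parses the plain-text `lsblk` layout shown here; the function name is hypothetical and the sample is the cloudcephosd1030 block copied from above:

```python
# lsblk output for cloudcephosd1030, copied from the paste above.
LSBLK_1030 = """\
NAME           MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda              8:0    0 446.6G  0 disk
├─sda1           8:1    0   285M  0 part
└─sda2           8:2    0 446.4G  0 part
  └─md0          9:0    0 446.2G  0 raid1
    ├─vg0-root 253:0    0  74.5G  0 lvm   /
    ├─vg0-swap 253:1    0   976M  0 lvm   [SWAP]
    └─vg0-srv  253:2    0 281.5G  0 lvm   /srv
sdb              8:16   0   1.8T  0 disk
├─sdb1           8:17   0   285M  0 part
└─sdb2           8:18   0   1.8T  0 part
  └─md0          9:0    0 446.2G  0 raid1
sdc              8:32   0   1.8T  0 disk
sdd              8:48   0   1.8T  0 disk
sde              8:64   0   1.8T  0 disk
sdf              8:80   0   1.8T  0 disk
sdg              8:96   0   1.8T  0 disk
sdh              8:112  0   1.8T  0 disk
"""


def unpartitioned_disks(lsblk_text):
    """Return names of TYPE=disk devices that have no 'part' children."""
    has_partitions = {}
    current = None
    for line in lsblk_text.splitlines():
        fields = line.split()
        if len(fields) < 6 or fields[0] == "NAME":
            continue  # skip the header and any short or blank lines
        name, dev_type = fields[0], fields[5]
        if dev_type == "disk":
            current = name
            has_partitions[current] = False
        elif dev_type == "part" and current is not None:
            has_partitions[current] = True
    return sorted(name for name, parts in has_partitions.items() if not parts)


print(len(unpartitioned_disks(LSBLK_1030)))  # prints 6
```

Disks that already hold a ceph LVM volume, as on cloudcephosd1025, still count as unpartitioned here because their children are of type `lvm` rather than `part`.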

dcaro mentioned this in T314870: Setup cloudcephosd10[25-34] into the ceph eqiad cluster. Aug 15 2022, 11:10 AM

dcaro closed subtask T315221: cloudcephosd10[25-34] Missing/unplugged hard drives as Resolved. Aug 17 2022, 1:15 PM

fnegri closed subtask T314870: Setup cloudcephosd10[25-34] into the ceph eqiad cluster as Resolved. Oct 5 2022, 1:26 PM

• nskaggs mentioned this in T324998: Q3:rack/setup/install cloudcephosd10(3[5-9]|40). May 2 2023, 7:27 PM

Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34]
Closed, Resolved (Public)