Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev
Closed, Resolved, Public

Description

This task will track the racking, setup, and OS installation of cloudcontrol100[8-10]-dev and cloudnet100[7-8]-dev.

Hostname / Racking / Installation Details

Hostnames: cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev
Racking Proposal: WMCS racks. Ideally separate rows for each cloudcontrol, and separate rows for each cloudnet (can share a row with a cloudcontrol). Can also share row with other cloudcontrols (cloudcontrol1005-1007), just not each other.
Networking Setup: 1 connection, 10G. cloud-hosts vlan (additional trunked vlans to be added by netops later)
Partitioning/Raid: Software RAID "partman/standard.cfg partman/raid10-4dev.cfg"
OS Distro: Bookworm
Sub-team Technical Contact: @aborrero @Andrew
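
The bracketed shorthand in the Hostnames line can be expanded mechanically. The following is a minimal sketch, not the tooling used on this task: `expand_hosts` is a hypothetical helper, and it assumes the range is written with the full number inside the brackets (for example `cloudcontrol[1008-1010]-dev`), because the informal `100[8-10]` form used in the task title would not expand literally.

```python
import re

# Hypothetical helper (not part of any WMF tooling): expands a single
# "prefix[lo-hi]suffix" range into individual hostnames.
_RANGE = re.compile(
    r"^(?P<prefix>[^\[\]]*)\[(?P<lo>\d+)-(?P<hi>\d+)\](?P<suffix>[^\[\]]*)$"
)


def expand_hosts(pattern: str) -> list[str]:
    """Return the hostnames covered by one bracketed numeric range."""
    match = _RANGE.match(pattern)
    if match is None:
        # No range in the pattern: treat it as a single hostname.
        return [pattern]
    lo, hi = int(match["lo"]), int(match["hi"])
    if lo > hi:
        raise ValueError(f"empty range in {pattern!r}")
    return [f"{match['prefix']}{n}{match['suffix']}" for n in range(lo, hi + 1)]


# The five hosts tracked by this task, written with full numbers in the brackets.
hosts = expand_hosts("cloudcontrol[1008-1010]-dev") + expand_hosts("cloudnet[1007-1008]-dev")
print("\n".join(hosts))
```

Each resulting hostname gets its own copy of the per-host checklist.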

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

cloudcontrol1008-dev:
  • - receive in system on procurement task T341246 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).

[] - OS installation & initial puppet run via sre.hosts.reimage cookbook.

cloudcontrol1009-dev:
  • - receive in system on procurement task T341246 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).

[] - OS installation & initial puppet run via sre.hosts.reimage cookbook.

cloudcontrol1010-dev:
  • - receive in system on procurement task T341246 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).

[] - OS installation & initial puppet run via sre.hosts.reimage cookbook.

cloudnet1007-dev:
  • - receive in system on procurement task T341246 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).

[] - OS installation & initial puppet run via sre.hosts.reimage cookbook.

cloudnet1008-dev:
  • - receive in system on procurement task T341246 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).

[] - OS installation & initial puppet run via sre.hosts.reimage cookbook.

Event Timeline

RobH mentioned this in Unknown Object (Task).
RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH unsubscribed.

cloudcontrol1008: rack D5, U38; port 6, cable ID 230304500102; port 36, cable ID 230304500268
cloudcontrol1009: rack E4, U40; port 12, cable ID 230304500142; port 13, cable ID 230304500264
cloudcontrol1010: rack F4, U40; port 12, cable ID 230304500151; port 13, cable ID 230304500148
cloudnet1007: rack E4, U39; port 10, cable ID 230304500168; port 11, cable ID 230304500166
cloudnet1008: rack F4, U39; port 10, cable ID 230304500296; port 11, cable ID 230304500153

aborrero removed a subtask: Unknown Object (Task). Sep 18 2023, 4:11 PM

These servers are going to be part of the eqiad2dev deployment, and should get the -dev suffix on them, for example cloudcontrol1008-dev.

nskaggs renamed this task from Q1:rack/setup/install cloudcontrol100[8-10] cloudnet100[7-8] to Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev. Sep 26 2023, 7:42 PM
nskaggs updated the task description.

Wouldn't it make sense to start at 1001-dev? (Otherwise it seems that 1007-dev should exist, or the other pool will jump directly to 1011.)

It's recommended that existing names not be reused. See https://wikitech.wikimedia.org/wiki/SRE/Infrastructure_naming_conventions. I don't think a prefix or suffix would change that (ala cloudcontrol1001 vs cloudcontrol1001-dev). That said, exceptions are possible and adding the -dev would allow it to be a "unique" name.

All that said, I think it's possible that someday the -dev suffix is removed from these machines. Do you agree? If we think that's a possibility, does that change anything?

@cmooney I am having issues with racks E4 and F4. These are cloud public vlan racks in the new cage, and WMCS had asked for these hosts not to share racks.

Failure
cloudcontrol1009-dev (WMF11302): unable to find VLAN with name public1-e-eqiad or public1-e4-eqiad, skipping.

cloudnet1007: rack E4, U39; port 10, cable ID 230304500168; port 11, cable ID 230304500166
cloudnet1008: rack F4, U39; port 10, cable ID 230304500296; port 11, cable ID 230304500153

Unfortunately I think it's advisable that we move these nodes. If they will act as egress point for traffic from compute hosts it's better one sits in C8 and one in D5. If they go in the LEAF racks then you'll get traffic flows like:

F4 -> C8/D5 -> E4 -> C8/D5 -> Internet

Rather than:

F4 -> C8/D5 -> Internet
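
The two flows above can be reproduced with a small rack-level sketch. It assumes, based only on this comment, that C8 and D5 are the racks where traffic leaves for the Internet and that E4 and F4 are LEAF racks that reach everything else through C8/D5. The function name and the combined `C8/D5` hop label are illustrative, not taken from any real tooling.

```python
# Assumption from the comment above: C8 and D5 carry the uplink; any other
# rack (E4, F4) is a LEAF rack that reaches the rest of the network via C8/D5.
UPLINK_RACKS = {"C8", "D5"}
UPLINK_HOP = "C8/D5"


def egress_path(compute_rack: str, cloudnet_rack: str) -> list[str]:
    """Rack-level hops from a compute host to the Internet via a cloudnet."""
    path = [compute_rack]
    if compute_rack not in UPLINK_RACKS:
        # Traffic from a LEAF rack first has to go up to C8/D5.
        path.append(UPLINK_HOP)
    if cloudnet_rack not in UPLINK_RACKS and cloudnet_rack != compute_rack:
        # Egress node in another LEAF rack: down to that rack, then back up.
        path += [cloudnet_rack, UPLINK_HOP]
    path.append("Internet")
    return path


print(" -> ".join(egress_path("F4", "E4")))  # F4 -> C8/D5 -> E4 -> C8/D5 -> Internet
print(" -> ".join(egress_path("F4", "C8")))  # F4 -> C8/D5 -> Internet
```

The extra down-and-up detour in the first path is the pattern the comment recommends avoiding by placing the egress nodes in C8 and D5.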

Could we please make sure we have the -dev suffix in them? Otherwise we will need to rename them later.

Unfortunately I think it's advisable that we move these nodes. If they will act as egress point for traffic from compute hosts it's better one sits in C8 and one in D5. If they go in the LEAF racks then you'll get traffic flows like:

This can be confusing. These hosts are going to be used to bootstrap a k8s-based openstack PoC. I don't think these hosts are going to be dedicated to networking, but generic k8s workers that could run whatever pods (maybe including the neutron-l3-agent?).

In any case, this deployment won't be customer-facing and I would not worry too much about their position in the network.

These hosts are going to be used to bootstrap a k8s-based openstack PoC. I don't think these hosts are going to be dedicated to networking, but generic k8s workers that could run whatever pods (maybe including the neutron-l3-agent?).

Ok. Let's discuss how the networking works in our upcoming meeting. We certainly want to avoid a design that results in valley routing under normal circumstances. And we shouldn't do a POC we know is set up differently from what we want in production.

In the meantime can I request we call these something else? "cloudnet" seems confusing if they are not going to be dedicated network nodes.

Networking Setup: 2 connections, 10G. public1-*-eqiad

This is incorrect.

All these hosts should have a single connection to a cloudsw (so racks c8/d5/e4/f4), connection to 'cloud-hosts' vlan. Details here:

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Network#Datacenter_network

Please follow the other instructions for each host, but there is no need to do the last step (reimage). Some manual netbox changes are needed before that to add the additional vlan; ping me on IRC and I can do that.

@Jclark-ctr these hosts are for a new proof-of-concept cloud openstack deployment. As such the rules on the vlans don't really apply.

I think there is a good argument to name the cloudnets something else.

For now I think for all 4 of these hosts we just need the generic cloud host setup: single link, with untagged vlan of "cloud-hosts" and tagged vlan of "cloud-private".
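
As a rough illustration of that "generic cloud host setup", the check below models a host's switch ports and verifies the single-link, untagged `cloud-hosts`, tagged `cloud-private` layout. `HostPort` and `is_generic_cloud_host` are hypothetical names for this sketch and do not correspond to Netbox objects or any existing script.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class HostPort:
    """Illustrative model of one server-facing switch port (not a Netbox object)."""

    host: str
    untagged_vlan: str
    tagged_vlans: frozenset[str] = field(default_factory=frozenset)


def is_generic_cloud_host(ports: list[HostPort]) -> bool:
    """True for a single link with untagged cloud-hosts and tagged cloud-private."""
    if len(ports) != 1:
        return False
    port = ports[0]
    return (
        port.untagged_vlan == "cloud-hosts"
        and port.tagged_vlans == frozenset({"cloud-private"})
    )


# The layout described in the comment above.
wanted = [HostPort("cloudcontrol1008-dev", "cloud-hosts", frozenset({"cloud-private"}))]
# The layout from the original description (two links on a public vlan).
rejected = [
    HostPort("cloudcontrol1008-dev", "public1-e-eqiad"),
    HostPort("cloudcontrol1008-dev", "public1-e-eqiad"),
]
print(is_generic_cloud_host(wanted), is_generic_cloud_host(rejected))
```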

I've set the vlans up in netbox the way they are needed for all 4 and pushed the config to servers so we are good to go on that front.

These hosts have four drives and will be fine with just one SW RAID, so "partman/standard.cfg partman/raid10-4dev.cfg" looks right to me.

Note that I also changed the distro to Bookworm. We're currently upgrading all our existing hosts to Bookworm.

Change 970788 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add new cloudclontrol and cloudnet to site.pp and netboot.cfg

https://gerrit.wikimedia.org/r/970788

Change 970788 merged by Papaul:

[operations/puppet@production] Add new cloudclontrol and cloudnet to site.pp and netboot.cfg

https://gerrit.wikimedia.org/r/970788