Memory upgrade request for prometheus200[56]
Closed, Resolved · Public

Description

Quote/Hardware Request & Specifications

Hello! Due to the work in T350592, T359633, SLO onboarding and other metrics initiatives, we’re seeing significant metrics growth and increased resource utilization on the prometheus hosts. To help cope, we would like to plan an in-place vertical scale-up of the eqiad/codfw prometheus hardware hosts by way of memory upgrades.

To recap the current setup: today the prometheus systems run 192G of memory at mixed speeds, with dual Xeon(R) Silver 4114 CPUs whose maximum supported memory speed is 2400MHz.

Current memory slot layout:
A1 32GB DDR4 3200 (HMA84GR7DJR4N-XN or 36ASF4G72PZ-3G2J3)
A2 32GB DDR4 3200 (HMA84GR7DJR4N-XN or 36ASF4G72PZ-3G2J3)
A3 32GB DDR4 2666 (HMA84GR7DJR4N-VK or M393A4K40BB1-CRC)

B1 32GB DDR4 3200 (HMA84GR7DJR4N-XN or 36ASF4G72PZ-3G2J3)
B2 32GB DDR4 3200 (HMA84GR7DJR4N-XN or 36ASF4G72PZ-3G2J3)
B3 32GB DDR4 2666 (HMA84GR7DJR4N-VK or M393A4K40BB1-CRC)
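
As a quick sanity check of the figures above, here is a minimal sketch that totals the current layout and estimates the speed the memory would actually run at. The names `current_layout`, `total_capacity_gb` and `effective_speed` are illustrative only, and the speed rule (slowest DIMM, capped by the CPU limit) is a general rule of thumb rather than something confirmed for these hosts.

```python
# Current per-host DIMM population, copied from the slot list above.
# Each value is (size_gb, rated_speed).
current_layout = {
    "A1": (32, 3200),
    "A2": (32, 3200),
    "A3": (32, 2666),
    "B1": (32, 3200),
    "B2": (32, 3200),
    "B3": (32, 2666),
}

# Max memory speed of the Xeon Silver 4114 as quoted in this task.
CPU_MAX_MEMORY_SPEED = 2400


def total_capacity_gb(layout):
    """Sum the size of every populated slot."""
    return sum(size for size, _speed in layout.values())


def effective_speed(layout, cpu_max=CPU_MAX_MEMORY_SPEED):
    """Assumption: memory clocks down to the slowest DIMM, capped by the CPU."""
    slowest_dimm = min(speed for _size, speed in layout.values())
    return min(slowest_dimm, cpu_max)


print(total_capacity_gb(current_layout))  # 192
print(effective_speed(current_layout))    # 2400
```

Under that assumption the CPU limit, not the DIMM rating, sets the running speed, which is consistent with the request framing speed matching as a simplicity/compatibility measure.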

We’d like to upgrade each of these hosts to 384G of RAM and, at the same time, move to a speed-matched (for simplicity/compatibility) and balanced DIMM layout for best performance. Based on my understanding of https://www.dell.com/support/manuals/en-us/poweredge-r440/per440_ism_pub/general-memory-module-installation-guidelines?guid=guid-acbc0f13-dedb-492b-a0b0-18303ded565a&lang=en-us this would translate to a final config of:

Proposed memory layout:
6x 32G 3200 DIMMs in slots A1-A6
6x 32G 3200 DIMMs in slots B1-B6

Taking into account the 4x 32G DDR4-3200 sticks already present in each server, this works out to something like the following steps (per host):

  • Obtain/purchase 8x 32GB DDR4 3200 DIMMs (of matching spec to the existing 4x DDR4 32GB 3200 DIMMs)
  • Take downtime on one server at a time
  • Remove the 2x DDR4 32GB 2666 DIMMs in slots A3 and B3 (for spares/discard)
  • Install 4 new DDR4 32GB 3200 sticks in slots A3-A6
  • Install 4 new DDR4 32GB 3200 sticks in slots B3-B6
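
The per-host steps above can be cross-checked by diffing the current and proposed layouts. This is only a sketch: `current`, `proposed` and `plan_upgrade` are hypothetical names, and the layouts are transcribed from this task rather than read from the hardware.

```python
# Current and proposed per-host layouts, as (size_gb, rated_speed) per slot.
current = {
    "A1": (32, 3200), "A2": (32, 3200), "A3": (32, 2666),
    "B1": (32, 3200), "B2": (32, 3200), "B3": (32, 2666),
}
# Proposed: 32G DDR4-3200 in every slot A1-A6 and B1-B6.
proposed = {f"{bank}{n}": (32, 3200) for bank in "AB" for n in range(1, 7)}


def plan_upgrade(current, proposed):
    """Return (slots to empty, slots to fill) to get from current to proposed."""
    # A populated slot must be emptied if its DIMM differs from the target.
    remove = sorted(s for s, dimm in current.items() if proposed.get(s) != dimm)
    # A target slot needs a new DIMM if it is empty or holds a different one.
    install = sorted(s for s, dimm in proposed.items() if current.get(s) != dimm)
    return remove, install


remove, install = plan_upgrade(current, proposed)
print(remove)        # ['A3', 'B3']
print(len(install))  # 8
print(sum(size for size, _speed in proposed.values()))  # 384
```

The result matches the request: two 2666 DIMMs come out, eight new 3200 DIMMs go in, and each host ends up at 384G.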

Please double check me on this proposal and let me know if I can answer any questions, thanks!

Need By Date

Earliest reasonable date (non-emergency)

Budget Details

Add to Q3/Q4 expendables on the Upcoming Procurement Gsheet.

Refresh / Replacement / Expanding / New Service

Upgrading prometheus hosts in place

Hostname / Racking / Installation Details

Coordinate hardware installation with Keith Herron for scheduling of host downtime.

Quote Review

This section will list/link to each quote for review.

Order Details

This section will be updated to list the order details.

Event Timeline

@Papaul & @Jhancock.wm - was this one completed already via a different task?

T354685 looks like it was upgraded in January, but this task was created afterwards on March 25. @herron - do you still need this request done?

I didn't realize it at the time, but codfw got 192GB per host in T354685, whereas a week later eqiad got 384GB per host in T354684.

If we can bring prometheus200[56] up to 384GB each as well (i.e. add 192GB per host), that would be great; is that something you have on hand, @Papaul @Jhancock.wm?

I don't have any new ones on hand, but I can pull some of the extra DIMMs from the decommissioned servers. I'll leave enough to make sure the servers being recycled are still operational, if that's what we want to do. Sound good?

That sounds great, thank you @Jhancock.wm !

I got the ram out of the servers. I can do the addition tomorrow morning at 8am CDT (1600 UTC) or another time that works for you.

Indeed, @herron kindly volunteered to assist you with this

Would 10a Central / 11a Eastern tomorrow morning work?

This was completed yesterday (it happened during the stashbot outage, so this task unfortunately missed the !log).