Immersion Datacenter Cooling: Future-Proofing

As chips and system designs continue to push boundaries with ever-higher thermal design power (TDP) ratings, managing the dissipated heat becomes increasingly challenging.

Current air-cooled rack designs typically top out at around 10-20kW per rack. Once you consider the TDPs of modern CPUs and GPUs, it quickly becomes apparent that the rack's power and cooling budget, not physical space, limits how densely you can populate it.
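
To make that concrete, here is a quick back-of-the-envelope sketch in Python; the 15kW rack budget and the per-server draws are illustrative assumptions on my part, not vendor figures.

    # Rough sketch: how quickly dense servers exhaust an air-cooled rack's budget.
    # The rack budget and server draws below are illustrative assumptions only.
    RACK_BUDGET_W = 15_000    # assumed mid-range air-cooled rack (10-20kW band)

    for server_w in (1_000, 5_000, 10_000):   # hypothetical per-server draws
        fits = RACK_BUDGET_W // server_w
        used = fits * server_w
        print(f"{server_w}W servers: {fits} per rack, {RACK_BUDGET_W - used}W stranded")

At 1kW per server the rack fills up comfortably; at 10kW per server you fit a single machine and strand a third of the budget, long before you run out of rack units.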

AKCP's thermal mapping solution - https://www.akcp.com/blog/thermal-mapping-for-data-centers/


A leaked Gigabyte roadmap reveals that most components are projected to double their power usage within the next 2-3 years. This increase extends beyond CPUs to networking components such as higher-bandwidth 400/800G switches, DPUs, FPGAs, and even optics.

From hothardware.com

AMD's MI300X isn't listed here, but its power usage has increased by 34% over the previous generation, from 560W to 750W.

While power consumption has historically been primarily a CPU concern, the rise of GPUs and their power requirements has accelerated the issue. Companies offering denser solutions have traditionally prioritized power efficiency over raw performance. For instance, Ampere's Altra processors, based on the same ARM architecture family found in smartphones and tablets, have typically operated at around 100W each; however, with AmpereOne, Ampere too is scaling up power (and core count), reaching 350W per CPU.

It's worth noting, however, that deploying these processors for enterprise and high-performance computing workloads, which commonly run in x86 environments, often requires significant retooling. Despite their higher core counts and efficiency gains, Ampere's chips have not yet achieved widespread market penetration.


Now, let's turn our attention to NVIDIA.


As of July 13, 2023, NVIDIA stands as a formidable player with a market value of $1.13 trillion. In a groundbreaking move, they have developed their own CPU based on ARM architecture, called Grace. Coupled with their Hopper GPU architecture, they have created a "superchip" capable of drawing up to 1000W of power.


This represents a significant leap in power consumption and subsequently the heat generated when compared to previous designs. NVIDIA's entrance into the CPU market with Grace and its pairing with Hopper GPUs signifies a major shift in the landscape of power draw and computational capabilities.


As the demand for more powerful and energy-intensive components continues to grow, datacenter operators and technology companies will face new challenges in managing heat dissipation. These developments highlight the need for innovative cooling solutions, such as immersion cooling, to effectively address the escalating power and thermal demands of advanced hardware configurations.

Let's take a look at the following examples (I've used Supermicro given my experience with them, certainly not because of their SKU naming convention!)

  1. The latest 'standard' rack (2U dual processor)
  2. Current A100 HGX
  3. Latest generation H100 HGX

And then let's talk about a GH200-based solution...


NB. I have quoted rated (nameplate) power figures to keep things high-level and easy to follow, rather than actual real-world power draws.


Standard Rack:

  • PSU (90% efficiency): 2x 1200W (1+1)
  • Max CPU: Dual Socket, up to 350W each
  • GPU: 4x PCIe 5.0 slots
  • Rack Units: 2U
  • Model: SYS-221H-TNR

In this configuration, with 2x 350W CPUs, you are left with less than 400W of usable power. Consequently, the 4 available GPU slots can realistically support only one A100 80GB GPU with a total board power of 300W. This leaves a theoretical ~100W for the rest of the system, such as additional IO, storage, RAM, fans, etc. For instance, a BlueField DPU consumes around 75W per card.
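
As a quick sanity check, here is that budget replayed in Python. It simply re-runs the numbers above (one active 1200W supply of the 1+1 pair at 90% efficiency, two 350W CPUs, one 300W A100); the exact remainder depends on where you apply the efficiency loss.

    # Back-of-the-envelope budget for the 2U dual-processor example above.
    # With 1+1 redundancy, only one 1200W supply counts toward the usable budget.
    psu_rated_w    = 1200
    psu_efficiency = 0.90
    usable_w = psu_rated_w * psu_efficiency   # ~1080W delivered to the system

    cpu_w        = 2 * 350                    # dual socket, 350W each
    after_cpus_w = usable_w - cpu_w           # ~380W, i.e. "less than 400W"

    a100_w      = 300                         # A100 80GB total board power
    remainder_w = after_cpus_w - a100_w       # roughly 80-100W for IO, RAM, fans, a DPU
    print(usable_w, after_cpus_w, remainder_w)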

It is also worth noting that Supermicro offers a D2C (Direct to Chip) Cold Plate as an optional extra for this server, which is an interesting feature.


A100 HGX:

  • PSU (90% efficiency): 2x 3000W (2+2)
  • Max CPU: Dual Socket, up to 280W each
  • GPU: 9x PCIe 5.0 slots
  • Rack Units: 4U
  • Model: 4124GO-NART

With double the rack space and roughly 4x the power headroom, using AMD EPYC CPUs rated at 280W each, you now have around 4500W available for 8x 300W GPUs running at full power. This configuration also leaves ample capacity for the 9th PCIe slot, as well as NVLink and NVSwitch.
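
The same arithmetic, sketched for the A100 HGX box. I am treating the 2x 3000W figure as the non-redundant capacity at 90% efficiency; this is a rough reading of the numbers above, not a measured figure.

    # Budget sketch for the 4U A100 HGX example (illustrative, not measured).
    usable_w = 2 * 3000 * 0.90     # ~5400W of non-redundant PSU capacity
    cpus_w   = 2 * 280             # dual EPYC at up to 280W each
    gpus_w   = 8 * 300             # eight A100s at full board power
    headroom_w = usable_w - cpus_w - gpus_w
    print(headroom_w)              # ample headroom for the 9th slot, NVLink/NVSwitch, fans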


H100 HGX:

  • PSU (90% efficiency): 3x 3000W (3+3)
  • Max CPU: Dual Socket, up to 400W each
  • GPU: 9x PCIe 5.0 slots
  • Rack Units: 8U
  • Model: AS-8125GS-TNHR

Doubling the rack space again to 8U, the power envelope per CPU increases by 42% (from 280W to 400W). This setup leaves approximately 7200W for 8x H100 GPUs, each drawing up to 700W. That allows a theoretical ~1600W for the rest of the system, roughly a 20% reduction in leftover system power compared with the A100 unit.
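
And the equivalent sketch for the H100 HGX figures, again treating 3x 3000W as the usable, non-redundant capacity; an assumption made purely for illustration.

    # Budget sketch for the 8U H100 HGX example (illustrative figures).
    usable_w = 3 * 3000 * 0.90     # ~8100W of non-redundant PSU capacity
    cpus_w   = 2 * 400             # dual socket, up to 400W each
    gpus_w   = 8 * 700             # eight H100s at up to 700W each
    system_w = usable_w - cpus_w - gpus_w
    print(system_w)                # ~1700W, in the same ballpark as the ~1600W above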


GH200:

NVIDIA's latest 'superchip,' the GH200, cannot yet be ordered from Supermicro, so instead we can use NVIDIA's HGX reference design.

It consists of two Grace Hopper blades housed in a 2U chassis. Each superchip, combining a 72-core ARM-based Grace CPU with an H100 GPU, can consume up to 1000W of power, so effectively 1U = 1000W for the compute alone.


As a reminder, an air-cooled rack can give you 10-20kW, and an immersion tank can give you over 100kW!
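
Putting rough numbers on that comparison, using the 1000W-per-rack-unit figure from above and taking 20kW and 100kW as illustrative rack and tank budgets:

    # Illustrative rack-level comparison for GH200 compute at ~1000W per U.
    W_PER_U          = 1000       # Grace Hopper compute power per 1U (from above)
    AIR_RACK_W       = 20_000     # upper end of a typical air-cooled rack
    IMMERSION_TANK_W = 100_000    # order-of-magnitude immersion tank figure

    print(AIR_RACK_W // W_PER_U, "U of GH200 compute per air-cooled rack")
    print(IMMERSION_TANK_W // W_PER_U, "U of GH200 compute per immersion tank")

Even at the generous end of air cooling, you cap out at around 20 units of GH200 compute per rack, versus roughly five times that in a single immersion tank.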


Even within this narrow focus, these examples showcase the challenges and opportunities that arise as TDPs increase. Immersion cooling solutions offer a promising avenue for future-proofing datacenter infrastructure, enabling efficient heat management even in the face of increasingly demanding chips and designs.

Thank you, again, for joining me on this exploration of immersion cooling, and I look forward to future discussions and advancements in this exciting field.


Asperitas Carbon-Z DataQube Global Data Center Frontier Dell Technologies Equinix FUCHS Group GIGABYTE Green Revolution Cooling Iceotope Technologies Limited Inflection AI LiquidStack NVIDIA Open Compute Project Foundation Sabey Data Centers Submer Supermicro Wyoming Hyperscale 

