Evolution of Data Center Networking Designs and Systems for AI Infrastructure – Part 1

Traffic patterns between applications that run on data center server CPUs, along with their quality of service requirements, drive the design and evolution of the networking in servers and in the switches that connect them. Thanks to breakneck advances in Artificial Intelligence (AI) with Large Language Models (LLMs) [1] and Recommender systems [2], GPUs acting as AI accelerators in data center servers are driving networking system scale and performance requirements more than ever before.

This article is written at a level suitable for product managers.  It covers the significant changes occurring in data center network designs as a result of modern AI training and inference applications, focusing on the use of popular off-the-shelf GPUs from merchant silicon vendors like NVIDIA and AMD.  The resulting impacts on the evolution of networking silicon and system-level features are discussed, highlighting key observations as food for thought.  This article, presented in several smaller, easily readable parts, concludes in the final part with my perspectives on how networking silicon and system designs in this space may further evolve.

The Evolution of AI Networking

East-west traffic generated by new applications in data center server CPUs ushered in the age of high-bandwidth, high-radix Ethernet switching silicon and systems. Software-defined and storage networking moved substantially more infrastructure processing load into servers, burdening the CPU and starving applications.  These changes heralded the era of the SmartNIC or Data Processing Unit (DPU) solutions used in data center servers.  Remote DMA (RDMA) [3]-capable networking solutions had modest beginnings, limited to relatively small InfiniBand or RDMA over Converged Ethernet (RoCE) [4] based storage and high-performance computing (HPC) clusters.

Figure 1: Evolution of the data center infrastructure with the rise of GPU compute for AI

Of late, thanks to tremendous advances in AI and skyrocketing GPU-to-GPU networking needs, the use of RDMA has grown massively [5] and the term has become a household name in the data center industry. The left-hand side of figure 1 shows a traditional data center design with CPU servers.  With the growth of AI applications and the use of GPUs in servers, the traditional data center network comprising Ethernet switches and network adapters (or NICs) used in servers has become the “frontend network,” as shown on the right-hand side of figure 1.  This network handles the movement of data among modern applications that run on CPUs and storage appliances (the east-west traffic), and data to other data centers or the Internet (the north-south traffic). A new network, called the “backend network,” has evolved with the sole purpose of handling data movement between GPUs.  RDMA is a foundational requirement in this network.

GPUs processing AI training and inference workloads move orders of magnitude more data among themselves than CPUs ever have.  In fact, the volumes are so large and the quality of service requirements so stringent that the paradigm of the backend network is very different from that of the traditional frontend network.  Traffic in the backend network is largely pre-composed and then executed like a tightly coupled, complex orchestral routine, with precise and well-planned moves performed by collective operations and collective communication libraries, or CCLs (e.g., NCCL for NVIDIA GPUs and RCCL for AMD GPUs) [6].  One misbehaving element (an overburdened GPU or a congested link between a few of them) can throw the entire routine into disarray; as a result, training jobs can take far longer to complete, recommender systems can fail to respond in time and annoy users, and expensive infrastructure (GPUs cost a lot and consume a lot of power) can sit underutilized.  The collective operations and libraries that orchestrate data movement in the backend network for GPU-to-GPU communication become the maestro’s baton that keeps things in order.  Networking systems used in backend networks are evolving rapidly, trying to keep up with the growth in scale and sophistication of generative AI applications.
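To make the role of the CCL concrete, below is a minimal sketch (not from the article) of how an AI framework hands a collective operation to NCCL or RCCL. It assumes PyTorch with the NCCL backend (RCCL is used transparently on AMD GPUs), a launch via torchrun, and an illustrative tensor size.

```python
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every process;
    # init_process_group wires each process into the NCCL/RCCL communicator.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank holds its own gradient shard; all_reduce sums the shards
    # across every GPU, generating the GPU-to-GPU traffic described above.
    grad = torch.ones(1 << 20, device="cuda") * dist.get_rank()
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A single all_reduce call like this fans out into the pre-planned ring or tree traffic patterns that the CCL schedules across the scale-up and scale-out networks, for example when launched as `torchrun --nproc_per_node=8 allreduce_example.py` on an 8-GPU server (the script name is hypothetical).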

The backend network has two parts.  Proprietary GPU fabrics like NVLink and Infinity Fabric (shown in figure 1) are used for GPU-to-GPU communication within the server or across smaller clusters of NVIDIA and AMD GPUs, respectively.  This part of the backend network is called the “scale-up” network.  For communication among a larger set of GPUs, a complementary “scale-out” network is used, which today is serviced mostly by InfiniBand switches and host adapters in the servers.  The tightly coupled orchestra routine applies most strongly to the proprietary scale-up network built with GPU fabrics, but it has become equally relevant to the scale-out network as large GPU clusters increasingly become the norm for AI training applications.

Backend Scale-out Networking Trends and Implications

Sales of networking gear for servers and switches in the backend network are growing exponentially [5], and these networks are being sized to connect 100,000 GPUs or more [7].  Ethernet switch and NIC silicon vendors see the continued use of InfiniBand in the scale-out portion of this network as a significant threat.  InfiniBand is not considered the right technology given its limited scale (clusters with a few thousand GPUs) and its lack of supplier diversity (NVIDIA is the only vendor). The large Ethernet community has acknowledged deficiencies in Ethernet, mostly related to the use of RDMA and congestion management at scale. The Ultra Ethernet Consortium (UEC) [8] was formed in 2023 to address these architectural and technology challenges and enable Ethernet to replace InfiniBand in the backend network, as shown in figure 2. One of the core goals of the UEC is to define a new RDMA transport for Ethernet that scales much better than version 2 of the RoCE specification (RoCEv2).

Figure 2: Enhancing Ethernet to replace InfiniBand while addressing higher scale needs

With the explosive growth of backend network deployments, the implications for current and future networking system and silicon design have become a vibrant topic:

Observation #1:

On the right-hand side of figure 2, note the increase in the number of Ethernet (blue) links from the CPU and GPU servers.  Each server is now equipped with one or two frontend Ethernet NICs per CPU (one for storage and one for other general-purpose CPU traffic) and one RDMA-capable Ethernet NIC per GPU in the server.  This is a departure from traditional CPU-only servers, which typically housed only one NIC (or two for reliability). The NICs connect to the CPU or GPU over PCIe, requiring one or more PCIe switches in the server design.
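As a rough, hypothetical illustration (the per-server counts below are assumptions, not figures from any specific product), the NIC arithmetic for an 8-GPU, 2-CPU AI server might look like this:

```python
# Hypothetical NIC tally for an AI server; all counts are illustrative assumptions.
GPUS_PER_SERVER = 8
CPUS_PER_SERVER = 2
FRONTEND_NICS_PER_CPU = 2   # e.g., one for storage, one for general-purpose traffic
BACKEND_NICS_PER_GPU = 1    # one RDMA-capable Ethernet NIC per GPU

frontend_nics = CPUS_PER_SERVER * FRONTEND_NICS_PER_CPU
backend_nics = GPUS_PER_SERVER * BACKEND_NICS_PER_GPU

print(f"frontend NICs: {frontend_nics}")                                  # 4
print(f"backend NICs:  {backend_nics}")                                   # 8
print(f"total NIC endpoints per server: {frontend_nics + backend_nics}")  # 12
```

Compare that with the one or two NICs of a traditional CPU-only server, and the pressure on PCIe lanes and PCIe switch placement inside the chassis becomes clear.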

Observation #2:

In servers with CPUs and GPUs, two distinct kinds of NIC silicon are emerging, one for the frontend and one for the backend network.  Terms like SuperNIC (by NVIDIA) [9] and AI NIC (by AMD) [10] are beginning to be associated with the NIC for the backend network.  Below are some key differences between the two:

  • The frontend Ethernet NIC typically supports up to 200Gbps of bandwidth.  This NIC must support storage-related accelerations, including RDMA and NVMe over Fabrics [11], and may also support multi-tenancy and security policy offload processing. As a SmartNIC or DPU, this NIC includes silicon with an ARM CPU complex for control-plane and other processing offloads from the host CPU.

  • The backend Ethernet NIC, also called the SuperNIC or AI NIC, supports up to 400Gbps of bandwidth today.  This NIC is coupled one-to-one with a GPU and must keep cadence with, and deliver higher bandwidth for, each new GPU generation.  RDMA and advanced congestion management at scale are critical requirements, including the ability to support future UEC-defined transports and congestion handling mechanisms. The NIC must also support collective communication libraries [6] and acceleration of frequently used collective functions to enable faster and more efficient GPU-to-GPU communication (see the sketch after this list).
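As a concrete but hypothetical example of the CCL coupling mentioned above, the snippet below sketches how a training job might pin NCCL/RCCL traffic to the backend RDMA NICs before the process group is created. The device and interface names are placeholders for whatever a given server actually reports, not recommendations.

```python
import os

# Steer collective traffic onto the backend RDMA NICs (names are placeholders;
# on a real server they would come from tools such as ibv_devinfo).
os.environ["NCCL_IB_HCA"] = "mlx5_0,mlx5_1,mlx5_2,mlx5_3"   # backend RDMA devices
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"                   # frontend NIC, bootstrap traffic only
os.environ["NCCL_IB_GID_INDEX"] = "3"                       # RoCEv2 GID index, deployment-specific

# These variables must be set before torch.distributed.init_process_group(backend="nccl")
# is called (see the earlier all-reduce sketch); RCCL honors the same NCCL_* variables.
```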

Observation #3:

The Ethernet TOR and spine switch systems used in backend networks must complement the RDMA and congestion management schemes implemented on the server and RDMA NICs.  They implement advanced traffic management and telemetry for congestion notification tuned for RDMA and GPU-to-GPU traffic (e.g., long-lived, high-bandwidth flows).  Some switch silicon designs implement a virtual output queue (VOQ)-based scheduled fabric [12] for congestion-free delivery of packets within the leaf/TOR and spine network.  Others implement acceleration of collective operations like All-Reduce to optimize GPU-to-GPU traffic in the network and reduce congestion choke points. In general, the congestion management schemes are implemented independently in the NIC and the switch and are designed to interoperate across multi-vendor implementations of each.  NVIDIA stands out with its Spectrum-X solution [13], which highlights the value of tighter coupling between its BlueField SuperNICs and Spectrum Ethernet switches.
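To illustrate one ingredient of such congestion management, here is a toy, vendor-neutral sketch of ECN-style probabilistic marking of the kind used by schemes such as DCQCN. The thresholds are made-up values, and real switch pipelines implement this in hardware with far more nuance.

```python
import random

K_MIN_BYTES = 100_000   # below this egress queue depth, never mark (assumed value)
K_MAX_BYTES = 400_000   # above this depth, always mark (assumed value)

def should_mark_ecn(queue_depth_bytes: int) -> bool:
    """Mark packets with increasing probability as the egress queue fills."""
    if queue_depth_bytes <= K_MIN_BYTES:
        return False
    if queue_depth_bytes >= K_MAX_BYTES:
        return True
    mark_probability = (queue_depth_bytes - K_MIN_BYTES) / (K_MAX_BYTES - K_MIN_BYTES)
    return random.random() < mark_probability

# A long-lived, high-bandwidth RDMA flow that keeps the queue deep gets marked often;
# the receiving NIC reflects the marks back so the sender's rate controller backs off.
print(should_mark_ecn(350_000))
```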

To me, the observations above are great food for thought that warrant healthy discussion, and I welcome it.

Food for thought #1:

Servers used for AI contain many NIC and PCIe switch-related systems and silicon that connect to the CPUs and GPUs.  Many servers available today include multiple CPUs and GPUs (discussed further in a future part of this article), which substantially increases the number of networking-related silicon devices in the server.  Should we expect consolidation in this space, where more functions or capacity get subsumed into the same silicon?

Food for thought #2:

Will the SuperNIC or AI NIC, as a new server networking product category, become the highest-growth segment in the coming years?

Food for thought #3:

Collective libraries (like NCCL/RCCL) pre-plan and orchestrate GPU-to-GPU traffic across the network of NICs and switches, and they support popular AI frameworks (like PyTorch) that run in servers. Given the tight dependency between application libraries/APIs in servers and the traffic patterns that flow through switches, should we expect tighter coupling between backend NICs and switches for improved congestion management?

What's Next?

We discussed backend scale-out networking trends and their implications. The scale-up network is tightly entwined with the scale-out network.  Part 2 of this article will begin with a discussion of backend scale-up networking trends and their implications for networking system and silicon features.

References and Further Readings

[1] Large Language Models

[2] Recommender Systems

[3] Remote DMA or RDMA

[4] RDMA over Converged Ethernet or RoCE

[5] RDMA Deployments Skyrocket

[6] Collective Communication Libraries: NCCL, RCCL, xCCL Survey

[7] Future Backend Networking Scale - Connecting up to 600,000 GPUs

[8] Ultra Ethernet Consortium (UEC)

[9] SuperNIC by NVIDIA

[10] AI NIC by AMD

[11] NVMe over Fabrics

[12] Broadcom Switches for AI with VOQ-based Scheduled Fabric

[13] NVIDIA Spectrum-X Networking Platform

 

About the Author:

As a product management leader, Sujal Das has been at the forefront of cloud and data center networking initiatives for almost two decades.  While at Mellanox/NVIDIA, Broadcom, Netronome, and Microsoft, he has helped build and deliver advanced server and switch networking products and technologies that serve as foundations of today’s AI networking infrastructure.  Currently, he is consulting as the head of product management at Enfabrica, a well-funded silicon startup that is tackling the AI networking scale and congestion management challenges with innovative approaches.

