Interconnect Needs for LLM Inference to Drive Networking Bandwidth
The network is critical to stitching #GPUs together. To meet user-experience and performance requirements, LLMs will need all-to-all connectivity and powerful interconnects. Emphasis on the "s" in interconnects, as most LLM deployments will benefit from multiple fabrics.
Looking at an NVIDIA deployment, this will be a mix of an #Ethernet/#InfiniBand back-end network and NVLink as the two fabrics. Running with just the Ethernet/InfiniBand network and relying on the server's PCIe interconnect results in slower processing times. Performance increases significantly by utilizing both Ethernet/InfiniBand and NVLink (soon extended across multiple servers with NVSwitch). That allows more concurrent queries into the LLM and a lower cost for the operator. The performance advantage comes from combining networks built with different goals: one all-to-all GPU fabric and one scale-out fabric, each carrying the traffic it was designed for.
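For a sense of why the GPU fabric matters, here is a rough back-of-envelope sketch (not from the blog): it estimates the time tensor-parallel inference spends in all-reduces over PCIe versus NVLink. Every bandwidth and model dimension below is an assumed, illustrative number, not a measured figure.

```python
# Back-of-envelope: all-reduce time per forward pass over PCIe vs. NVLink.
# All constants are illustrative assumptions, not vendor specs or benchmarks.

N_GPUS = 8                     # GPUs sharing each layer via tensor parallelism
HIDDEN = 8192                  # assumed model hidden dimension
BATCH_TOKENS = 2048            # assumed tokens in flight per step
BYTES_PER_VALUE = 2            # fp16/bf16 activations
ALLREDUCES_PER_LAYER = 2       # one after attention, one after the MLP
LAYERS = 80                    # assumed transformer layer count

# Assumed effective per-GPU bandwidths in GB/s; real numbers vary by platform.
PCIE_GEN5_X16_GBPS = 60
NVLINK_GBPS = 400

def ring_allreduce_seconds(message_bytes: float, bandwidth_gbps: float, n: int) -> float:
    """A ring all-reduce moves roughly 2*(n-1)/n of the message per GPU."""
    traffic = 2 * (n - 1) / n * message_bytes
    return traffic / (bandwidth_gbps * 1e9)

message = HIDDEN * BATCH_TOKENS * BYTES_PER_VALUE   # bytes per all-reduce
per_step = LAYERS * ALLREDUCES_PER_LAYER             # all-reduces per forward pass

for name, bw in [("PCIe Gen5 x16", PCIE_GEN5_X16_GBPS), ("NVLink", NVLINK_GBPS)]:
    t = ring_allreduce_seconds(message, bw, N_GPUS) * per_step
    print(f"{name:>14}: ~{t * 1e3:.1f} ms of all-reduce per forward pass")
```

Under these assumptions the PCIe-only path spends several times longer in communication, which is exactly the headroom the GPU fabric gives back for serving more concurrent queries.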
As users consume more LLMs via cloud operators, SaaS providers, and consumer devices, the industry should expect LLMs to be served from large cluster deployments instead of single GPUs. From a networking perspective, this will show up as multiple networks in each rack, with the amount of networking bandwidth per server increasing significantly. For example, most large LLM deployments will have an Ethernet front-end network, an Ethernet/InfiniBand back-end network, and an NVLink/UALink/other GPU-fabric network. These multiple networks are the key to unlocking the value of the LLMs and GPUs and to increasing the speed of AI adoption.
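To make the bandwidth-per-server point concrete, here is a minimal sketch that tallies the three networks for a hypothetical 8-GPU inference server. All link counts and speeds are assumed round numbers chosen for illustration, loosely modeled on current 8-GPU designs, not vendor specifications.

```python
# Rough tally of networking bandwidth in one hypothetical 8-GPU server.
# Every figure is an assumed, round-number example, not a spec.

networks = {
    # name: (links per server, Gbps per link, role)
    "Front-end Ethernet":       (2, 200,  "user queries, storage, management"),
    "Back-end Ethernet/IB":     (8, 400,  "GPU-to-GPU traffic between servers"),
    "NVLink/UALink GPU fabric": (8, 7200, "GPU-to-GPU traffic inside the server"),
}

total_gbps = 0
for name, (links, gbps, role) in networks.items():
    bw = links * gbps
    total_gbps += bw
    print(f"{name:<26} {bw / 1000:>6.1f} Tbps  ({role})")

print(f"{'Total per server':<26} {total_gbps / 1000:>6.1f} Tbps")
```

Even with conservative assumptions, the GPU fabric dwarfs the other networks, and the combined figure lands in the tens of terabits per server, which is the bandwidth growth the post is pointing to.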
Read the blog here!
https://lnkd.in/e5MjT9Ck