Research from Artificial Analysis highlights that AI infrastructure is not one-size-fits-all. The AMD #MI300X is emerging as a viable alternative to the NVIDIA #H100, delivering more cost-effective LLM inference. Optimising for price and performance means choosing the right GPU for your workload. Nscale is the GPU cloud engineered for AI, delivering AMD and NVIDIA GPUs at massive scale, powered sustainably and optimised for your AI training, fine-tuning and inference workloads.
Llama 3.1 405B could be the catalyst for much greater #AMD adoption for AI inference 📈

AMD's MI300X may be uniquely suited to cost-effective Llama 3.1 405B inference. Its 192GB of memory per GPU allows a single 8xMI300X node to serve Llama 3.1 405B in its native FP16 precision, whereas two nodes are required on NVIDIA H100. As we have previously covered, a single NVIDIA 8xH100 node has only 640GB of memory - not enough to hold Llama 3.1 405B's full ~810GB of FP16 weights at once. Providers are therefore forced to deploy two 8xH100 nodes with interconnect to serve 405B in FP16, accepting a significant cost and complexity penalty (see the memory arithmetic sketch below).

NVIDIA's upcoming H200 and B100 come with 141GB and 192GB of high-bandwidth memory per GPU respectively - but unlike those, AMD's MI300X is available now. Lisa Su noted on AMD's Q2 earnings call that AMD is demand-limited on MI300X for the remainder of 2024. Will Llama 3.1 405B alone flip that narrative?

We are starting to see adoption and support increase. Both Fireworks AI and Lepton AI are hosting Llama 3.1 405B on AMD MI300X, and they stand out as the lowest-cost providers of Llama 3.1 405B. It is important to note, however, that they are serving the model at FP8 and INT8 precision respectively. Furthermore, projects like GPU.cpp from Answer.AI are making it easier than ever to write and run portable code across different chip (hardware & software) architectures, reducing CUDA lock-in.

What is your view? Long #AMD?
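For readers who want to sanity-check the memory claim, here is a minimal back-of-the-envelope sketch in Python. It assumes weights-only memory (no KV cache, activations, or framework overhead) and the headline HBM figures quoted above (80GB per H100, 192GB per MI300X); real deployments need additional headroom.

```python
# Back-of-the-envelope check: can a single 8-GPU node hold Llama 3.1 405B's weights?
# Weights-only estimate; KV cache, activations and framework overhead are ignored.

PARAMS = 405e9  # Llama 3.1 405B parameter count

BYTES_PER_PARAM = {"FP16": 2, "FP8": 1, "INT8": 1}

NODE_MEMORY_GB = {
    "8x NVIDIA H100 (80GB each)": 8 * 80,    # 640GB
    "8x AMD MI300X (192GB each)": 8 * 192,   # 1536GB
}

for precision, bytes_per_param in BYTES_PER_PARAM.items():
    weights_gb = PARAMS * bytes_per_param / 1e9
    print(f"\n{precision}: ~{weights_gb:.0f}GB of weights")
    for node, capacity_gb in NODE_MEMORY_GB.items():
        verdict = "fits" if weights_gb <= capacity_gb else "does NOT fit"
        print(f"  {node}: {capacity_gb}GB -> {verdict}")
```

The arithmetic matches the post: ~810GB of FP16 weights exceeds an 8xH100 node's 640GB but fits comfortably in an 8xMI300X node's 1536GB, while at FP8 or INT8 (~405GB) a single H100 node suffices - the trade-off Fireworks AI and Lepton AI are making.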