December 06, 2024
Accelerating 2D Dynamic Block Quantized Float8 GEMMs in Triton
2D block quantization for Float8 (FP8) holds the promise of improving the accuracy of Float8 quantization while also accelerating GEMMs for both inference and training. In this blog, we showcase advances using Triton for the two main phases involved in performing block-quantized Float8 GEMMs.
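To give a rough sense of what 2D block quantization means, the sketch below quantizes a 2D tensor with one scale per square tile, as would happen before feeding operands to an FP8 GEMM. The helper name, block size, and use of `torch.float8_e4m3fn` are illustrative assumptions, not the Triton kernels described in the post.

```python
import torch

def blockwise_fp8_quantize(x: torch.Tensor, block: int = 128):
    """Hypothetical helper: quantize a 2D tensor to FP8 with one scale per
    (block x block) tile. Assumes both dimensions are divisible by `block`."""
    M, N = x.shape
    tiles = x.reshape(M // block, block, N // block, block)
    # Per-tile absolute max determines how each tile is mapped into FP8 range.
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True)
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448 for e4m3fn
    scale = fp8_max / amax.clamp(min=1e-12)
    x_fp8 = (tiles * scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    # Dequantization would divide by the per-tile scale again.
    return x_fp8.reshape(M, N), scale.squeeze()
```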
December 02, 2024
HadaCore: Tensor Core Accelerated Hadamard Transform Kernel
Quantization is a method for improving model inference speeds by compressing model weights and performing (faster) computation in lower precision data types. However, quantization can result in accuracy loss due to the presence of outliers.
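To see why outliers hurt, consider a minimal per-tensor absmax int8 quantizer (a generic sketch, not HadaCore's method): a single large value sets the quantization step for the whole tensor, coarsening the resolution available to every other weight.

```python
import torch

def absmax_quantize_int8(w: torch.Tensor):
    # One scale per tensor: the largest-magnitude value fixes the step size,
    # so an outlier stretches the range and wastes precision on typical values.
    scale = w.abs().max() / 127.0
    w_q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    return w_q, scale

w = torch.randn(4096)
w[0] = 50.0  # inject a single outlier
w_q, scale = absmax_quantize_int8(w)
mean_abs_error = (w - w_q.float() * scale).abs().mean()
```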
November 25, 2024
Supercharging Training using float8 and FSDP2
In this blog, we demonstrate how we achieve up to a 50% throughput speedup while maintaining loss and evaluation-benchmark parity, compared to FSDP1 bf16 training.
November 21, 2024
Rebellions Joins the PyTorch Foundation as a General Member
The PyTorch Foundation, a neutral home for the deep learning community to collaborate on the open source PyTorch framework and ecosystem, is announcing today that Rebellions has joined as a general member.
November 18, 2024
Distilling Llama3.1 8B into 1B in torchtune
In this blog, we present a case study on distilling a Llama 3.1 8B model into Llama 3.2 1B using torchtune’s knowledge distillation recipe. We demonstrate how knowledge distillation (KD) can be used in post-training to improve instruction-following task performance and showcase how users can leverage the recipe.
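For reference, the core of knowledge distillation is a divergence between the teacher's and student's token distributions. The sketch below shows a plain forward-KL version in PyTorch; it is a minimal illustration of that idea, not torchtune's exact loss classes or recipe configuration.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            temperature: float = 1.0) -> torch.Tensor:
    """Forward KL(teacher || student) over the vocabulary dimension."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # Averaged over tokens in the batch; scaled by T^2 as is conventional.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * (t * t)
```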
November 01, 2024
Deep Dive on CUTLASS Ping-Pong GEMM Kernel
In this post, we provide an overview of the CUTLASS Ping-Pong GEMM kernel, along with relevant FP8 inference kernel benchmarks.
October 31, 2024
Deploying LLMs with TorchServe + vLLM
The vLLM engine is currently one of the top-performing ways to execute large language models (LLMs). It provides the vllm serve command as an easy option to deploy a model on a single machine. While this is convenient, serving these LLMs in production and at scale requires some advanced features.
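Once a model is running via vllm serve, it exposes an OpenAI-compatible HTTP API. The snippet below is a minimal client sketch; the host, port, and model name are assumptions matching vLLM's defaults rather than values from the post.

```python
from openai import OpenAI

# vllm serve exposes an OpenAI-compatible endpoint, by default on port 8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # whichever model was served
    messages=[{"role": "user", "content": "Summarize what vLLM does."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```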