Stars
A curated list of Awesome LLM Inference Papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention, etc.
The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
Universal cross-platform tokenizer bindings for HF and SentencePiece
Efficient Triton Kernels for LLM Training
Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA
Chat with AI large language models running natively in your browser. Enjoy private, server-free, seamless AI conversations.
Development repository for the Triton language and compiler
Yes, it's another chat over documents implementation... but this one is entirely local!
🦜🔗 Build context-aware reasoning applications 🦜🔗
CD-GraB is a distributed gradient balancing framework that aims to find a distributed data permutation with provably better convergence guarantees than Distributed Random Reshuffling (D-RR). https://…
Chat Templates for 🤗 HuggingFace Large Language Models
Utilities to use the Hugging Face Hub API
An easy-to-understand TensorOp Matmul Tutorial
Mixture-of-Experts for Large Vision-Language Models
The official Python library for the OpenAI API
A high-throughput and memory-efficient inference and serving engine for LLMs
AI Assistant running within your browser.
FlashInfer: Kernel Library for LLM Serving
Building blocks for foundation models.
You like pytorch? You like micrograd? You love tinygrad! ❤️
State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server!