Releases: EricLBuehler/mistral.rs
v0.3.4
New features
- Qwen2-VL support
- Idefics 3/SmolVLM support
- 🔥 6x prompt performance boost (all benchmarks faster than or comparable to MLX and llama.cpp)!
- More efficient non-PagedAttention KV cache implementation!
- Public tokenization API
Python wheels
The wheels now include support for Windows, Linux, and macOS on x86_64 and aarch64.
MSRV
1.79.0
What's Changed
- Update Dockerfile by @Reckon-11 in #895
- Add the Qwen2-VL model by @EricLBuehler in #894
- ISQ for mistralrs-bench by @EricLBuehler in #902
- Use tokenizers v0.20 by @EricLBuehler in #904
- Fix metal sdpa for v stride by @EricLBuehler in #905
- Better parsing of the image path by @EricLBuehler in #906
- Add some Metal kernels for HQQ dequant by @EricLBuehler in #907
- Handle assistant messages with 'tool_calls' by @Jeadie in #824
- Attention-fused softmax for Metal by @EricLBuehler in #908
- Metal qmatmul mat-mat product (5.4x performance increase) by @EricLBuehler in #909
- Support --dtype in mistralrs bench by @EricLBuehler in #911
- Metal: Use mtl resource shared to avoid one copy by @EricLBuehler in #914
- Preallocated KV cache by @EricLBuehler in #916
- Fixes for kv cache grow by @EricLBuehler in #917
- Don't always compile with fp8, bf16 for cuda by @EricLBuehler in #920
- Expand attnmask on cuda by @EricLBuehler in #923
- Faster CUDA prompt speeds by @EricLBuehler in #925
- Paged Attention alibi support by @EricLBuehler in #926
- Default to SDPA for faster VLlama PP T/s by @EricLBuehler in #927
- VLlama vision model ISQ support by @EricLBuehler in #928
- Support fp8 on Metal by @EricLBuehler in #930
- Bump rustls from 0.23.15 to 0.23.18 by @dependabot in #932
- Calculate perplexity of ISQ models by @EricLBuehler in #931
- Integrate fast MLX kernel for SDPA with long seqlen by @EricLBuehler in #933
- Always cast image to rgb8 for qwenvl2 by @EricLBuehler in #936
- Fix etag missing in hf hub by @EricLBuehler in #934
- Fix some examples for vllama 3.2 by @EricLBuehler in #937
- Improve memory efficiency of vllama by @EricLBuehler in #938
- Implement the Idefics 3 models (Idefics 3, SmolVLM-Instruct) by @EricLBuehler in #939
- Expose a public tokenization API by @EricLBuehler in #940
- Prepare for v0.3.4 by @EricLBuehler in #942
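Perplexity for the quantized (ISQ) models, as tracked in #931, is conventionally the exponential of the mean negative log-likelihood over the evaluated tokens. A minimal sketch of that formula (function and variable names are illustrative, not the mistral.rs API):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood) of the evaluated tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model that assigns probability 0.25 to every token has perplexity 4.
print(perplexity([math.log(0.25)] * 8))  # 4.0
```

Comparing this number before and after quantization gives a quick accuracy-loss signal.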
New Contributors
- @Reckon-11 made their first contribution in #895
Full Changelog: v0.3.2...v0.3.4
v0.3.2
Key changes
- General improvements and fixes
- ISQ FP8
- GPTQ Marlin
- 26% performance boost on Metal
- Python package wheels are available. See below and the various PyPI packages.
What's Changed
- Update docs and deps by @EricLBuehler in #804
- Support Qwen 2.5 by @EricLBuehler in #805
- Update docs with clarifications and notes by @EricLBuehler in #806
- Improved inverting for Attention Mask by @EricLBuehler in #811
- Fix `repeat_interleave` by @EricLBuehler in #812
- Use f32 for neg inf in cross attn mask by @EricLBuehler in #814
- Improve UQFF memory efficiency by @EricLBuehler in #813
- Update Metal, CUDA Candle impls and ISQ by @EricLBuehler in #816
- chore: update pagedattention.cu by @eltociear in #822
- MLlama - if f16, load vision model in f32 by @EricLBuehler in #820
- ci: Upgrade actions by @polarathene in #823
- docs: added a top button because of readme length by @bhargavshirin in #833
- Typo in error of model architecture enum by @nikolaydubina in #835
- Expose config for Rust api, tweak modekind by @EricLBuehler in #841
- Add ISQ FP8 by @EricLBuehler in #832
- Fix Metal F8 build errors by @EricLBuehler in #846
- Bump pyo3 from 0.22.3 to 0.22.4 by @dependabot in #854
- Generate standalone UQFF models by @EricLBuehler in #849
- Update README.MD by @kaleaditya779 in #848
- Add GPTQ Marlin support for 4 and 8 bit by @EricLBuehler in #856
- Adds wrap_help feature to clap by @DaveTJones in #858
- Patch UQFF metal generation by @EricLBuehler in #857
- Add GGUF Qwen 2 by @EricLBuehler in #860
- Avoid duplicate Metal command buffer encodings during ISQ by @EricLBuehler in #861
- Fix for isnanf by @EricLBuehler in #859
- Fix some metal warnings by @EricLBuehler in #862
- Support interactive mode markdown bold/italics via ANSI codes by @EricLBuehler in #879
- Even better V-Llama accuracy by @EricLBuehler in #881
- Trim whitespace (such as carriage returns) from nvidia-smi output. by @asaddi in #880
- MODEL_ID not "MODEL_ID" by @simonw in #863
- Sync ggml metal kernels by @EricLBuehler in #885
- Increase Metal decoding T/s by 26% by @EricLBuehler in #887
- Remove pretty-printer by @EricLBuehler in #889
- Fix typo in documentation by @msk in #888
- fix Half-Quadratic Quantization and Dequantization on CPU by @haricot in #873
- Prepare for v0.3.2 by @EricLBuehler in #891
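The interactive-mode change in #879 renders Markdown emphasis with ANSI escape codes. A rough sketch of the idea (the escape sequences are standard SGR codes, but this is not the actual mistral.rs implementation):

```python
import re

BOLD, ITALIC, RESET = "\x1b[1m", "\x1b[3m", "\x1b[0m"

def ansi_markdown(text: str) -> str:
    """Map **bold** and *italic* Markdown spans to ANSI SGR sequences."""
    text = re.sub(r"\*\*(.+?)\*\*", BOLD + r"\1" + RESET, text)  # bold first, so ** is not eaten by *
    text = re.sub(r"\*(.+?)\*", ITALIC + r"\1" + RESET, text)
    return text

print(repr(ansi_markdown("**hi** *there*")))  # '\x1b[1mhi\x1b[0m \x1b[3mthere\x1b[0m'
```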
New Contributors
- @bhargavshirin made their first contribution in #833
- @nikolaydubina made their first contribution in #835
- @kaleaditya779 made their first contribution in #848
- @DaveTJones made their first contribution in #858
- @asaddi made their first contribution in #880
- @simonw made their first contribution in #863
- @msk made their first contribution in #888
- @haricot made their first contribution in #873
Full Changelog: v0.3.1...v0.3.2
v0.3.1
Highlights
- UQFF
- FLUX model
- Llama 3.2 Vision model
MSRV
The MSRV of this release is 1.79.0.
What's Changed
- Enable automatic determination of normal loader type by @EricLBuehler in #742
- Add the `ForwardInputsResult` api by @EricLBuehler in #745
- Implement Mixture of Quantized Experts (MoQE) by @EricLBuehler in #747
- Bump quinn-proto from 0.11.6 to 0.11.8 by @dependabot in #748
- Fix f64-f32 type mismatch for Metal/Accelerate by @EricLBuehler in #752
- Nicer error when misconfigured PagedAttention input metadata by @EricLBuehler in #753
- Update deps, support CUDA 12.6 by @EricLBuehler in #755
- Patch bug when not using PagedAttention by @EricLBuehler in #759
- Fix `MistralRs` Drop impl in tokio runtime by @EricLBuehler in #762
- Use nicer Candle Error APIs by @EricLBuehler in #767
- Support setting seed by @EricLBuehler in #766
- Fix Metal build error with seed by @EricLBuehler in #771
- Fix and add checks for no kv cache by @EricLBuehler in #776
- UQFF: The uniquely powerful quantized file format. by @EricLBuehler in #770
- Add `Scheduler::running_len` by @EricLBuehler in #780
- Deduplicate RoPE caches by @EricLBuehler in #787
- Easier and simpler Rust-side API by @EricLBuehler in #785
- Add some examples for AnyMoE by @EricLBuehler in #788
- Rust API for sampling by @EricLBuehler in #790
- Our first Diffusion model: FLUX by @EricLBuehler in #758
- Fix build bugs with metal, NSUInteger by @EricLBuehler in #792
- Support weight tying in Llama 3.2 GGUF models by @EricLBuehler in #801
- Implement the Llama 3.2 vision models by @EricLBuehler in #796
Full Changelog: v0.3.0...v0.3.1
v0.3.0
Highlights
- New model topology feature: ISQ and device mapping
- 🔥 Faster FlashAttention support when batching
- Removed `plotly` and associated JS dependencies
- Support Phi 3.5, Phi 3.5 vision, Phi 3.5 MoE
- Improved Rust API ergonomics
- Support multiple (sharded) GGUF files
MSRV
The Rust MSRV of this version is 1.79.0
What's Changed
- Fixes for auto dtype selection with RUST_BACKTRACE=1 by @EricLBuehler in #690
- Add support multiple GGUF files by @EricLBuehler in #692
- Refactor normal and vision loaders by @EricLBuehler in #693
- Fix `split.count` GGUF duplication handling by @EricLBuehler in #695
- Batching example by @EricLBuehler in #694
- Some fixes by @EricLBuehler in #697
- Improve vision rust examples by @EricLBuehler in #698
- Add ISQ topology by @EricLBuehler in #701
- Add custom logits processor API by @EricLBuehler in #702
- Add Gemma 2 PagedAttention support by @EricLBuehler in #704
- Faster RmsNorm in Gemma/Gemma2 by @EricLBuehler in #703
- Fix bug in Metal ISQ by @EricLBuehler in #706
- Support GGUF BF16 tensors by @EricLBuehler in #691
- Better support for FlashAttention: real batching, sliding window, softcap by @EricLBuehler in #707
- Remove some usages of `pub` in models by @EricLBuehler in #708
- Support the Phi 3.5 V model by @EricLBuehler in #710
- Implement the Phi 3.5 MoE model by @EricLBuehler in #709
- Device map topology by @EricLBuehler in #717
- Implement DRY penalty by @EricLBuehler in #637
- Remove plotly and just output CSV loss file by @EricLBuehler in #700
- Using once_cell to reduce MSRV by @EricLBuehler in #724
- Fixes for Windows build by @EricLBuehler in #729
- Even more phi3.5moe fix attempts by @EricLBuehler in #731
- Add example for Phi 3.5 MoE by @EricLBuehler in #733
- Add Phi 3.5 chat template by @EricLBuehler in #734
- Patch ISQ for Mixtral by @EricLBuehler in #730
- Gracefully handle Engine Drop with termination request by @EricLBuehler in #735
- feat(vision): add support for proper file and data image URLs by @Schuwi in #727
- Add new parsing to Python API by @EricLBuehler in #737
- Remove test and add custom error type to Python API by @EricLBuehler in #738
- Update kernels for metal bf16 by @EricLBuehler in #719
- Better `Response` Result API by @EricLBuehler in #739
- More Metal quantized kernel fixes by @EricLBuehler in #740
- [Breaking] Bump version to v0.3.0 by @EricLBuehler in #736
- Final changes for v0.3.0 by @EricLBuehler in #741
Full Changelog: v0.2.5...v0.3.0
v0.2.5
What's Changed
- Refactor ISQ quant parsing by @EricLBuehler in #664
- Refactor server examples to use OpenAI Python client by @EricLBuehler in #665
- Implement prompt chunking by @EricLBuehler in #623
- Python example and server example cleanup by @EricLBuehler in #668
- Implement GPTQ quantization by @EricLBuehler in #467
- Update deps by @EricLBuehler in #672
- Rework the automatic dtype selection feature by @EricLBuehler in #676
- Fix backend Candle fork Metal, flash attn, also Llama linear by @EricLBuehler in #681
- Use converted tokenizer.json in tests by @EricLBuehler in #682
- Refactor ISQ and mistralrs-quant by @EricLBuehler in #683
- Fix metal build for isq by @EricLBuehler in #686
- Add missing error case in automatic dtype selection feature by @ac3xx in #685
- fix null in tool type response by @wseaton in #687
- Implement HQQ quantization by @EricLBuehler in #677
- Bump version to 0.2.5 by @EricLBuehler in #688
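Prompt chunking (#623) processes a long prompt in fixed-size slices so prefill memory stays bounded instead of growing with the full prompt length. A toy sketch of the splitting step (chunk size and names are illustrative, not the mistral.rs internals):

```python
def chunk_prompt(token_ids, chunk_size):
    """Split a token sequence into consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [token_ids[i : i + chunk_size] for i in range(0, len(token_ids), chunk_size)]

print(chunk_prompt(list(range(10)), 4))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Each chunk is prefilled in turn, with the KV cache carrying state between chunks.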
Full Changelog: v0.2.4...v0.2.5
Install mistralrs-server 0.2.5
Install prebuilt binaries via shell script
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/EricLBuehler/mistral.rs/releases/download/v0.2.5/mistralrs-server-installer.sh | sh
Download mistralrs-server 0.2.5
File | Platform | Checksum |
---|---|---|
mistralrs-server-aarch64-apple-darwin.tar.xz | Apple Silicon macOS | checksum |
mistralrs-server-x86_64-apple-darwin.tar.xz | Intel macOS | checksum |
mistralrs-server-x86_64-unknown-linux-gnu.tar.xz | x64 Linux | checksum |
v0.2.4
What's Changed
- fix build on metal by returning Device by @rgbkrk in #642
- Add invite to Matrix chatroom by @EricLBuehler in #644
- Make sure we don't have dead links by @EricLBuehler in #647
- Fix more links by @EricLBuehler in #648
- Throughput for interactive mode by @EricLBuehler in #655
- Implement tool calling by @EricLBuehler in #649
- Fix device map check for paged attn by @EricLBuehler in #656
- Fix for mistral nemo in gguf by @EricLBuehler in #657
- Fix check of cache config when device mapping PA by @EricLBuehler in #658
- Biollama in tool calling example by @EricLBuehler in #659
- Biollama in tool calling example by @EricLBuehler in #660
- Examples for simple tool calling by @EricLBuehler in #661
- Bump version to 0.2.4 by @EricLBuehler in #662
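Tool calling (#649) follows the OpenAI-compatible request shape the server already speaks. A hedged sketch of such a request body (the model name and tool schema are placeholders, not taken from this release):

```python
import json

# An OpenAI-style chat request advertising one callable tool to the model.
request = {
    "model": "placeholder-model",
    "messages": [{"role": "user", "content": "What is the weather in Boston?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}
print(json.dumps(request)[:40])
```

If the model decides to call the tool, the response carries a `tool_calls` entry with the function name and JSON arguments.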
Full Changelog: v0.2.3...v0.2.4
MSRV
MSRV is 1.75
Install mistralrs-server 0.2.4
Install prebuilt binaries via shell script
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/EricLBuehler/mistral.rs/releases/download/v0.2.4/mistralrs-server-installer.sh | sh
Download mistralrs-server 0.2.4
File | Platform | Checksum |
---|---|---|
mistralrs-server-aarch64-apple-darwin.tar.xz | Apple Silicon macOS | checksum |
mistralrs-server-x86_64-apple-darwin.tar.xz | Intel macOS | checksum |
mistralrs-server-x86_64-unknown-linux-gnu.tar.xz | x64 Linux | checksum |
v0.2.3
What's Changed
- Implement min-p sampling by @EricLBuehler in #625
- Tweak handling when PA cannot allocate by @EricLBuehler in #632
- Update deps by @EricLBuehler in #633
- Improve penalty context window calculation by @EricLBuehler in #636
- Allow setting PagedAttention KV cache allocation from context size by @EricLBuehler in #640
- Bump version to 0.2.3 by @EricLBuehler in #638
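Min-p sampling (#625) keeps only tokens whose probability is at least `min_p` times the top token's probability, then renormalizes. A minimal sketch of the filter in pure Python (names illustrative, not the mistral.rs sampler):

```python
def min_p_filter(probs, min_p):
    """Zero out tokens below min_p * max probability, then renormalize."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

# With min_p=0.7, only the top token (0.5) clears the 0.35 threshold.
print(min_p_filter([0.5, 0.3, 0.2], 0.7))  # [1.0, 0.0, 0.0]
```

Unlike top-k or top-p, the cutoff scales with the model's confidence: a peaked distribution prunes aggressively, a flat one keeps more candidates.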
Full Changelog: v0.2.2...v0.2.3
Install mistralrs-server 0.2.3
Install prebuilt binaries via shell script
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/EricLBuehler/mistral.rs/releases/download/v0.2.3/mistralrs-server-installer.sh | sh
Download mistralrs-server 0.2.3
File | Platform | Checksum |
---|---|---|
mistralrs-server-aarch64-apple-darwin.tar.xz | Apple Silicon macOS | checksum |
mistralrs-server-x86_64-apple-darwin.tar.xz | Intel macOS | checksum |
mistralrs-server-x86_64-unknown-linux-gnu.tar.xz | x64 Linux | checksum |
v0.2.2
What's Changed
- Fix ctrlc handling for scheduler v2 by @EricLBuehler in #614
- Make `sliding_window` optional for mixtral by @csicar in #616
- Support Llama 3.1 scaled rope by @EricLBuehler in #618
Full Changelog: v0.2.1...v0.2.2
MSRV
MSRV is 1.75.
Install mistralrs-server 0.2.2
Install prebuilt binaries via shell script
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/EricLBuehler/mistral.rs/releases/download/v0.2.2/mistralrs-server-installer.sh | sh
Download mistralrs-server 0.2.2
File | Platform | Checksum |
---|---|---|
mistralrs-server-aarch64-apple-darwin.tar.xz | Apple Silicon macOS | checksum |
mistralrs-server-x86_64-apple-darwin.tar.xz | Intel macOS | checksum |
mistralrs-server-x86_64-unknown-linux-gnu.tar.xz | x64 Linux | checksum |
v0.2.1
What's Changed
- Fix path normalize for mistralrs-paged-attn by @EricLBuehler in #592
- ISQ python example by @EricLBuehler in #593
- Add support for mistral nemo by @EricLBuehler in #595
- Fix dtype with QLinear by @EricLBuehler in #600
- Update paged-attn build.rs with NVCC flags by @joshpopelka20 in #604
- Bump openssl from 0.10.64 to 0.10.66 by @dependabot in #605
- Update GitHub issue templates by @EricLBuehler in #607
- Add server throughput logging by @EricLBuehler in #608
- Make the plotly feature optional by @EricLBuehler in #597
- Use OnceLock for Python bindings device by @EricLBuehler in #602
- Topk for X-LoRA scalings by @EricLBuehler in #609
- Fix server cross-origin errors by @openmynet in #610
- Refactor sampler by @EricLBuehler in #611
- Bump version to 0.2.1 by @EricLBuehler in #613
New Contributors
- @dependabot made their first contribution in #605
- @openmynet made their first contribution in #610
Full Changelog: v0.2.0...v0.2.1
Install mistralrs-server 0.2.1
Install prebuilt binaries via shell script
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/EricLBuehler/mistral.rs/releases/download/v0.2.1/mistralrs-server-installer.sh | sh
Download mistralrs-server 0.2.1
File | Platform | Checksum |
---|---|---|
mistralrs-server-aarch64-apple-darwin.tar.xz | Apple Silicon macOS | checksum |
mistralrs-server-x86_64-apple-darwin.tar.xz | Intel macOS | checksum |
mistralrs-server-x86_64-unknown-linux-gnu.tar.xz | x64 Linux | checksum |
v0.2.0
New features
- Support .bin, .pt, .pth extensions
- Add Starcoder 2 GGUF
- 🔥 PagedAttention: beating llama.cpp running GGUF, plus all the throughput benefits
- Optimized performance and memory usage
Rust MSRV
The MSRV of mistral.rs v0.2.0 is 1.75.
What's Changed
- Fix SWA order (flip it) for Gemma 2 by @EricLBuehler in #554
- Support .bin, .pt, .pth extensions by @EricLBuehler in #557
- Update readme by @EricLBuehler in #558
- Fix Starcoder 2 ISQ by @EricLBuehler in #559
- Update deps by @EricLBuehler in #560
- Add the starcoder2 GGUF arch by @EricLBuehler in #561
- Readme update for starcoder2 gguf by @EricLBuehler in #562
- Fix PyPI release trigger by @EricLBuehler in #566
- Optimize multi-batch and inference performance with PagedAttention by @EricLBuehler in #552
- [Breaking] Version 0.2.0 by @EricLBuehler in #527
- Paged attention support for vision models by @EricLBuehler in #567
- Automatically use paged attn on cuda, get memory size by @EricLBuehler in #568
- Add docs link for vision loader by @EricLBuehler in #570
- Add matching for valid model weight names by @EricLBuehler in #571
- Remove ensure about no paged attn for vision models by @EricLBuehler in #573
- Add percentage utilization support to paged attn by @EricLBuehler in #574
- Include block engine in paged attn metadata by @EricLBuehler in #576
- Update deps and sync Candle by @EricLBuehler in #578
- Optimize CLIP model by @EricLBuehler in #579
- Use softmax_last_dim in CLIP by @EricLBuehler in #580
- Fix method of calculating paged attn with util percent by @EricLBuehler in #581
- Handle windows in paged attn build by @EricLBuehler in #577
- Warn instead of error when paged attn not supported by @EricLBuehler in #583
- Warn instead of error when paged attn for adapters not supported by @EricLBuehler in #584
- Add support for lm_head to adapter models by @EricLBuehler in #586
- Add default plotly feature by @EricLBuehler in #587
- Improve memory handling of PagedAttention with GGUF by @EricLBuehler in #590
- Fix Windows build on cuda w/ PagedAttention by @EricLBuehler in #589
- Update cuda kernels build.rs on windows by @EricLBuehler in #591
- Bump version to 0.2.0 and update docs by @EricLBuehler in #582
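#580 switches CLIP to a softmax over the last dimension. The operation itself, sketched in plain Python for clarity (the real work is a fused Candle kernel, not this loop):

```python
import math

def softmax_last_dim(rows):
    """Numerically stable softmax applied independently to each row (the last dim)."""
    out = []
    for row in rows:
        m = max(row)  # subtract the max so exp never overflows
        exps = [math.exp(x - m) for x in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out

print(softmax_last_dim([[1.0, 2.0, 3.0]]))  # ≈ [[0.090, 0.245, 0.665]]
```

Fusing this into one kernel avoids materializing the intermediate exponentials, which is where the CLIP speedup comes from.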
Full Changelog: v0.1.26...v0.2.0
Install mistralrs-server 0.2.0
Install prebuilt binaries via shell script
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/EricLBuehler/mistral.rs/releases/download/v0.2.0/mistralrs-server-installer.sh | sh
Download mistralrs-server 0.2.0
File | Platform | Checksum |
---|---|---|
mistralrs-server-aarch64-apple-darwin.tar.xz | Apple Silicon macOS | checksum |
mistralrs-server-x86_64-apple-darwin.tar.xz | Intel macOS | checksum |
mistralrs-server-x86_64-unknown-linux-gnu.tar.xz | x64 Linux | checksum |