📢 Llama 3.1 is here, and we're actively compressing it! 📢

Meta unveiled their latest Llama series, featuring an impressive 405-billion-parameter model that surpasses OpenAI's GPT-4o. This milestone is a significant boost for open source and the AI community, although the largest model now requires multiple servers to host (810 GB!). Model compression is crucial.

Our (Neural Magic) Llama 3.1 compression project is underway, aiming for cost-effective and sustainable deployments without compromising accuracy. The FP8-quantized Llama 3.1 8B model has already achieved over 99% accuracy recovery, with detailed accuracy metrics and deployment guidelines available. We've also introduced FP8 model support for all Llama versions in vLLM for immediate use.

Explore the latest models here:
- Meta-Llama-3.1-8B-Instruct-FP8: https://lnkd.in/dDAcXAAY
- Meta-Llama-3.1-8B-Instruct-FP8-dynamic: https://lnkd.in/djtw4GMr

For more insights, visit the vLLM Llama 3.1 blog:
- https://lnkd.in/dnZsvKjy

Stay tuned for further updates; I'll be sharing more posts in the days ahead!

#LLMs #vLLM #AI #MachineLearning #Quantization #NeuralMagic
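To give a feel for what "FP8 quantization with over 99% recovery" means, here is a minimal NumPy sketch that simulates per-tensor FP8 E4M3 quantization (the format has 3 mantissa bits and a max representable value of 448) and measures how much of the original tensor survives the round trip. This is an illustrative toy, not Neural Magic's actual compression pipeline: the function name, the simple max-based scaling recipe, and the mantissa-rounding trick are assumptions made for the demo.

```python
import numpy as np

def quantize_fp8_e4m3_sim(x: np.ndarray):
    """Simulate per-tensor FP8 E4M3 quantization (illustrative sketch).

    Returns the simulated-FP8 values and the per-tensor scale.
    """
    FP8_E4M3_MAX = 448.0  # largest finite E4M3 value
    scale = np.abs(x).max() / FP8_E4M3_MAX
    scaled = x / scale
    # E4M3 keeps 3 explicit mantissa bits (plus the implicit leading bit),
    # so within each binade there are 16 steps of the frexp mantissa.
    m, e = np.frexp(scaled)          # scaled == m * 2**e, |m| in [0.5, 1)
    m = np.round(m * 16.0) / 16.0    # round mantissa to E4M3 precision
    q = np.clip(np.ldexp(m, e), -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale

# Round-trip a random weight tensor and measure the relative error.
rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)
q, scale = quantize_fp8_e4m3_sim(w)
w_hat = q * scale  # dequantize
rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
print(f"relative reconstruction error: {rel_err:.4f}")  # typically a few percent
```

The per-tensor scale maps the weight range onto the FP8 dynamic range; the "dynamic" model variant linked above instead computes activation scales on the fly at inference time. Once quantized, such checkpoints load in vLLM exactly like any other model, e.g. `vllm.LLM(model="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8")`.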