New ML networks far outperform old ones. What can you do with old hardware? Nothing. Your hardware needs to be programmable to use the new networks. https://lnkd.in/g8KeaZnB
-
Check out the new blog post from Quadric about how the Chimera GPNPU can help future-proof your AI/ML SoC for the ML networks that will inevitably come down the road.
New ML Networks Far Outperform Old Standbys
https://quadric.io
-
LLMs continue to push the boundaries of performance and compute needs. The sheer scale of these models means a 70B parameter model needs a whopping ~280GB of memory in FP32, which cannot be accommodated on any single accelerator. The challenge? Deploying these giant models in enterprise settings without breaking the bank on inference infrastructure. Cerebras Systems and Qualcomm are working together to deliver inference performance improvements through hardware-aware LLM training and deployment. 👉 Read the blog for a deep dive into these cutting-edge techniques. https://lnkd.in/gZJ3qg6V
Cerebras and Qualcomm Unleash ~10X Inference Performance Boost with Hardware-Aware LLM Training - Cerebras
https://www.cerebras.net
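A quick sanity check of the arithmetic above (weights only, ignoring KV cache and activations); the helper function and precision list below are illustrative, not from the blog post:

```python
# Back-of-envelope weight memory for a 70B-parameter model at various precisions.
def weight_memory_gb(n_params_billion: float, bytes_per_param: float) -> float:
    return n_params_billion * 1e9 * bytes_per_param / 1e9  # decimal GB

for precision, nbytes in [("FP32", 4), ("FP16/BF16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"70B @ {precision}: {weight_memory_gb(70, nbytes):.0f} GB")
# FP32: 280 GB, FP16/BF16: 140 GB, INT8: 70 GB, INT4: 35 GB
```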
-
SemiAnalysis wrote a great deep dive into Groq's cost. In short, it's expensive, and they conclude: "The question that really matters though, is if low latency small model inference is a large enough market on its own, and if it is, is it worth having specialized infrastructure when flexible GPU infrastructure can get close to the same cost and be redeployed for throughput or large model applications fairly easily." So what do you think 🤔? Is it worth having specialized infrastructure (Groq), when flexible GPU infrastructure can get close to the same cost?
Pretty much everyone has been doing the math on Groq inference wrong. We go through it in detail, including chip, package, system, networking, and power. We compare throughput- vs latency-optimized H100 systems, future model scaling, and speculative decoding. https://lnkd.in/dhcaSb9v
Groq Inference Tokenomics: Speed, But At What Cost?
semianalysis.com
-
That 2-second inference time is good, but the trade-off is that you need 72 RU (576 LPUs * 230MB SRAM, 8 LPUs/RU) instead of 1U (2 H100s * 80GB HBM3 with SuperMicro's 1U 2-GPU line) to serve Mixtral 8x7B comfortably. Another good thread on this is on Hacker News at https://lnkd.in/d2TtxhmN
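A rough back-of-envelope sketch of that footprint comparison; the FP16 serving assumption and helper names are mine, while the 576-LPU and 2-GPU-per-1U figures come from the comment above:

```python
# Illustrative numbers only: Mixtral 8x7B has ~47B parameters, assumed served in FP16.
model_gb = 47e9 * 2 / 1e9                # ~94 GB of weights at 2 bytes/param

# Groq: 230 MB of SRAM per LPU, 8 LPUs per rack unit.
lpus_for_weights = model_gb / 0.230      # ~409 LPUs just to hold the weights
groq_ru = 576 / 8                        # 72 RU for the 576-LPU deployment cited

# NVIDIA: 80 GB of HBM3 per H100, 2 GPUs in a 1U SuperMicro chassis.
h100s_needed = -(-model_gb // 80)        # ceiling division -> 2 GPUs
h100_ru = 1

print(f"Mixtral 8x7B weights: ~{model_gb:.0f} GB")
print(f"Groq: {lpus_for_weights:.0f}+ LPUs to fit weights, {groq_ru:.0f} RU as deployed")
print(f"H100: {h100s_needed:.0f} GPUs in {h100_ru} RU")
```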
-
I tested this up and down with a Phi-1.5 model. Works very nicely. There are optimizations that can be made, for sure, but it does exactly what I wanted it to do. Fine-tuning is dead. This algorithm gives an LLM a 'long-term memory' instead. You can adjust its size depending on your needs and your hardware. More memory = more GPU. Simple equation. In theory, though, you can scale to infinity. Think of it like hooking up a database to the LLM, but you can make the database VERY flexible. You can even make it add to itself via your conversations.
Kadane's Sliding Window: Unlimited Memory For Any LLM Model
turingssolutions.com
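The post doesn't spell out the algorithm, so this is not the author's Kadane-based method; it's only a minimal sketch of the general idea of an external, size-adjustable memory bolted onto an LLM, with a naive keyword-overlap retriever standing in for real similarity search:

```python
# NOT the linked algorithm -- a toy illustration of an LLM "long-term memory".
from collections import deque

class LongTermMemory:
    def __init__(self, max_entries: int = 1000):   # bigger memory = more RAM/GPU
        self.entries = deque(maxlen=max_entries)

    def add(self, text: str):
        """Grow the memory from the conversation itself."""
        self.entries.append(text)

    def retrieve(self, query: str, k: int = 3):
        """Return the k stored snippets with the most word overlap with the query."""
        q_words = set(query.lower().split())
        scored = sorted(self.entries,
                        key=lambda e: len(q_words & set(e.lower().split())),
                        reverse=True)
        return scored[:k]

memory = LongTermMemory(max_entries=1000)
memory.add("The user prefers responses in bullet points.")
memory.add("The user's project targets a Raspberry Pi 4.")

prompt = "How should I optimize my model for my project hardware?"
context = "\n".join(memory.retrieve(prompt))
# llm_generate(context + "\n" + prompt)   # feed the retrieved memories to any LLM
```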
-
Great guide on optimizing Ultralytics YOLOv8 models for NCNN format – essential reading for anyone looking to deploy on mobile or embedded systems. #YOLOv8 #ComputerVision #ModelOptimization
Optimize Ultralytics YOLOv8 models for efficiency: A guide to exporting to NCNN format 🔥 Deploying computer vision models on devices with limited resources, such as mobile or embedded systems, presents unique challenges. By optimizing your Ultralytics YOLOv8 models for lightweight deployment through conversion to NCNN format, you can significantly improve their performance on a wide range of devices. This guide offers step-by-step instructions to seamlessly convert your models to NCNN format, ensuring they operate efficiently on mobile and embedded platforms. Learn more ➡️ https://lnkd.in/ecPeumqa #computervision #ncnn #objectdetection #yolov8
NCNN
docs.ultralytics.com
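For reference, the export-and-load flow from the linked Ultralytics docs looks roughly like this (the model name and image path are placeholders):

```python
from ultralytics import YOLO

# Load a pretrained YOLOv8 model and export it to NCNN format.
model = YOLO("yolov8n.pt")
model.export(format="ncnn")            # writes a 'yolov8n_ncnn_model' directory

# Load the exported NCNN model and run inference as usual.
ncnn_model = YOLO("yolov8n_ncnn_model")
results = ncnn_model("path/to/image.jpg")
```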
-
Living at the intersection of infrastructure and the current push around Transformer models has really started to become fun! Here we see the ML community rediscovering lessons learned from building message-passing frameworks that require sequential control, channel access, and ordering [1], otherwise known as Token Ring. It's also interesting that TPUv4 pods are backed by optical interconnects, creating ToR optical rings between TPUs. [2] Another case of the hardware lottery? [3] [1]: https://lnkd.in/es76iTze [2]: https://lnkd.in/eR2TY7tX [3]: https://lnkd.in/eUqVV5SQ
Ring Attention with Blockwise Transformers for Near-Infinite Context
arxiv.org
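Not from the paper itself, but here is a toy single-process sketch of the blockwise idea: each "device" keeps its query block while the KV blocks hop around the ring, and attention is accumulated with a numerically stable online softmax. The block shapes and pure-NumPy setup are assumptions for illustration.

```python
import numpy as np

def ring_attention(q_blocks, k_blocks, v_blocks):
    """Simulate ring attention on one machine. Each 'device' i owns q_blocks[i]
    and starts with (k_blocks[i], v_blocks[i]); KV blocks rotate one hop per step."""
    n = len(q_blocks)
    d = q_blocks[0].shape[-1]
    out = [np.zeros_like(q) for q in q_blocks]              # unnormalized numerators
    denom = [np.zeros(q.shape[0]) for q in q_blocks]        # softmax denominators
    row_max = [np.full(q.shape[0], -np.inf) for q in q_blocks]

    kv = list(zip(k_blocks, v_blocks))
    for _ in range(n):                  # n hops around the ring
        for i in range(n):              # "devices" run in parallel (serial here)
            k, v = kv[i]
            scores = q_blocks[i] @ k.T / np.sqrt(d)
            new_max = np.maximum(row_max[i], scores.max(axis=-1))
            scale = np.exp(row_max[i] - new_max)             # rescale old stats
            p = np.exp(scores - new_max[:, None])
            out[i] = out[i] * scale[:, None] + p @ v
            denom[i] = denom[i] * scale + p.sum(axis=-1)
            row_max[i] = new_max
        kv = kv[-1:] + kv[:-1]          # rotate KV blocks one hop around the ring
    return [o / dnm[:, None] for o, dnm in zip(out, denom)]

# Sanity check against vanilla full attention.
rng = np.random.default_rng(0)
blocks = [rng.standard_normal((4, 8)) for _ in range(3)]
ring_out = np.concatenate(ring_attention(blocks, blocks, blocks))
qf = kf = vf = np.concatenate(blocks)
s = qf @ kf.T / np.sqrt(8)
w = np.exp(s - s.max(-1, keepdims=True))
assert np.allclose(ring_out, (w @ vf) / w.sum(-1, keepdims=True))
```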
-
What do you do when you can't get your hands on a bunch of A100s and H100s for LLM inference? You combine them with consumer-grade GPUs so that you can distribute operations to the most suitable type of hardware. "Attention offloading" creates a heterogeneous hardware stack and sends KV cache and attention operations to memory-optimized hardware to keep high-end accelerators for compute-intensive ops. The result: You cut the costs of inference by a considerable margin and get a hefty boost in throughput for the same price. One of the most interesting papers I've reviewed in a while. https://lnkd.in/eJxZuRGx
How attention offloading reduces the costs of LLM inference at scale
https://venturebeat.com
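The article doesn't reproduce the paper's exact scheme, so the PyTorch sketch below only illustrates the general split: projection weights live on a compute-optimized device, the KV cache and the attention step live on a cheaper memory-rich device, and only small activations cross the link. Device names, dimensions, and the single decode layer are illustrative assumptions (swap "cuda:1" for "cpu" if you only have one GPU).

```python
import torch

compute_dev = torch.device("cuda:0")   # compute-optimized accelerator (illustrative)
memory_dev  = torch.device("cuda:1")   # memory-optimized / consumer GPU (illustrative)

d_model, n_heads = 4096, 32
head_dim = d_model // n_heads

# Projection weights stay on the compute device: these are the FLOP-heavy matmuls.
wq = torch.randn(d_model, d_model, device=compute_dev)
wk = torch.randn(d_model, d_model, device=compute_dev)
wv = torch.randn(d_model, d_model, device=compute_dev)
wo = torch.randn(d_model, d_model, device=compute_dev)

# The KV cache lives on the memory device: large, bandwidth-bound, low arithmetic intensity.
k_cache = torch.empty(0, n_heads, head_dim, device=memory_dev)
v_cache = torch.empty(0, n_heads, head_dim, device=memory_dev)

def decode_step(x):
    """One decode step for a single token embedding x of shape (d_model,)."""
    global k_cache, v_cache
    x = x.to(compute_dev)
    q = (x @ wq).view(n_heads, head_dim)
    k = (x @ wk).view(n_heads, head_dim)
    v = (x @ wv).view(n_heads, head_dim)

    # Offload: ship only the new K/V to the memory device and append to the cache.
    k_cache = torch.cat([k_cache, k.to(memory_dev).unsqueeze(0)], dim=0)
    v_cache = torch.cat([v_cache, v.to(memory_dev).unsqueeze(0)], dim=0)

    # Attention runs where the cache lives, so the cache never crosses the link.
    q_m = q.to(memory_dev)
    scores = torch.einsum("hd,thd->ht", q_m, k_cache) / head_dim**0.5
    attn = torch.softmax(scores, dim=-1)
    ctx = torch.einsum("ht,thd->hd", attn, v_cache)

    # Only the small context vector returns for the FLOP-heavy output projection.
    return ctx.reshape(d_model).to(compute_dev) @ wo
```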
-
AI in the palm of your hand.
Snapdragon Computex 2024 Keynote: The PC Reborn
https://www.youtube.com/