Emergence’s Post


Our Senior Machine Learning Engineer, Aakash Nain, released a new paper on the flow of information in a pre-trained transformer, in collaboration with Sakana AI. Read it now 👇

Aakash Nain

Senior ML Engineer | Keras Core collaborator | TensorFlow Addons Maintainer | Google Developers Expert in Machine Learning

Very happy to present our latest paper: Transformer Layers as Painters. Through this paper, we aim to understand the flow of information in a pretrained transformer. We present a series of experiments on both decoder-only and encoder-only frozen transformer models. **Note that we do not perform any kind of fine-tuning on these pretrained models.** With experiments on a diverse set of benchmarks covering both model types (decoder-only, e.g. Llama- and Mistral-like models, and encoder-only, e.g. BERT), we answer the following questions:

1. Do layers speak the same language?
2. Are all layers necessary?
3. Are the middle layers all doing the same thing?
4. Does the layer order matter?
5. Can we run the layers in parallel?
6. Does the order matter for some tasks more than others?
7. Does looping help parallelized layers?
8. Which design variants are least harmful?

I will provide the link to the full summary and the paper in the comments section. This was a fun collaboration between Sakana AI and Emergence, and I take immense pride in collaborating with Marc Pickett, Llion Jones, and Qi Sun. Enjoy reading the paper! 🍻
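To make the kind of manipulation these questions refer to concrete, here is a minimal sketch (not the authors' code) of how one might skip or reorder the layers of a frozen, Hugging Face-style decoder-only model at inference time. The checkpoint name and the `reorder_layers` helper are illustrative assumptions, not taken from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint; any Llama/Mistral-style decoder-only model would do.
model_name = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()  # frozen model: inference only, no fine-tuning

def reorder_layers(model, order):
    """Skip or reorder the frozen decoder blocks by rebuilding the ModuleList.
    `order` is a list of original layer indices; the weights are untouched."""
    layers = model.model.layers
    model.model.layers = torch.nn.ModuleList([layers[i] for i in order])
    return model

# Example: skip a single middle layer, then generate and compare behaviour
# against the unmodified model.
n = len(model.model.layers)
skip_middle = [i for i in range(n) if i != n // 2]
model = reorder_layers(model, skip_middle)

inputs = tok("Transformer layers as painters:", return_tensors="pt")
with torch.no_grad():
    # use_cache=False sidesteps KV-cache bookkeeping tied to the original layer indices.
    out = model.generate(**inputs, max_new_tokens=20, use_cache=False)
print(tok.decode(out[0], skip_special_tokens=True))
```

Variants of the same idea, such as permuting or repeating the middle layers, can be probed by changing `order`; the paper describes the actual experimental setups and results.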
