SylphAI’s Post

SylphAI reposted this

Li Yin

Author of AdalFlow | AI researcher | x MetaAI

There is no best model, only the best model for your task and your data.

We benchmarked a classification task across major LLMs with few-shot ICL: GPT-4o, Gemini-Pro-1.5, Llama3, Claude3, and Mixtral, using the same prompt and the same temperature. Three major findings:

1. You never know how models will perform on your task until you try:
- GPT-4o is claimed to be better than GPT-4-turbo, but it falls 5% short of even GPT-3.5. And GPT-3.5 can be almost as competitive as the GPT-4 models given the right task.
- Gemini-Pro-1.5 is surprisingly the best, even at zero-shot.
- Llama3-70B and Claude3-opus are on par for this task.

2. Few-shot ICL is not always necessary. For some models it boosts performance; for others it causes a regression.

3. The same prompt, tuned on one model, can be applied directly to almost all models and still perform best across all of them.

What are your experiences? 🤔

#ml #lightrag #llms #artificialintelligence

____________

This is part of the work on LightRAG, the "PyTorch" library for LLM applications. Hit 🔔 to stay updated.
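The comparison described above can be sketched as a small harness. This is a minimal sketch, not the original LightRAG code: `call_model` is a hypothetical stand-in for each provider's API, and the model names and prompt template are illustrative.

```python
# Minimal sketch of a cross-model classification benchmark:
# the SAME prompt and temperature are sent to every model.

MODELS = ["gpt-4o", "gpt-3.5-turbo", "gemini-1.5-pro", "llama3-70b", "claude-3-opus"]

PROMPT = "Classify the following text into one of: {labels}.\nText: {text}\nLabel:"

def call_model(model: str, prompt: str, temperature: float = 0.0) -> str:
    """Hypothetical: route the prompt to the given model and return its reply.
    Swap in the real OpenAI / Anthropic / Gemini client here."""
    raise NotImplementedError

def accuracy(preds, golds):
    correct = sum(p == g for p, g in zip(preds, golds))
    return correct / len(golds)

def benchmark(dataset, labels, model_fn=call_model):
    """dataset: list of (text, gold_label). Returns accuracy per model."""
    results = {}
    for model in MODELS:
        preds = []
        for text, _gold in dataset:
            reply = model_fn(model, PROMPT.format(labels=", ".join(labels), text=text))
            preds.append(reply.strip())
        results[model] = accuracy(preds, [g for _, g in dataset])
    return results
```

The key point is that nothing varies across models except the model name, which is what makes the per-task comparison fair.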

Li Yin

Author of AdalFlow | AI researcher | x MetaAI

2mo

For those who like the tech details: (1) the prompt is tuned on llama3-8b and adapted to the other models without change; (2) chain of thought is used for zero-shot, and it really helps on this classification task; (3) we output YAML with three fields: thought, class_name, class_label.
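A minimal sketch of parsing that three-field reply. The field names follow the comment above; the hand-rolled parser is an assumption for illustration, since a real pipeline would use a YAML library such as PyYAML, and the integer coercion of `class_label` is a guess about the label format.

```python
def parse_response(raw: str) -> dict:
    """Parse a flat YAML-style reply with fields: thought, class_name, class_label."""
    fields = {}
    for line in raw.strip().splitlines():
        if ":" not in line:
            continue  # skip stray lines the model may emit
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip().strip('"')
    # assumed: class_label is an integer index
    if "class_label" in fields:
        fields["class_label"] = int(fields["class_label"])
    return fields

reply = """\
thought: The text mentions a refund request, which is a billing issue.
class_name: billing
class_label: 2
"""
parsed = parse_response(reply)
```

Emitting `thought` first is what makes this a chain-of-thought output: the model reasons before committing to a class.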

Haozhu Wang

Applied Scientist at Amazon Bedrock | Foundation Models | Generative AI

2mo

Different models often require prompts optimized specifically for that model to achieve optimal performance on the downstream task. Do you mind sharing more details on how you picked the prompt that's used across all models? Would love to see a more detailed technical report on your study.

Interesting, are the experiment details open-sourced or available anywhere?

Syed Muzani

Mobile Engineer at Xendit

2mo

The task matters too. Gemini Flash has been amazing at 10-shot poetry writing, beating even GPT-4. Claude does poorly at creative writing, but for general purposes, even Claude Haiku has been more effective than GPT-3.5. GPT-3 was still one of my favorites, but they seem to have nerfed it a lot for "safety" and I can no longer get it to write good poetry. Out of all these, I tend to like GPT-3.5 the second least (after mixtral). I'm surprised it scores so high on these benchmarks.

Rakesh Gohel

Founder at JUTEQ | Empowering Businesses through Cloud Transformation & Solutions | Specializing in Cloud Architecture & Consultation | Generative AI | Entrepreneurship & Leadership | Let's connect & innovate together!🌟

2mo

The fact that a single well-tuned prompt can be effective across multiple models is encouraging. It implies that once a good prompt is crafted, it can be a versatile tool, saving time and effort in re-engineering prompts for different models. In my experience, it's essential to maintain a flexible approach. Regularly benchmarking and reassessing models as they evolve can provide continuous improvements.

🚀 Jeremy ARANCIO

NLP Machine Learning Engineer - AWS - Contractor

2mo

Great point! I also created a custom benchmark in my current project, and despite all the leaderboards, GPT-3.5-turbo performs better on the task than GPT-4-turbo, Claude 3, and Gemini-1.5-Pro. It's not about what these models can do overall; it's about how they perform on your specific task.

Hasan Rafiq

Google AI | Lead ML Engineer | Demystifying AI

2mo

Exactly. The old saying "it's not about the best ML algorithm out there, but the one that works best for you" now becomes "it's not about the best GenAI model, but the one that works best for you."

Hashem Alsaket

Principal AI/ML Engineer - Tech Lead

2mo

Great work. NFL (no free lunch).

Ram Seshadri

Google Machine Learning Program Manager, formerly Data Scientist @ Morgan Stanley, Instructor @ General Assembly & NYIF. Creator of popular Coursera specialization "Machine Learning for Trading and Finance"

2mo

Fantastic study Li Yin! I would request that you also share a screenshot of the macro-F1 scores for the various models, since I suspect those macro scores will be much lower than the micro-F1 scores, which will be skewed by the dominant class.
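To illustrate this comment's point: with one dominant class, micro-F1 (which for single-label multiclass classification equals accuracy) can look healthy while macro-F1 exposes weak minority-class performance. A self-contained sketch with made-up data:

```python
def per_class_f1(golds, preds, label):
    """F1 for one class, treating it as the positive class."""
    tp = sum(1 for g, p in zip(golds, preds) if g == label and p == label)
    fp = sum(1 for g, p in zip(golds, preds) if g != label and p == label)
    fn = sum(1 for g, p in zip(golds, preds) if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(golds, preds):
    """Unweighted mean of per-class F1: every class counts equally."""
    labels = sorted(set(golds) | set(preds))
    return sum(per_class_f1(golds, preds, label) for label in labels) / len(labels)

def micro_f1(golds, preds):
    """For single-label multiclass, micro-F1 reduces to accuracy."""
    return sum(g == p for g, p in zip(golds, preds)) / len(golds)

# 9 examples of class A, 1 of class B; a model that always predicts A
golds = ["A"] * 9 + ["B"]
preds = ["A"] * 10
```

Here micro-F1 is 0.9 while macro-F1 is roughly 0.47, because class B contributes an F1 of zero that the dominant class cannot hide.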
