AI2 Incubator’s Post

Vu Ha

I work with founders building AI products and early (pre-seed) startups

For the second week in a row, Harmonious' spotlight paper is about a new benchmark: BLINK: Multimodal Large Language Models Can See but Not Perceive. Its authors are with UPenn, U Washington, AI2, UC Davis, and Columbia U. BLINK is a benchmark of 14 visual perception tasks that humans can solve “within a blink” but that pose significant challenges for current multimodal LLMs, because they resist mediation through natural language (e.g., dense captioning). While humans reach 96% accuracy, the best-performing models, GPT-4V, Gemini Pro, and Claude 3 Opus, achieve accuracies of only 51%, 45%, and 43% respectively, not much better than random guessing (38%). This indicates that such perception abilities have not yet “emerged” in recent multimodal LLMs. Notably, on certain tasks some multimodal LLMs even underperform random guessing, while specialist computer vision models solve these problems much better.

Read our analysis on Harmonious at https://lnkd.in/gTMwH72C, where we discuss related topics such as Moravec's paradox, System 1 and System 2 thinking, and the path to AGI, along with recommendations for practitioners. Sign up at Harmonious.ai to never miss our weekly paper roundup! #harmonious #ai2incubator

Weekly paper roundup: BLINK: multimodal LLMs can see but not perceive (4/15/2024)

harmonious.ai
