- Agent AI: Surveying the Horizons of Multimodal Interaction [arXiv 2401] [paper]
- MM-LLMs: Recent Advances in MultiModal Large Language Models [arXiv 2401] [paper]
- Chinese-Llama-2-7b: https://github.com/LinkSoul-AI/Chinese-Llama-2-7b
- Chinese-LLaMA-Alpaca-2: https://github.com/ymcui/Chinese-LLaMA-Alpaca-2
- Llama2-Chinese: https://github.com/LlamaFamily/Llama2-Chinese
- Sequential Modeling Enables Scalable Learning for Large Vision Models [arXiv 2312] [paper] [code] (💥Visual GPT Time?)
- UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding [arXiv 2401] [paper] [code]
- SAM: Segment Anything Model [ICCV 2023 Best Paper Honorable Mention] [paper] [code]
- SSA: Semantic Segment Anything [github 2023] [paper] [code]
- SEEM: Segment Everything Everywhere All at Once [arXiv 2304] [paper] [code]
- RAM: Recognize Anything - A Strong Image Tagging Model [arXiv 2306] [paper] [code]
- Semantic-SAM: Segment and Recognize Anything at Any Granularity [arXiv 2307] [paper] [code]
- UNINEXT: Universal Instance Perception as Object Discovery and Retrieval [CVPR 2023] [paper] [code]
- APE: Aligning and Prompting Everything All at Once for Universal Visual Perception [arXiv 2312] [paper] [code]
- GLEE: General Object Foundation Model for Images and Videos at Scale [arXiv 2312] [paper] [code]
- OMG-Seg: Is One Model Good Enough For All Segmentation? [arXiv 2401] [paper] [code](https://github.com/lxtGH/OMG-Seg)
- Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data [arXiv 2401] [paper] [code](https://github.com/LiheYoung/Depth-Anything)
- ClipSAM: CLIP and SAM Collaboration for Zero-Shot Anomaly Segmentation [arXiv 2401] [paper] [code](https://github.com/Lszcoding/ClipSAM)
- PA-SAM: Prompt Adapter SAM for High-Quality Image Segmentation [arXiv 2401] [paper] [code](https://github.com/xzz2/pa-sam)
- YOLO-World: Real-Time Open-Vocabulary Object Detection [arXiv 2401] [paper] [code](https://github.com/AILab-CVC/YOLO-World)
| Model | Vision | Projector | LLM | OKVQA | GQA | VSR | IconVQA | VizWiz | HM | VQAv2 | SQA-I | VQA-T | POPE | MME-P | MME-C | MMB | MMB-CN | SEED-I | LLaVA-W | MM-Vet | QBench |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MiniGPT-v2 | EVA-Clip-g | Linear | LLaMA-2-7B | 56.9² | 60.3 | 60.6² | 47.7² | 32.9 | 58.2² | | | | | | | | | | | | |
| MiniGPT-v2-Chat | EVA-Clip-g | Linear | LLaMA-2-7B | 57.8¹ | 60.1 | 62.9¹ | 51.5¹ | 53.6 | 58.8¹ | | | | | | | | | | | | |
| Qwen-VL-Chat | | | Qwen-7B | | 57.5∗ | | | 38.9 | | 78.2∗ | 68.2 | 61.5 | | 1487.5 | 360.7² | 60.6 | 56.7 | 58.2 | | | |
| LLaVA-1.5 | | | Vicuna-1.5-7B | | 62.0∗ | | | 50.0 | | 78.5∗ | 66.8 | 58.2 | 85.9¹ | 1510.7 | 316.1 | 64.3 | 58.3 | 58.6 | 63.4 | 30.5 | 58.7 |
| LLaVA-1.5 ShareGPT4V | | | Vicuna-1.5-7B | | | | | 57.2 | | 80.6² | 68.4 | | | 1567.4² | 376.4¹ | 68.8 | 62.2 | 69.7¹ | 72.6 | 37.6 | 63.4¹∗ |
| LLaVA-1.5 | | | Vicuna-1.5-13B | | 63.3¹ | | | 53.6 | | 80.0∗ | 71.6 | 61.3 | 85.9¹ | 1531.3 | 295.4 | 67.7 | 63.6 | 61.6 | 70.7 | 35.4 | 62.1²∗ |
| VILA-7B | | | LLaMA-2-7B | | 62.3∗ | | | 57.8 | | 79.9∗ | 68.2 | 64.4 | 85.5²∗ | 1533.0 | | 68.9 | 61.7 | 61.1 | 69.7 | 34.9 | |
| VILA-13B | | | LLaMA-2-13B | | 63.3¹∗ | | | 60.6² | | 80.8¹∗ | 73.7¹∗ | 66.6¹∗ | 84.2 | 1570.1¹∗ | | 70.3²∗ | 64.3²∗ | 62.8²∗ | 73.0²∗ | 38.8²∗ | |
| VILA-13B ShareGPT4V | | | LLaMA-2-13B | | 63.2²∗ | | | 62.4¹ | | 80.6²∗ | 73.1²∗ | 65.3²∗ | 84.8 | 1556.5 | | 70.8¹∗ | 65.4¹∗ | 61.4 | 78.4¹∗ | 45.7¹∗ | |
| SPHINX | | | | | | | | | | | | | | | | | | | | | |
| SPHINX-Plus | | | | | | | | | | | | | | | | | | | | | |
| SPHINX-Plus-2K | | | | | | | | | | | | | | | | | | | | | |
| SPHINX-MoE | | | | | | | | | | | | | | | | | | | | | |
| InternVL | | | | | | | | | | | | | | | | | | | | | |
| LLaVA-1.6 | | | | | | | | | | | | | | | | | | | | | |
indicates ShareGPT4V's (Chen et al., 2023e) re-implemented test results.
∗ indicates that the training images of the datasets are observed during training.
- LAVIS: A Library for Language-Vision Intelligence [ACL 2023] [paper] [code]
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models [ICML 2023] [paper] [code]
- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning [arXiv 2305] [paper] [code]
- MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models [arXiv 2304] [paper] [code]
- MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-task Learning [github 2310] [paper] [code]
- VisualGLM-6B: Chinese and English multimodal conversational language model [ACL 2022] [paper] [code]
- Kosmos-2: Grounding Multimodal Large Language Models to the World [arXiv 2306] [paper] [code]
- NExT-GPT: Any-to-Any Multimodal LLM [arXiv 2309] [paper] [code]
- LLaVA / LLaVA-1.5: Large Language and Vision Assistant [NeurIPS 2023] [paper] [arXiv 2310] [paper] [code]
- 🦉mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [arXiv 2304] [paper] [code]
- 🦉mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration [arXiv 2311] [paper] [code]
- VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks [arXiv 2305] [paper] [code]
- 🦅Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic [arXiv 2306] [paper] [code]
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond [arXiv 2308] [paper] [code]
- LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [arXiv 2309] [paper] [code]
- AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model [arXiv 2309] [paper] [code]
- InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition [arXiv 2309] [paper] [code]
- MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens [arXiv 2310] [paper] [code]
- CogVLM: Visual Expert for Large Language Models [github 2310] [paper] [code]
- 🐦Woodpecker: Hallucination Correction for Multimodal Large Language Models [arXiv 2310] [paper] [code]
- SoM: Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V [arXiv 2310] [paper] [code]
- Ferret: Refer and Ground Anything Anywhere at Any Granularity [arXiv 2310] [paper] [code]
- 🦦OtterHD: A High-Resolution Multi-modality Model [arXiv 2311] [paper] [code]
- NExT-Chat: An LMM for Chat, Detection and Segmentation [arXiv 2311] [paper] [project]
- Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models [arXiv 2311] [paper] [code]
- InfMLLM: A Unified Framework for Visual-Language Tasks [arXiv 2311] [paper] [code]
- Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (FLD-5B) [arXiv 2311] [paper] [code] [dataset]
- 🦁LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge [arXiv 2311] [paper] [code]
- 🐵Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models [arXiv 2311] [paper] [code]
- CG-VLM: Contrastive Vision-Language Alignment Makes Efficient Instruction Learner [arXiv 2311] [paper] [code]
- 🐲PixelLM: Pixel Reasoning with Large Multimodal Model [arXiv 2312] [paper] [code]
- 🐝Honeybee: Locality-enhanced Projector for Multimodal LLM [arXiv 2312] [paper] [code]
- VILA: On Pre-training for Visual Language Models [arXiv 2312] [paper] [code]
- CogAgent: A Visual Language Model for GUI Agents [arXiv 2312] [paper] [code] (support 1120×1120 resolution)
- PixelLLM: Pixel Aligned Language Models [arXiv 2312] [paper] [code]
- 🦅Osprey: Pixel Understanding with Visual Instruction Tuning [arXiv 2312] [paper] [code]
- Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action [arXiv 2312] [paper] [code]
- VistaLLM: Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [arXiv 2312] [paper] [code]
- Emu2: Generative Multimodal Models are In-Context Learners [arXiv 2312] [paper] [code]
- V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs [arXiv 2312] [paper] [code]
- BakLLaVA-1: a Mistral-7B base augmented with the LLaVA-1.5 architecture [github 2310] [paper] [code]
- LEGO: Language Enhanced Multi-modal Grounding Model [arXiv 2401] [paper] [code]
- MMVP: Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs [arXiv 2401] [paper] [code]
- ModaVerse: Efficiently Transforming Modalities with LLMs [arXiv 2401] [paper] [code]
- MoE-LLaVA: Mixture of Experts for Large Vision-Language Models [arXiv 2401] [paper] [code]
- LLaVA-MoLE: Sparse Mixture of LoRA Experts for Mitigating Data Conflicts in Instruction Finetuning MLLMs [arXiv 2401] [paper] [code]
- 🎓InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Models [arXiv 2401] [paper] [code]
- MouSi: Poly-Visual-Expert Vision-Language Models [arXiv 2401] [paper] [code]
- Yi Vision Language Model [HF 2401]
- Generating Images with Multimodal Language Models [NeurIPS 2023] [paper] [code]
- DreamLLM: Synergistic Multimodal Comprehension and Creation [arXiv 2309] [paper] [code]
- Guiding Instruction-based Image Editing via Multimodal Large Language Models [arXiv 2309] [paper] [code]
- KOSMOS-G: Generating Images in Context with Multimodal Large Language Models [arXiv 2310] [paper] [code]
- LLMGA: Multimodal Large Language Model based Generation Assistant [arXiv 2311] [paper] [code]
- UniAD: Planning-oriented Autonomous Driving [CVPR 2023] [paper] [code]
- Scene as Occupancy [arXiv 2306] [paper] [code]
- FusionAD: Multi-modality Fusion for Prediction and Planning Tasks of Autonomous Driving [arXiv 2308] [paper] [code]
- BEVGPT: Generative Pre-trained Large Model for Autonomous Driving Prediction, Decision-Making, and Planning [arXiv 2310] [paper] [code]
- UniVision: A Unified Framework for Vision-Centric 3D Perception [arXiv 2401] [paper] [code]
- Drive Like a Human: Rethinking Autonomous Driving with Large Language Models [arXiv 2307] [paper] [code]
- LINGO-1: Exploring Natural Language for Autonomous Driving (Vision-Language-Action Models, VLAMs) [Wayve 2309] [blog]
- DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model [arXiv 2310] [paper] [code]
- VIMA: General Robot Manipulation with Multimodal Prompts [arXiv 2210] [paper] [code]
- PaLM-E: An Embodied Multimodal Language Model [arXiv 2303] [paper] [code]
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [arXiv 2307] [CoRL 2023] [paper] [code]
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [arXiv 2307] [paper] [project]
- RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking [arXiv 2309] [paper] [code]
- MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning [arXiv 2401] [paper] [code]
- EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision [arXiv 2311] [paper] [code]
- ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Real Image [arXiv 2310] [paper] [code]
- Vlogger: Make Your Dream A Vlog [arXiv 2401] [paper] [code]
- BootPIG: Bootstrapping Zero-shot Personalized Image Generation Capabilities in Pretrained Diffusion Models [arXiv 2401] [paper] [code]
- CWM: Unifying (Machine) Vision via Counterfactual World Modeling [arXiv 2306] [paper] [code]
- MILE: Model-Based Imitation Learning for Urban Driving [Wayve 2210] [NeurIPS 2022] [paper] [code] [blog]
- GAIA-1: A Generative World Model for Autonomous Driving [Wayve 2310] [arXiv 2309] [paper] [code]
- ADriver-I: A General World Model for Autonomous Driving [arXiv 2311] [paper] [code]
- OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving [arXiv 2311] [paper] [code]
- LWM: World Model on Million-Length Video and Language with RingAttention [arXiv 2402] [paper] [code]
- Sora: Video generation models as world simulators [openai 2402] [technical report] (💥Visual GPT Time?)
- [Instruction Tuning] FLAN: Finetuned Language Models are Zero-Shot Learners [ICLR 2022] [paper] [code]
- DriveLM: Drive on Language [paper] [project]
- MagicDrive: Street View Generation with Diverse 3D Geometry Control [arXiv 2310] [paper] [code]
- Open X-Embodiment: Robotic Learning Datasets and RT-X Models [paper] [project] [blog]
- To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning (LVIS-Instruct4V) [arXiv 2311] [paper] [code] [dataset]
- Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (FLD-5B) [arXiv 2311] [paper] [code] [dataset]
- ShareGPT4V: Improving Large Multi-Modal Models with Better Captions [paper] [code] [dataset]
- Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model [arXiv 2401] [paper] [code]
- VMamba: Visual State Space Model [arXiv 2401] [paper] [code]
- Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences [arXiv 2401] [paper] [code]
- SenseNova (商汤日日新): SenseTime's open platform for large models [url]