- Agent AI: Surveying the Horizons of Multimodal Interaction [arXiv 2401] [paper]
- MM-LLMs: Recent Advances in MultiModal Large Language Models [arXiv 2401] [paper]
- Chinese-Llama-2-7b: https://github.com/LinkSoul-AI/Chinese-Llama-2-7b
- Chinese-LLaMA-Alpaca-2: https://github.com/ymcui/Chinese-LLaMA-Alpaca-2
- Llama2-Chinese: https://github.com/LlamaFamily/Llama2-Chinese
- Sequential Modeling Enables Scalable Learning for Large Vision Models [arXiv 2312] [paper] [code] (💥Visual GPT Time?)
- UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding [arXiv 2401] [paper] [code]
- SAM: Segment Anything Model [ICCV 2023 Best Paper Honorable Mention] [paper] [code]
- SSA: Semantic Segment Anything [github 2023] [paper] [code]
- SEEM: Segment Everything Everywhere All at Once [arXiv 2304] [paper] [code]
- RAM: Recognize Anything - A Strong Image Tagging Model [arXiv 2306] [paper] [code]
- Semantic-SAM: Segment and Recognize Anything at Any Granularity [arXiv 2307] [paper] [code]
- UNINEXT: Universal Instance Perception as Object Discovery and Retrieval [CVPR 2023] [paper] [code]
- APE: Aligning and Prompting Everything All at Once for Universal Visual Perception [arXiv 2312] [paper] [code]
- GLEE: General Object Foundation Model for Images and Videos at Scale [arXiv 2312] [paper] [code]
- OMG-Seg: Is One Model Good Enough For All Segmentation? [arXiv 2401] [paper] [code](https://github.com/lxtGH/OMG-Seg)
- Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data [arXiv 2401] [paper] [code](https://github.com/LiheYoung/Depth-Anything)
- ClipSAM: CLIP and SAM Collaboration for Zero-Shot Anomaly Segmentation [arXiv 2401] [paper] [code](https://github.com/Lszcoding/ClipSAM)
- PA-SAM: Prompt Adapter SAM for High-Quality Image Segmentation [arXiv 2401] [paper] [code](https://github.com/xzz2/pa-sam)
- YOLO-World: Real-Time Open-Vocabulary Object Detection [arXiv 2401] [paper] [code](https://github.com/AILab-CVC/YOLO-World)
| Model | Vision | Projector | LLM | OKVQA | GQA | VSR | IconVQA | VizWiz | HM | VQAv2 | SQA-I | VQA-T | POPE | MME-P | MME-C | MMB | MMB-CN | SEED-I | LLaVA-W | MM-Vet | QBench |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MiniGPT-v2 | EVA-Clip-g | Linear | LLaMA-2-7B | 56.9² | 60.3 | 60.6² | 47.7² | 32.9 | 58.2² | | | | | | | | | | | | |
| MiniGPT-v2-Chat | EVA-Clip-g | Linear | LLaMA-2-7B | 57.8¹ | 60.1 | 62.9¹ | 51.5¹ | 53.6 | 58.8¹ | | | | | | | | | | | | |
| Qwen-VL-Chat | | | Qwen-7B | | 57.5∗ | | | 38.9 | | 78.2∗ | 68.2 | 61.5 | | 1487.5 | 360.7² | 60.6 | 56.7 | 58.2 | | | |
| LLaVA-1.5 | | | Vicuna-1.5-7B | | 62.0∗ | | | 50.0 | | 78.5∗ | 66.8 | 58.2 | 85.9¹ | 1510.7 | 316.1 | 64.3 | 58.3 | 58.6 | 63.4 | 30.5 | 58.7 |
| LLaVA-1.5 ShareGPT4V | | | Vicuna-1.5-7B | | | | | 57.2 | | 80.6² | 68.4 | | | 1567.4² | 376.4¹ | 68.8 | 62.2 | 69.7¹ | 72.6 | 37.6 | 63.4¹∗ |
| LLaVA-1.5 | | | Vicuna-1.5-13B | | 63.3¹ | | | 53.6 | | 80.0∗ | 71.6 | 61.3 | 85.9¹ | 1531.3 | 295.4 | 67.7 | 63.6 | 61.6 | 70.7 | 35.4 | 62.1²∗ |
| VILA-7B | | | LLaMA-2-7B | | 62.3∗ | | | 57.8 | | 79.9∗ | 68.2 | 64.4 | 85.5²∗ | 1533.0 | | 68.9 | 61.7 | 61.1 | 69.7 | 34.9 | |
| VILA-13B | | | LLaMA-2-13B | | 63.3¹∗ | | | 60.6² | | 80.8¹∗ | 73.7¹∗ | 66.6¹∗ | 84.2 | 1570.1¹∗ | | 70.3²∗ | 64.3²∗ | 62.8²∗ | 73.0²∗ | 38.8²∗ | |
| VILA-13B ShareGPT4V | | | LLaMA-2-13B | | 63.2²∗ | | | 62.4¹ | | 80.6²∗ | 73.1²∗ | 65.3²∗ | 84.8 | 1556.5 | | 70.8¹∗ | 65.4¹∗ | 61.4 | 78.4¹∗ | 45.7¹∗ | |
| SPHINX | | | | | | | | | | | | | | | | | | | | | |
| SPHINX-Plus | | | | | | | | | | | | | | | | | | | | | |
| SPHINX-Plus-2K | | | | | | | | | | | | | | | | | | | | | |
| SPHINX-MoE | | | | | | | | | | | | | | | | | | | | | |
| InternVL | | | | | | | | | | | | | | | | | | | | | |
| LLaVA-1.6 | | | | | | | | | | | | | | | | | | | | | |
indicates ShareGPT4V's (Chen et al., 2023e) re-implemented test results.
∗ indicates that the training images of the datasets are observed during training.
- LAVIS: A Library for Language-Vision Intelligence [ACL 2023] [paper] [code]
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models [ICML 2023] [paper] [code]
- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning [arXiv 2305] [paper] [code]
- MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models [arXiv 2304] [paper] [code]
- MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-task Learning [github 2310] [paper] [code]
- VisualGLM-6B: Chinese and English multimodal conversational language model [ACL 2022] [paper] [code]
- Kosmos-2: Grounding Multimodal Large Language Models to the World [arXiv 2306] [paper] [code]
- NExT-GPT: Any-to-Any Multimodal LLM [arXiv 2309] [paper] [code]
- LLaVA / LLaVA-1.5: Large Language and Vision Assistant [NeurIPS 2023] [paper] [arXiv 2310] [paper] [code]
- 🦉mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [arXiv 2304] [paper] [code]
- 🦉mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration [arXiv 2311] [paper] [code]
- VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks [arXiv 2305] [paper] [code]
- 🦅Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic [arXiv 2306] [paper] [code]
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond [arXiv 2308] [paper] [code]
- LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [arXiv 2309] [paper] [code]
- AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model [arXiv 2309] [paper] [code]
- InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition [arXiv 2309] [paper] [code]
- MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens [arXiv 2310] [paper] [code]
- CogVLM: Visual Expert for Large Language Models [github 2310] [paper] [code]
- 🐦Woodpecker: Hallucination Correction for Multimodal Large Language Models [arXiv 2310] [paper] [code]
- SoM: Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V [arXiv 2310] [paper] [code]
- Ferret: Refer and Ground Anything Anywhere at Any Granularity [arXiv 2310] [paper] [code]
- 🦦OtterHD: A High-Resolution Multi-modality Model [arXiv 2311] [paper] [code]
- NExT-Chat: An LMM for Chat, Detection and Segmentation [arXiv 2311] [paper] [project]
- Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models [arXiv 2311] [paper] [code]
- InfMLLM: A Unified Framework for Visual-Language Tasks [arXiv 2311] [paper] [code]
- Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (FLD-5B) [arXiv 2311] [paper] [code] [dataset]
- 🦁LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge [arXiv 2311] [paper] [code]
- 🐵Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models [arXiv 2311] [paper] [code]
- CG-VLM: Contrastive Vision-Language Alignment Makes Efficient Instruction Learner [arXiv 2311] [paper] [code]
- 🐲PixelLM: Pixel Reasoning with Large Multimodal Model [arXiv 2312] [paper] [code]
- 🐝Honeybee: Locality-enhanced Projector for Multimodal LLM [arXiv 2312] [paper] [code]
- VILA: On Pre-training for Visual Language Models [arXiv 2312] [paper] [code]
- CogAgent: A Visual Language Model for GUI Agents [arXiv 2312] [paper] [code] (support 1120×1120 resolution)
- PixelLLM: Pixel Aligned Language Models [arXiv 2312] [paper] [code]
- 🦅Osprey: Pixel Understanding with Visual Instruction Tuning [arXiv 2312] [paper] [code]
- Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action [arXiv 2312] [paper] [code]
- VistaLLM: Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [arXiv 2312] [paper] [code]
- Emu2: Generative Multimodal Models are In-Context Learners [arXiv 2312] [paper] [code]
- V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs [arXiv 2312] [paper] [code]
- BakLLaVA-1: a Mistral-7B base augmented with the LLaVA-1.5 architecture [github 2310] [paper] [code]
- LEGO: Language Enhanced Multi-modal Grounding Model [arXiv 2401] [paper] [code]
- MMVP: Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs [arXiv 2401] [paper] [code]
- ModaVerse: Efficiently Transforming Modalities with LLMs [arXiv 2401] [paper] [code]
- MoE-LLaVA: Mixture of Experts for Large Vision-Language Models [arXiv 2401] [paper] [code]
- LLaVA-MoLE: Sparse Mixture of LoRA Experts for Mitigating Data Conflicts in Instruction Finetuning MLLMs [arXiv 2401] [paper] [code]
- 🎓InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Models [arXiv 2401] [paper] [code]
- MouSi: Poly-Visual-Expert Vision-Language Models [arXiv 2401] [paper] [code]
- Yi Vision Language Model [HF 2401]
- Generating Images with Multimodal Language Models [NeurIPS 2023] [paper] [code]
- DreamLLM: Synergistic Multimodal Comprehension and Creation [arXiv 2309] [paper] [code]
- Guiding Instruction-based Image Editing via Multimodal Large Language Models [arXiv 2309] [paper] [code]
- KOSMOS-G: Generating Images in Context with Multimodal Large Language Models [arXiv 2310] [paper] [code]
- LLMGA: Multimodal Large Language Model based Generation Assistant [arXiv 2311] [paper] [code]
- UniAD: Planning-oriented Autonomous Driving [CVPR 2023] [paper] [code]
- Scene as Occupancy [arXiv 2306] [paper] [code]
- FusionAD: Multi-modality Fusion for Prediction and Planning Tasks of Autonomous Driving [arXiv 2308] [paper] [code]
- BEVGPT: Generative Pre-trained Large Model for Autonomous Driving Prediction, Decision-Making, and Planning [arXiv 2310] [paper] [code]
- UniVision: A Unified Framework for Vision-Centric 3D Perception [arXiv 2401] [paper] [code]
- Drive Like a Human: Rethinking Autonomous Driving with Large Language Models [arXiv 2307] [paper] [code]
- LINGO-1: Exploring Natural Language for Autonomous Driving (Vision-Language-Action Models, VLAMs) [Wayve 2309] [blog]
- DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model [arXiv 2310] [paper] [code]
- VIMA: General Robot Manipulation with Multimodal Prompts [arXiv 2210] [paper] [code]
- PaLM-E: An Embodied Multimodal Language Model [arXiv 2303] [paper] [code]
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [arXiv 2307] [CoRL 2023] [paper] [code]
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [arXiv 2307] [paper] [project]
- RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking [arXiv 2309] [paper] [code]
- MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning [arXiv 2401] [paper] [code]
- EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision [arXiv 2311] [paper] [code]
- ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Real Image [arXiv 2310] [paper] [code]
- Vlogger: Make Your Dream A Vlog [arXiv 2401] [paper] [code]
- BootPIG: Bootstrapping Zero-shot Personalized Image Generation Capabilities in Pretrained Diffusion Models [arXiv 2401] [paper] [code]
- CWM: Unifying (Machine) Vision via Counterfactual World Modeling [arXiv 2306] [paper] [code]
- MILE: Model-Based Imitation Learning for Urban Driving [Wayve 2210] [NeurIPS 2022] [paper] [code] [blog]
- GAIA-1: A Generative World Model for Autonomous Driving [Wayve 2310] [arXiv 2309] [paper] [code]
- ADriver-I: A General World Model for Autonomous Driving [arXiv 2311] [paper] [code]
- OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving [arXiv 2311] [paper] [code]
- LWM: World Model on Million-Length Video and Language with RingAttention [arXiv 2402] [paper] [code]
- Sora: Video generation models as world simulators [openai 2402] [technical report] (💥Visual GPT Time?)
- [Instruction Tuning] FLAN: Finetuned Language Models are Zero-Shot Learners [ICLR 2022] [paper] [code]
- DriveLM: Drive on Language [paper] [project]
- MagicDrive: Street View Generation with Diverse 3D Geometry Control [arXiv 2310] [paper] [code]
- Open X-Embodiment: Robotic Learning Datasets and RT-X Models [paper] [project] [blog]
- To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning (LVIS-Instruct4V) [arXiv 2311] [paper] [code] [dataset]
- Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (FLD-5B) [arXiv 2311] [paper] [code] [dataset]
- ShareGPT4V: Improving Large Multi-Modal Models with Better Captions [paper] [code] [dataset]
- Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model [arXiv 2401] [paper] [code]
- VMamba: Visual State Space Model [arXiv 2401] [paper] [code]
- Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences [arXiv 2401] [paper] [code]
- SenseNova (商汤日日新): SenseTime's open platform for large models [url]