Stars
✨✨Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM
🔥🔥First-ever hour-scale video understanding models
[NeurIPS '24 D&B] Official Dataloader and Evaluation Scripts for LongVideoBench.
🔥🔥MLVU: Multi-task Long Video Understanding Benchmark
Qwen2-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
This is the official code of VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding (ECCV 2024)
Accelerating the development of large multimodal models (LMMs) with lmms-eval
Dense Passage Retriever is a set of tools and models for open-domain Q&A tasks.
This is a PyTorch implementation of 3DGCTR, proposed in our paper “Rethinking 3D Dense Caption and Visual Grounding in A Unified Framework through Prompt-based Localization”.
Repository for Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions, ACL 2023
An open-source multimodal large language model that can hear and talk while thinking, featuring real-time end-to-end speech input and streaming audio output for conversation.
🌋👵🏻 Yo'LLaVA: Your Personalized Language and Vision Assistant
An Adversarial Training Framework for Adversarial Robustness in Deep Learning Models
Long Context Transfer from Language to Vision
Contriever: Unsupervised Dense Information Retrieval with Contrastive Learning
GRiT: A Generative Region-to-text Transformer for Object Understanding (https://arxiv.org/abs/2212.00280)
[NeurIPS 2022 Spotlight] Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations
A collection of strong multimodal models for building multimodal AGI agents
A way to download the ActivityNet dataset
The official repo of Qwen-VL (通义千问-VL) chat & pretrained large vision language model proposed by Alibaba Cloud.
A Multimodal Native Agent Framework for Smart Hardware and More
Transform Video as a Document with ChatGPT, CLIP, BLIP2, GRIT, Whisper, LangChain.
[CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
PyTorch3D is FAIR's library of reusable components for deep learning with 3D data