Haotian-Zhang

👋 Welcome

SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models

Python · 112 stars · 8 forks · Updated Aug 29, 2024

DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception

Python · 103 stars · 1 fork · Updated Aug 23, 2024

[ECCV 2024] API code for T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy

Python · 2,147 stars · 125 forks · Updated Aug 29, 2024

The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use th…

Jupyter Notebook · 10,680 stars · 857 forks · Updated Aug 21, 2024

Scaling Diffusion Transformers with Mixture of Experts

Python · 178 stars · 7 forks · Updated Sep 9, 2024

Python · 332 stars · 11 forks · Updated Sep 2, 2024

Visualize expert firing frequencies across sentences in the Mixtral MoE model

Python · 17 stars · 2 forks · Updated Dec 22, 2023

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Python · 715 stars · 48 forks · Updated Sep 13, 2024

DataComp for Language Models

HTML · 1,103 stars · 96 forks · Updated Sep 5, 2024

Official PyTorch implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States

Python · 970 stars · 54 forks · Updated Jul 14, 2024

VILA: a multi-image visual language model with training, inference, and evaluation recipes, deployable from cloud to edge (Jetson Orin and laptops)

Python · 1,781 stars · 140 forks · Updated Sep 10, 2024

Cambrian-1 is a family of multimodal LLMs with a vision-centric design.

Python · 1,683 stars · 110 forks · Updated Sep 10, 2024

[ECCV 2024] Official Repository for DiffiT: Diffusion Vision Transformers for Image Generation

431 stars · 13 forks · Updated Jul 1, 2024

Jupyter Notebook · 962 stars · 117 forks · Updated Apr 27, 2024

🔥 Stable, simple, state-of-the-art VQVAE toolkit and cookbook

Python · 34 stars · Updated Jun 23, 2024

Code and data for Grounded 3D-LLM with Referent Tokens

Python · 72 stars · Updated Jul 1, 2024

LLM101n: Let's build a Storyteller

28,230 stars · 1,538 forks · Updated Aug 1, 2024

[CVPR 2024] "LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning"; an interactive Large Language 3D Assistant.

Python · 221 stars · 9 forks · Updated Jul 17, 2024

Open-source evaluation toolkit for large vision-language models (LVLMs), supporting ~100 VLMs and 40 benchmarks

Python · 1,012 stars · 142 forks · Updated Sep 14, 2024

Code for 3D-LLM: Injecting the 3D World into Large Language Models

Python · 898 stars · 55 forks · Updated Jun 6, 2024

MiniCPM3-4B: An edge-side LLM that surpasses GPT-3.5-Turbo.

Python · 6,734 stars · 429 forks · Updated Sep 12, 2024

Implementation of Infini-Transformer in PyTorch

Python · 100 stars · 1 fork · Updated Aug 13, 2024

Python · 106 stars · 6 forks · Updated Jun 6, 2024

Lumina-T2X is a unified framework for Text to Any Modality Generation

Python · 2,020 stars · 85 forks · Updated Aug 6, 2024

Python · 2,385 stars · 165 forks · Updated Sep 14, 2024

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

3,406 stars · 134 forks · Updated Aug 10, 2024

Multimodal Models in the Real World

Jupyter Notebook · 372 stars · 17 forks · Updated Jul 12, 2024

Vector (and Scalar) Quantization, in PyTorch

Python · 2,395 stars · 196 forks · Updated Sep 4, 2024

[CVPR 2024] 🎬💭 Chat with over 10K frames of video!

Python · 488 stars · 39 forks · Updated Sep 6, 2024

Official repo for paper "MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions"

Python · 349 stars · 9 forks · Updated Sep 2, 2024