GitHub - gyxxyg/VTG-LLM: [Preprint] VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding

VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding

If our project helps you, please give us a star ⭐ and cite our paper!

News

10/10/2024, We released TRACE, a more powerful video LLM for temporal grounding.
7/22/2024, Updated evaluation results for different sampling temperatures.
5/28/2024, NPU checkpoints can be fine-tuned on V100 GPUs.

Overview

We introduce

VTG-IT-120K, a high-quality and comprehensive instruction tuning dataset that covers VTG tasks such as moment retrieval (63.2K), dense video captioning (37.2K), video summarization (15.2K), and video highlight detection (3.9K).
VTG-LLM, which (1) effectively integrates timestamp knowledge into visual tokens; (2) incorporates absolute-time tokens that specifically handle timestamp knowledge, thereby avoiding concept shifts; and (3) introduces a lightweight, high-performance slot-based token compression method to facilitate the sampling of more video frames.

Overview of VTG-LLM.

Environments

We recommend NPU environments for training, evaluation, and fine-tuning. The environment we use is specified in environment-npu.yaml. In most cases, running the script below is sufficient.

bash install_requirements-npu.sh

If an NPU is not available, a V100 can be used for training and evaluation, but not for fine-tuning checkpoints trained on an NPU. The required packages are listed in requirements-v100.txt.

Model Checkpoints

The model checkpoint (without fine-tuning) is available on Hugging Face:

git lfs install

git clone https://huggingface.co/Yongxin-Guo/VTG-LLM

Data

See DATA.md for details. The data annotations are available on Hugging Face:

git lfs install

git clone https://huggingface.co/datasets/Yongxin-Guo/VTG-IT
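
As an alternative to `git clone`, the same two repositories can probably be fetched from Python. The sketch below assumes the `huggingface_hub` package is installed; the helper name `download`, the `REPOS` table, and the local directory names are illustrative and not part of this repository.

```python
# Hypothetical helper that mirrors the two `git clone` commands above.
# Assumes `pip install huggingface_hub`; names and paths are illustrative.
REPOS = {
    "model": ("Yongxin-Guo/VTG-LLM", "model"),
    "data": ("Yongxin-Guo/VTG-IT", "dataset"),
}


def download(name, local_dir):
    """Download one of the repositories listed in REPOS into local_dir."""
    # Imported lazily so this file can be loaded without huggingface_hub.
    from huggingface_hub import snapshot_download

    repo_id, repo_type = REPOS[name]
    return snapshot_download(
        repo_id=repo_id, repo_type=repo_type, local_dir=local_dir
    )


# Example calls (require network access):
# download("model", "ckpt/VTG-LLM")
# download("data", "data/VTG-IT")
```

The dataset repository needs `repo_type="dataset"`, which corresponds to the `datasets/` segment in the clone URL.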

Requirements

Please download the following model checkpoints:

Script

Tuning

Set the checkpoint and dataset paths in pretrain-slot-sample-fmt-96.yaml. Set the BERT checkpoint paths in blip2.py and vtgllm.py.

torchrun --nproc_per_node=16 train.py --cfg-path train_configs/videollama/pretrain-slot-sample-fmt-96.yaml

Evaluation

Set the checkpoint and dataset paths in videollama-slot-96.yaml.

Select the downstream task in eval.sh.

bash eval.sh

Results

YouCook2	CIDEr	METEOR	SODA_c	F1
t=1.0 (paper)	5.0	1.9	1.5	17.5
t=0.1	5.4	1.8	1.6	18.4

Charades-STA	R@1 (IoU=0.3)	R@1 (IoU=0.5)	R@1 (IoU=0.7)
t=1.0 (paper)	52.0	33.8	15.7
t=0.1	53.9	36.3	16.6
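
For readers unfamiliar with the Charades-STA columns, the following is a minimal sketch of how recall at temporal IoU thresholds is commonly computed for moment retrieval. It is not taken from this repository's metrics code; the function names, the use of `>=` at the threshold, and the toy intervals are assumptions for illustration.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) intervals given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds, gts, threshold):
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)


# Toy data (made up): one top-1 prediction and one ground truth per query.
preds = [(0.0, 10.0), (5.0, 15.0)]
gts = [(0.0, 8.0), (10.0, 20.0)]

for thr in (0.3, 0.5, 0.7):
    print(f"R@1 (IoU={thr}): {recall_at_1(preds, gts, thr):.1f}")
```

On the toy data the first prediction has IoU 0.8 and the second about 0.33, so recall drops from 100.0 at the 0.3 threshold to 50.0 at 0.5 and 0.7.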

QVHighlights	mAP	Hit@1
t=1.0 (paper)	16.5	33.5
t=0.1	16.2	30.7

ActivityNet	CIDEr	METEOR	SODA_c	F1
t=1.0 (paper)	18.2	5.7	4.7	34.0
t=0.1	20.7	5.9	5.1	34.8
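
The rows labelled t=1.0 and t=0.1 differ in sampling temperature. As a rough illustration of what that setting does, the sketch below applies temperature scaling to a softmax over made-up logits. It is a generic example, not this repository's decoding code, and the function name is hypothetical.

```python
import math


def softmax_with_temperature(logits, t):
    """Softmax over logits divided by temperature t (t > 0)."""
    scaled = [x / t for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up next-token scores
for t in (1.0, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 4) for p in probs])
```

A lower temperature concentrates probability on the highest-scoring token, so sampling at t=0.1 behaves almost like greedy decoding.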

Demo

# cat_and_chicken.mp4

# Describe this video

A cute little kitten is sleeping on a couch. A little chicken is sitting on the cats chest and looking at the camera. The cat is purring and the chicken is moving its head.

# Please locate a series of events in the video, output the start and end timestamps of each event, and describe each event in sentences.

0000.0 - 0010.0 seconds, A cute kitten is sleeping on a couch. 0010.0 - 0020.0 seconds, A yellow bird lands on the couch and gently touches the kitten's head. 0020.0 - 0030.0 seconds, The bird picks up the kitten and starts to play with it. 0030.0 - 0040.0 seconds, The kitten tries to push the bird away, but the bird continues to play with it. 0040.0 - 0050.0 seconds, The kitten falls asleep on the couch.
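
The dense-captioning output above follows a regular pattern: a zero-padded start and end time, the word "seconds", then a caption. The sketch below shows one way such a string could be split into structured events. The regular expression and both function names are assumptions based only on the demo output, not on the repository's own post-processing.

```python
import re

# Matches the "SSSS.S - EEEE.E seconds, " prefix seen in the demo output.
EVENT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*seconds,\s*")


def parse_events(text):
    """Split model output into (start, end, caption) tuples."""
    matches = list(EVENT_RE.finditer(text))
    events = []
    for i, m in enumerate(matches):
        # The caption runs until the next timestamp prefix or end of text.
        cap_end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        caption = text[m.end():cap_end].strip()
        events.append((float(m.group(1)), float(m.group(2)), caption))
    return events


def format_timestamp(seconds):
    """Render seconds in the six-character style of the demo, e.g. 0010.0."""
    return f"{seconds:06.1f}"


sample = (
    "0000.0 - 0010.0 seconds, A cute kitten is sleeping on a couch. "
    "0010.0 - 0020.0 seconds, A yellow bird lands on the couch."
)
for start, end, caption in parse_events(sample):
    print(format_timestamp(start), format_timestamp(end), caption)
```

Parsed intervals in this form can be compared directly against ground-truth segments when scoring.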

Gradio Demo

First, set the video and model checkpoint paths to your local paths.

python gradio_demo.py

Recommended GPUs

Instruction-tuning: 16xATN 910B
Inference: 1xV100

Acknowledgement

We are grateful for the following awesome projects:

Bibliography

If you find this repository helpful for your project, please consider citing:

@article{guo2024vtg,
  title={VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding},
  author={Guo, Yongxin and Liu, Jingyu and Li, Mingda and Tang, Xiaoying and Chen, Xi and Zhao, Bo},
  journal={arXiv preprint arXiv:2405.13382},
  year={2024}
}
