[Preprint] VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding

License

Apache-2.0 (LICENSE); BSD-3-Clause (LICENSE_Lavis.md, LICENSE_Minigpt4.md, LICENSE_timechat.md)
gyxxyg/VTG-LLM

If our project helps you, please give us a star ⭐ and cite our paper!


News

  • 10/10/2024: We released TRACE, a more powerful video LLM for temporal grounding.
  • 7/22/2024: Updated the evaluation results for different sampling temperatures.
  • 5/28/2024: NPU checkpoints can be fine-tuned on V100 GPUs.

Overview

We introduce

  • VTG-IT-120K, a high-quality and comprehensive instruction tuning dataset that covers VTG tasks such as moment retrieval (63.2K), dense video captioning (37.2K), video summarization (15.2K), and video highlight detection (3.9K).
  • VTG-LLM, which (1) effectively integrates timestamp knowledge into visual tokens; (2) incorporates absolute-time tokens that specifically handle timestamp knowledge, thereby avoiding concept shifts; and (3) introduces a lightweight, high-performance slot-based token compression method to facilitate the sampling of more video frames.
Overview of VTG-LLM.

Environments

We recommend NPU environments for training, evaluation, and fine-tuning. The environment we use is specified in environment-npu.yaml. In most cases, running the script below is sufficient.

bash install_requirements.sh

If an NPU is not available, a V100 can also be used for training and evaluation, but it cannot be used for fine-tuning checkpoints trained on an NPU. The required packages are listed in requirements-v100.txt.

Model Checkpoints

The model checkpoint (without fine-tuning) is available on Hugging Face:

git lfs install

git clone https://huggingface.co/Yongxin-Guo/VTG-LLM

Data

See DATA.md for details. The data annotations are available on Hugging Face:

git lfs install

git clone https://huggingface.co/datasets/Yongxin-Guo/VTG-IT
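
As an alternative to git clone, both repositories can probably be fetched from Python. The sketch below assumes the huggingface_hub package is installed separately (it is not mentioned in this README); the helper names repo_url and download are illustrative and not part of this repository.

```python
HF_BASE = "https://huggingface.co"


def repo_url(repo_id, repo_type="model"):
    """Build the clone URL used above; dataset repos live under /datasets/."""
    prefix = "datasets/" if repo_type == "dataset" else ""
    return f"{HF_BASE}/{prefix}{repo_id}"


def download(repo_id, repo_type="model", local_dir=None):
    """Download a full repo snapshot and return the local path."""
    # Imported lazily so repo_url works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, repo_type=repo_type, local_dir=local_dir)


# Example calls (require network access):
# download("Yongxin-Guo/VTG-LLM")
# download("Yongxin-Guo/VTG-IT", repo_type="dataset")
```

Passing local_dir places the files in a directory of your choice, which is convenient when the paths are later written into the training and evaluation configs.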

Requirements

Please download the following model checkpoints:

Script

Tuning

Set the checkpoint and dataset paths in pretrain-slot-sample-fmt-96.yaml, and set the BERT checkpoint paths in blip2.py and vtgllm.py.

torchrun --nproc_per_node=16 train.py --cfg-path train_configs/videollama/pretrain-slot-sample-fmt-96.yaml

Evaluation

Set the checkpoint and dataset paths in videollama-slot-96.yaml.

Select the downstream task in eval.sh.

bash eval.sh

Results

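Here t is the sampling temperature used at evaluation.

| YouCook2 | CIDEr | METEOR | SODA_c | F1 |
| t=1.0 (paper) | 5.0 | 1.9 | 1.5 | 17.5 |
| t=0.1 | 5.4 | 1.8 | 1.6 | 18.4 |

| Charades-STA | R@1 (IoU=0.3) | R@1 (IoU=0.5) | R@1 (IoU=0.7) |
| t=1.0 (paper) | 52.0 | 33.8 | 15.7 |
| t=0.1 | 53.9 | 36.3 | 16.6 |

| QVHighlights | mAP | Hit@1 |
| t=1.0 (paper) | 16.5 | 33.5 |
| t=0.1 | 16.2 | 30.7 |

| ActivityNet | CIDEr | METEOR | SODA_c | F1 |
| t=1.0 (paper) | 18.2 | 5.7 | 4.7 | 34.0 |
| t=0.1 | 20.7 | 5.9 | 5.1 | 34.8 |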
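
The Charades-STA thresholds (0.3, 0.5, 0.7) are conventionally temporal IoU cutoffs for moment retrieval. As a minimal sketch of how such a recall figure is typically computed (this is not the repository's evaluation code, and the function names are illustrative):

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) segments given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds, gts, thresholds=(0.3, 0.5, 0.7)):
    """Percentage of queries whose top-1 prediction reaches each IoU threshold."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    if not ious:
        return {t: 0.0 for t in thresholds}
    return {t: 100.0 * sum(iou >= t for iou in ious) / len(ious) for t in thresholds}


# Toy data: one exact match, one partial overlap (IoU = 1/3), one miss.
preds = [(0.0, 10.0), (5.0, 15.0), (20.0, 30.0)]
gts = [(0.0, 10.0), (10.0, 20.0), (40.0, 50.0)]
scores = recall_at_iou(preds, gts)
```

On this toy data two of the three predictions clear the 0.3 threshold and only the exact match clears 0.5 and 0.7, so recall drops as the threshold tightens, which mirrors the pattern in the Charades-STA rows.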

Demo

# cat_and_chicken.mp4

# Describe this video

A cute little kitten is sleeping on a couch. A little chicken is sitting on the cats chest and looking at the camera. The cat is purring and the chicken is moving its head.

# Please locate a series of events in the video, output the start and end timestamps of each event, and describe each event in sentences.

0000.0 - 0010.0 seconds, A cute kitten is sleeping on a couch. 0010.0 - 0020.0 seconds, A yellow bird lands on the couch and gently touches the kitten's head. 0020.0 - 0030.0 seconds, The bird picks up the kitten and starts to play with it. 0030.0 - 0040.0 seconds, The kitten tries to push the bird away, but the bird continues to play with it. 0040.0 - 0050.0 seconds, The kitten falls asleep on the couch.
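
The dense-captioning response above follows a regular "start - end seconds, caption" pattern with fixed-width timestamps. A minimal sketch for turning such a response into structured events, assuming every response follows the pattern shown in this demo (the helper names and the regular expression are illustrative, not taken from the repository):

```python
import re

_NUM = r"\d+(?:\.\d+)?"
# One event: "<start> - <end> seconds, <caption>", where the caption runs
# until the next "<start> - <end> seconds," header or the end of the text.
SEGMENT_RE = re.compile(
    rf"({_NUM})\s*-\s*({_NUM})\s*seconds,\s*(.*?)"
    rf"(?=\s*{_NUM}\s*-\s*{_NUM}\s*seconds,|$)",
    re.DOTALL,
)


def parse_events(text):
    """Return a list of (start_sec, end_sec, caption) tuples."""
    return [(float(s), float(e), c.strip()) for s, e, c in SEGMENT_RE.findall(text)]


def format_timestamp(seconds):
    """Format seconds in the fixed-width style seen in the demo, e.g. 0010.0."""
    return f"{seconds:06.1f}"


sample = (
    "0000.0 - 0010.0 seconds, A cute kitten is sleeping on a couch. "
    "0010.0 - 0020.0 seconds, A yellow bird lands on the couch."
)
events = parse_events(sample)
```

Captions that themselves contain text of the form "12.0 - 15.0 seconds," would be split incorrectly, so treat this as a starting point rather than a robust parser.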

Gradio Demo

First, change the video and model checkpoint paths to your own.

python gradio_demo.py

Recommended GPUs

  • Instruction-tuning: 16xATN 910B
  • Inference: 1xV100

Acknowledgement

We are grateful for the following awesome projects:

Bibliography

If you find this repository helpful for your project, please consider citing:

@article{guo2024vtg,
  title={VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding},
  author={Guo, Yongxin and Liu, Jingyu and Li, Mingda and Tang, Xiaoying and Chen, Xi and Zhao, Bo},
  journal={arXiv preprint arXiv:2405.13382},
  year={2024}
}