An official implementation of "UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation"
video localization caption alignment segmentation coin multimodality joint multimodal-sentiment-analysis pretrain pretraining msrvtt video-text-retrieval video-text video-language youcookii retrieval-task caption-task
Updated Jul 25, 2024 - Python