mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections. (EMNLP 2022)
https://arxiv.org/abs/2205.12005
We present mPLUG, a new vision-language foundation model for both cross-modal understanding and generation. Most existing pre-trained models suffer from low computational efficiency and from the linguistic signal being overwhelmed by long visual sequences during cross-modal alignment. To address both problems, mPLUG introduces an effective and efficient vision-language architecture with novel cross-modal skip-connections. mPLUG achieves state-of-the-art results on a wide range of vision-language downstream tasks, including image captioning, image-text retrieval, visual grounding, and visual question answering.
- 2023.5.08: Moved from the AliceMind repo for further updates.
- 2022.8.28: Released mPLUG downstream tasks!
- Pre-trained models
For the VQA and image captioning tasks, we continue pre-training mplug.en.large on 4M image-text pairs to obtain mplug.en.large.v2.
Model | Visual Backbone | Text Enc Layers | Fusion Layers | Text Dec Layers | #params | Download |
---|---|---|---|---|---|---|
mplug.en.base | vit-b-16 | 6 | 6 | 12 | 350M | mplug.en.base |
mplug.en.large | vit-l-14 | 6 | 6 | 12 | 600M | mplug.en.large |
mplug.en.large.v2 | vit-l-14 | 6 | 6 | 12 | 600M | mplug.en.large.v2 |
mplug.en.huge | vit-l-14 | 24 | 6 | 12 | 1.1B | coming soon |
- Pre-train Datasets
Type | COCO | VG | SBU | CC3M | CC13M |
---|---|---|---|---|---|
image | 113K | 100K | 860K | 3M | 10M |
text | 567K | 769K | 860K | 3M | 10M |
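As a quick sanity check on the overall corpus size, the per-dataset counts in the table above can be summed. This is only arithmetic on the rounded figures listed there, so the totals are approximate.

```python
# Approximate pre-training corpus sizes, copied from the table above.
IMAGES = {"COCO": 113_000, "VG": 100_000, "SBU": 860_000,
          "CC3M": 3_000_000, "CC13M": 10_000_000}
TEXTS = {"COCO": 567_000, "VG": 769_000, "SBU": 860_000,
         "CC3M": 3_000_000, "CC13M": 10_000_000}

total_images = sum(IMAGES.values())
total_texts = sum(TEXTS.values())
print(f"images: {total_images / 1e6:.2f}M, texts: {total_texts / 1e6:.2f}M")
```

The totals come to roughly 14M images and 15M texts.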
- Image-text
Task | VQA | Image Captioning | Retrieval | Retrieval | Referring Expression Comprehension | Referring Expression Comprehension | Referring Expression Comprehension | Visual Entailment | Visual Entailment |
---|---|---|---|---|---|---|---|---|---|
Dataset | VQA v2 | COCO | MSCOCO | Flickr30K | RefCOCO | RefCOCO+ | RefCOCOg | SNLI-VE | NLVR2 |
Split | test-dev/test-std | Karpathy test (CE/CIDEr) | 5k test (TR/IR) | 1k test (TR/IR) | val/test-a/test-b | val/test-a/test-b | val-u/test-u | val/test | dev/test-P |
Metric | Acc. | CIDEr | R@1 | R@1 | Acc. | Acc. | Acc. | Acc. | Acc. |
mPLUGBase | 79.79/79.98 | 137.5/150.4 | -/- | -/- | -/-/- | -/-/- | -/- | -/- | -/- |
mPLUGLarge | 81.27/81.26 | 141.0/155.1 | 82.8/65.8 | 97.6/88.4 | 92.40/94.51/88.42 | 86.02/90.17/78.17 | 85.88/86.42 | 89.45/89.29 | 84.58/84.95 |
mPLUGHuge | 82.27/82.41 | 142.3/158.7 | -/- | -/- | -/-/- | -/-/- | -/- | -/- | -/- |
- Video-text
Task | Video Retrieval | Video QA | Video Captioning | |
---|---|---|---|---|
Dataset | MSRVTT | MSRVTT-QA | MSVD-QA | VATEX |
Split | test | test | test | test(CE) |
Metric | R@1 | Acc. | Acc. | CIDEr |
mPLUG | 38.1 | 21.1 | 37.2 | 42.0 |
- PyTorch version >= 1.11.0
- Install other libraries via
pip install -r requirements.txt
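Before launching any script, it can help to confirm that the installed PyTorch meets the minimum version stated above. The following is a minimal sketch; the helper names `parse_version` and `torch_is_recent_enough` are illustrative and not part of this repository.

```python
import re

MIN_TORCH = (1, 11, 0)  # minimum version stated in the requirements above


def parse_version(version):
    """Turn a string such as '1.13.1+cu117' into (1, 13, 1), ignoring build suffixes."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part or 0) for part in match.groups())


def torch_is_recent_enough(version, minimum=MIN_TORCH):
    """Return True if `version` is at least `minimum`."""
    return parse_version(version) >= minimum


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        ok = torch_is_recent_enough(torch.__version__)
        print(f"torch {torch.__version__}: {'OK' if ok else 'too old'}")
```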
Coming soon.
Download json files of downstream tasks
- Download the VQA v2 and Visual Genome datasets from their original websites (VQA 2.0).
- Download and extract the provided dataset json files.
- In configs/vqa_mplug_base.yaml, set the paths for the json files and the image paths.
- Finetune the pre-trained mplug_base or large model using 8 A100 GPUs:
sh scripts/vqa_mplug_base.sh
sh scripts/vqa_mplug_large.sh
- Evaluate the result using the official evaluation server.
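The official VQA evaluation server accepts a JSON array of records with a `question_id` and an `answer` field. The sketch below shows one way to produce such a file, assuming the predictions are available as a mapping from question id to answer string. How the fine-tuning script in this repository stores its predictions is not shown here, so the function names and the example ids are hypothetical.

```python
import json


def format_vqa_results(predictions):
    """Convert {question_id: answer} into a list of result records, sorted by id."""
    return [
        {"question_id": int(qid), "answer": str(answer)}
        for qid, answer in sorted(predictions.items(), key=lambda item: int(item[0]))
    ]


def save_vqa_results(predictions, out_path):
    """Write the formatted records to `out_path` as a JSON array."""
    with open(out_path, "w") as f:
        json.dump(format_vqa_results(predictions), f)


# Hypothetical predictions keyed by question id (ids may arrive as str or int).
example = {"262148002": "yes", 262148000: "down"}
print(json.dumps(format_vqa_results(example)))
```

Casting the ids to `int` matters because JSON object keys are always strings, while the results records are expected to carry integer ids.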
- Download the COCO Caption dataset from the original website.
- Download and extract the provided dataset json files.
- Download the language evaluation tool (language_evalution).
- In configs/caption_mplug_base.yaml, set the paths for the json files and the image paths.
- Finetune the pre-trained mplug_base or large model using 8 A100 GPUs:
sh scripts/caption_mplug_base.sh
sh scripts/caption_mplug_large.sh
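Each task asks you to set json and image paths in a YAML config before fine-tuning. A small pre-flight check can catch a mistyped path before an 8-GPU job starts. This is a sketch only: the key names and locations below are hypothetical, so copy the real ones from the config file you edited.

```python
from pathlib import Path


def find_missing_paths(paths):
    """Return the entries of `paths` whose file or directory does not exist."""
    return {name: p for name, p in paths.items() if not Path(p).exists()}


if __name__ == "__main__":
    # Hypothetical key names and locations; replace with the values from your config.
    configured = {
        "train_file": "data/coco_caption/train.json",
        "image_root": "data/coco/images",
    }
    for name, path in find_missing_paths(configured).items():
        print(f"missing {name}: {path}")
```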
- Download MSCOCO or Flickr30k datasets from the original websites.
- Download and extract the provided dataset json files.
- In configs/retrieval_flickr30k_mplug_large.yaml or configs/retrieval_coco_mplug_large.yaml, set the paths for the json files and the image path.
- Finetune the pre-trained checkpoint using 8 A100 GPUs:
sh scripts/retrieval_flickr30k_mplug_large.sh
sh scripts/retrieval_coco_mplug_large.sh
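The retrieval results are reported as R@1. As an illustration of the metric, here is a generic recall@k over a query-by-candidate similarity matrix. It is a sketch in plain Python and not the evaluation code used by the scripts above. Ground truth is given as a set per query, so a query may have several correct candidates.

```python
def recall_at_k(similarity, ground_truth, k=1):
    """Fraction of queries whose top-k candidates contain a correct match.

    similarity[i][j] scores query i against candidate j (higher is better);
    ground_truth[i] is the set of candidate indices that are correct for query i.
    """
    hits = 0
    for scores, correct in zip(similarity, ground_truth):
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if correct & set(ranked[:k]):
            hits += 1
    return hits / len(similarity)


# Toy example: 3 queries scored against 3 candidates.
sim = [
    [0.9, 0.1, 0.3],  # query 0 ranks candidate 0 first (correct)
    [0.2, 0.4, 0.8],  # query 1 ranks candidate 2 first (candidate 1 is correct)
    [0.1, 0.2, 0.7],  # query 2 ranks candidate 2 first (correct)
]
gt = [{0}, {1}, {2}]
print(recall_at_k(sim, gt, k=1))
print(recall_at_k(sim, gt, k=2))
```

In the toy data, two of three queries are correct at rank 1, and all three are correct within the top 2.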
- Download the RefCOCO datasets from the original websites.
- Download and extract the provided dataset json files.
- In configs/grounding_mplug_large.yaml, set the paths for the json files and the image path. Data preparation can follow TransVG.
- Finetune the pre-trained checkpoint using 8 A100 GPUs:
sh scripts/grounding_mplug_base.sh
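Referring expression comprehension accuracy is commonly computed by counting a predicted box as correct when its IoU with the ground-truth box is at least 0.5. The sketch below assumes that convention and boxes in `(x1, y1, x2, y2)` corner format; neither the threshold nor the box format is confirmed from this repository.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(predicted, target, threshold=0.5):
    """Share of predictions whose IoU with the target box reaches the threshold."""
    correct = sum(box_iou(p, t) >= threshold for p, t in zip(predicted, target))
    return correct / len(target)


# Toy example: the first prediction matches exactly, the second misses entirely.
preds = [(0, 0, 2, 2), (0, 0, 1, 1)]
targets = [(0, 0, 2, 2), (5, 5, 6, 6)]
print(grounding_accuracy(preds, targets))
```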
- Download the MSRVTT dataset from the original website.
- In configs/retrieval_msrvtt_mplug_large.yaml, set the paths for the json files and the video paths.
- To perform zero-shot evaluation, run:
sh scripts/retrieval_msrvtt_mplug_large.sh
- Download the MSRVTT-QA dataset from the original website.
- In configs/videoqa_msrvtt_mplug_base.yaml, set the paths for the json files and the video paths.
- To perform zero-shot evaluation, run:
sh scripts/videoqa_msrvtt_mplug_base.sh
- Download the VATEX dataset from the original website.
- In configs/videocap_vatex_mplug_large.yaml, set the paths for the json files and the video paths.
- To perform zero-shot evaluation, run:
sh scripts/videocap_vatex_mplug_large.sh
If you use our work, please cite:
@article{li2022mplug,
title={mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections},
author={Li, Chenliang and Xu, Haiyang and Tian, Junfeng and Wang, Wei and Yan, Ming and Bi, Bin and Ye, Jiabo and Chen, Hehong and Xu, Guohai and Cao, Zheng and others},
journal={arXiv preprint arXiv:2205.12005},
year={2022}
}
The implementation of mPLUG relies on resources from ALBEF, BLIP, and timm. We thank the original authors for open-sourcing their work.