by Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, Lucas Beyer
https://arxiv.org/abs/2111.07991
```
@article{zhai2022lit,
  title={LiT: Zero-Shot Transfer with Locked-image Text Tuning},
  author={Zhai, Xiaohua and Wang, Xiao and Mustafa, Basil and Steiner, Andreas and Keysers, Daniel and Kolesnikov, Alexander and Beyer, Lucas},
  journal={CVPR},
  year={2022}
}
```
Model card: https://github.com/google-research/vision_transformer/blob/main/model_cards/lit.md
Colabs:
- https://colab.research.google.com/github/google-research/vision_transformer/blob/main/lit.ipynb
- https://colab.research.google.com/github/google-research/big_vision/blob/main/big_vision/configs/proj/image_text/lit.ipynb
Model | Download link | ImageNet 0-shot acc. (%) | MS-COCO I→T | MS-COCO T→I | Config arg |
---|---|---|---|---|---|
mixed_L16L | `gs://vit_models/lit/LiT-L16L.npz` | 75.7 | 48.5 | 31.2 | txt=bert_large,img=L/16 |
mixed_B16B | `gs://vit_models/lit/LiT-B16B.npz` | 72.1 | 49.4 | 31.1 | txt=bert_base,img=B/16,img_head |
mixed_B16B_2 | `gs://vit_models/lit/LiT-B16B_2.npz` | 73.9 | 51.5 | 31.8 | txt=bert_base,img=B/16 |
coco_B16B | link | 20.7 | 47.2 | 32.1 | txt=bert_base,img=B/16 |
The first three rows are the best available models trained on open-source data, originally published in the [google-research/vision_transformer](https://github.com/google-research/vision_transformer) repository.
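
The ImageNet 0-shot column is zero-shot classification accuracy: each class name is turned into a text prompt, embedded by the text tower, and every image is assigned to the class whose prompt embedding is most similar. A minimal sketch of that scoring step, assuming precomputed tower outputs (the names `zimg`, `ztxt` and the temperature value here are illustrative, not the codebase's API):

```python
import jax
import jax.numpy as jnp

def zero_shot_probs(zimg, ztxt, temperature=0.01):
    """Illustrative LiT-style zero-shot scoring.

    zimg: [n_images, d] embeddings from the (locked) image tower.
    ztxt: [n_classes, d] embeddings of class prompts, e.g. "a photo of a {c}".
    temperature: stand-in for the learned temperature; the value is made up.
    """
    # L2-normalize both towers' outputs, as in contrastive training.
    zimg = zimg / jnp.linalg.norm(zimg, axis=-1, keepdims=True)
    ztxt = ztxt / jnp.linalg.norm(ztxt, axis=-1, keepdims=True)
    # Scaled cosine similarities; softmax over classes gives predictions.
    logits = (zimg @ ztxt.T) / temperature
    return jax.nn.softmax(logits, axis=-1)
```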
These models were re-evaluated with this codebase using the following commands:

```
big_vision.tools.eval_only --config big_vision/configs/proj/image_text/lit_coco.py:txt=bert_base,img=B/16,img_head,init=gs://vit_models/lit/LiT-B16B.npz
big_vision.tools.eval_only --config big_vision/configs/proj/image_text/lit_coco.py:txt=bert_base,img=B/16,init=gs://vit_models/lit/LiT-B16B_2.npz
big_vision.tools.eval_only --config big_vision/configs/proj/image_text/lit_coco.py:txt=bert_large,img=L/16,init=gs://vit_models/lit/LiT-L16L.npz
```
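
In these commands, everything after the colon is the config's argument string: comma-separated key=value options, plus bare flags such as `img_head`, which select the text tower, the image tower, and the init checkpoint. As a rough illustration of how such a string decomposes (this simplified parser is ours; big_vision's actual one also handles types and defaults):

```python
def parse_config_arg(arg: str) -> dict:
    """Simplified illustration of splitting a config argument string."""
    opts = {}
    for part in arg.split(","):
        if "=" in part:
            key, value = part.split("=", 1)
            opts[key] = value
        else:
            opts[part] = True  # bare flag, e.g. "img_head"
    return opts

print(parse_config_arg("txt=bert_base,img=B/16,img_head"))
# -> {'txt': 'bert_base', 'img': 'B/16', 'img_head': True}
```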
Unfortunately, the public multi-modal datasets CC12M and YFCC100M are not yet available in tfds, so these models cannot be reproduced with this codebase. For this reason we provide the much weaker model coco_B16B in the fourth row, which was trained on the small tfds dataset coco_captions and can be used to verify the correctness of the codebase (workdir).
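
To train coco_B16B yourself, an invocation along the following lines should work; note that the trainer module and the `$GS_BUCKET_NAME` workdir placeholder are our assumptions based on the usual big_vision layout, not taken from this document:

```
big_vision.trainers.proj.image_text.contrastive \
    --config big_vision/configs/proj/image_text/lit_coco.py:txt=bert_base,img=B/16 \
    --workdir gs://$GS_BUCKET_NAME/big_vision/`date '+%m-%d_%H%M'`
```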
- 2022-08-18: Added the LiT-B16B_2 model, which was trained for 60k steps (LiT-B16B: 30k) without a linear head on the image side (LiT-B16B: linear head with 768 output dimensions) and achieves better performance.