Image/text models

LiT: Zero-Shot Transfer with Locked-image text Tuning

by Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, Lucas Beyer

https://arxiv.org/abs/2111.07991

```
@article{zhai2022lit,
  title={LiT: Zero-Shot Transfer with Locked-image Text Tuning},
  author={Zhai, Xiaohua and Wang, Xiao and Mustafa, Basil and Steiner, Andreas and Keysers, Daniel and Kolesnikov, Alexander and Beyer, Lucas},
  journal={CVPR},
  year={2022}
}
```

Model card: https://github.com/google-research/vision_transformer/blob/main/model_cards/lit.md
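The core idea of LiT is to keep a pre-trained image tower frozen ("locked") and contrastively tune only the text tower against it. The loss computation at the heart of that setup can be sketched as follows; this is an illustrative NumPy sketch under assumed names, not this repository's implementation:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    In LiT-style training, `img_emb` would come from the frozen (locked)
    pre-trained image tower, while the text tower producing `txt_emb` is
    the only part receiving gradients. Names here are hypothetical.
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature  # (B, B) cosine similarities
    labels = np.arange(len(logits))     # matching pairs sit on the diagonal

    def xent(lg):
        # Cross-entropy of each row against the diagonal label.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the image->text and text->image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Perfectly aligned image/text pairs drive this loss toward zero, while mismatched pairs inflate it, which is what pushes the text tower toward the locked image embedding space.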

Colabs:

Results

| Model | Download link | ImageNet 0-shot | MS-COCO I→T | MS-COCO T→I | Config arg |
| --- | --- | --- | --- | --- | --- |
| mixed_L16L | link | 75.7 | 48.5 | 31.2 | `txt=bert_large,img=L/16` |
| mixed_B16B | link | 72.1 | 49.4 | 31.1 | `txt=bert_base,img=B/16,img_head` |
| mixed_B16B_2 | link | 73.9 | 51.5 | 31.8 | `txt=bert_base,img=B/16` |
| coco_B16B | link | 20.7 | 47.2 | 32.1 | `txt=bert_base,img=B/16` |
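The metric columns can be read as follows: "ImageNet 0-shot" is zero-shot classification accuracy (each image is assigned to the class whose prompt embedding is most similar), and the MS-COCO columns are image→text and text→image retrieval recall@1. A minimal NumPy sketch of both metrics (illustrative only; the function names are made up and this is not the codebase's evaluator):

```python
import numpy as np

def zero_shot_classify(img_emb, class_txt_emb):
    """Assign each image to the class with the most similar prompt embedding.

    img_emb:       (N, D) image embeddings from the image tower
    class_txt_emb: (C, D) text embeddings of class-name prompts
    Returns an (N,) array of predicted class indices.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt = class_txt_emb / np.linalg.norm(class_txt_emb, axis=-1, keepdims=True)
    return (img @ txt.T).argmax(axis=-1)

def retrieval_recall_at_1(query_emb, gallery_emb):
    """Fraction of queries whose nearest gallery item is their own pair.

    For I→T the queries are image embeddings and the gallery holds caption
    embeddings (and vice versa for T→I); pair i is assumed to match
    gallery item i.
    """
    q = query_emb / np.linalg.norm(query_emb, axis=-1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=-1, keepdims=True)
    nearest = (q @ g.T).argmax(axis=-1)
    return (nearest == np.arange(len(q))).mean()
```

Both metrics reduce to nearest-neighbor lookups in the shared embedding space, which is why a single pair of towers yields all three columns.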

The first three rows are the best available models trained on open-source data, originally published in the google-research/vision_transformer repository. These models were re-evaluated with this codebase using the following commands:

```
big_vision.tools.eval_only --config big_vision/configs/proj/image_text/lit_coco.py:txt=bert_base,img=B/16,img_head,init=gs://vit_models/lit/LiT-B16B.npz

big_vision.tools.eval_only --config big_vision/configs/proj/image_text/lit_coco.py:txt=bert_base,img=B/16,init=gs://vit_models/lit/LiT-B16B_2.npz

big_vision.tools.eval_only --config big_vision/configs/proj/image_text/lit_coco.py:txt=bert_large,img=L/16,init=gs://vit_models/lit/LiT-L16L.npz
```

Unfortunately, the public multi-modal datasets CC12M and YFCC100M are not yet available in tfds, so these models cannot be reproduced with this codebase. For this reason we provide the much weaker model coco_B16B in the fourth row, which was trained on the small tfds dataset coco_captions and can be used to verify the correctness of the codebase (workdir).

Changelog

  • 2022-08-18: Added the LiT-B16B_2 model, which was trained for 60k steps (LiT-B16B: 30k) without a linear head on the image side (LiT-B16B: 768) and has better performance.