Haoqin Tu*, Bingchen Zhao*, Chen Wei, Cihang Xie (*Equal Contribution)
Our paper is online now: https://arxiv.org/abs/2309.07120
Please follow LLaVA for setting up the training environment.
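If you are starting from a clean machine, the commands below are a minimal setup sketch following LLaVA's standard installation flow; the conda environment name and Python version are assumptions, so defer to LLaVA's own instructions if they differ.

```bash
# Minimal setup sketch following LLaVA's installation flow.
# The conda environment name and Python version are illustrative assumptions.
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip
pip install -e .
```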
We list all the model and vision-text projector weights used in the paper below.
Model | Pretrain Weights | Instruction Tuned Weights |
---|---|---|
LLaMA-7B | ckpt | Finetune ckpt |
Vicuna-7B | ckpt | Finetune ckpt |
LLaMA-3B | ckpt | Finetune ckpt, LoRA ckpt |
Alpaca-3B | ckpt | Finetune ckpt, LoRA ckpt |
LLaMA2-7B | ckpt | Finetune ckpt, LoRA ckpt |
LLaMA2-chat-7B | ckpt | Finetune ckpt, LoRA ckpt |
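The LoRA checkpoints contain only adapter weights; to obtain a standalone model you can merge them into the corresponding base LLM. The sketch below uses the merge script from upstream LLaVA; the script path, flag names, and output directory are assumptions, so check your local checkout if they differ.

```bash
# Hedged sketch: merge a LoRA checkpoint into its base model with LLaVA's
# merge script. The script location, flag names, and paths are assumptions
# taken from the upstream LLaVA repository, not from this repo's docs.
python scripts/merge_lora_weights.py \
    --model-path ./checkpoints/MM-LLaMA2-7B-lora \
    --model-base meta-llama/Llama-2-7b-hf \
    --save-model-path ./checkpoints/MM-LLaMA2-7B-merged
```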
For NLP & Multi-Modal data and evaluations, please see instructions here.
We follow the training paradigm of LLaVA, which consists of two stages: (1) feature alignment: use approximately 600K filtered image-text pairs from CC3M to connect a frozen pretrained vision encoder to a frozen LLM; (2) visual instruction tuning: use 80K filtered GPT-generated visual instruction data (see here) to teach the model to follow multimodal instructions.
Please download the subset of the CC3M dataset we use in the paper here. You can check the pretraining script below.
Pretrain: LLaMA2-7B.
deepspeed llava/train/train.py --deepspeed scripts/zero3.json \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--version v0 \
--data_path /path/to/cc3m_595k.json \
--image_folder /path/to/cc3m_595k_images \
--vision_tower openai/clip-vit-large-patch14 \
--tune_mm_mlp_adapter True \
--mm_vision_select_layer -2 \
--mm_use_im_start_end True \
--bf16 True \
--output_dir ./checkpoints/MM-LLaMA2-7B-pretrain \
--num_train_epochs 1 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2400 \
--save_total_limit 1 \
--learning_rate 2e-3 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True \
--report_to wandb
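Pretraining with `--tune_mm_mlp_adapter True` only updates the vision-text projector, so the output directory should contain an `mm_projector.bin`, which the finetuning stage loads via `--pretrain_mm_mlp_adapter`. A quick sanity-check sketch (adjust the path if you changed `--output_dir`):

```bash
# Sanity-check sketch: confirm the pretraining run produced the projector
# weights that --pretrain_mm_mlp_adapter expects in the finetuning stage.
ls -lh ./checkpoints/MM-LLaMA2-7B-pretrain/mm_projector.bin
# Optionally inspect the tensor shapes stored in the projector checkpoint.
python -c "import torch; print({k: tuple(v.shape) for k, v in torch.load('./checkpoints/MM-LLaMA2-7B-pretrain/mm_projector.bin', map_location='cpu').items()})"
```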
- Data preparation: Please download llava_instruct_80k.json and the COCO train2017 images here.
- Training: You can download our pretrained projector here, and check the finetuning script below or the LoRA tuning script.
Visual Instruction Tuning: MM-LLaMA2-7B-ft.
deepspeed llava/train/train.py --deepspeed scripts/zero2.json \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--version llava_llama_2 \
--data_path path/to/llava_instruct_80k.json \
--image_folder /path/to/coco/train2017/ \
--vision_tower openai/clip-vit-large-patch14 \
--pretrain_mm_mlp_adapter ./checkpoints/MM-LLaMA2-7B-pretrain/mm_projector.bin \
--mm_vision_select_layer -2 \
--mm_use_im_start_end True \
--bf16 True \
--output_dir ./checkpoints/MM-LLaMA2-7B-ft \
--num_train_epochs 1 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 2 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 5000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--lazy_preprocess True \
--report_to wandb
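Once finetuning finishes, you can try the model interactively on a single image. The sketch below uses LLaVA's CLI demo; the module path and flag names follow upstream LLaVA and the image path is a placeholder, so adapt them to your checkout.

```bash
# Hedged sketch: chat with the finetuned model using LLaVA's CLI demo.
# Module and flag names follow upstream LLaVA; the image path is a placeholder.
python -m llava.serve.cli \
    --model-path ./checkpoints/MM-LLaMA2-7B-ft \
    --image-file /path/to/an/example.jpg
```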
The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA, Vicuna, and GPT-4. The dataset is released under CC BY-NC 4.0 (allowing only non-commercial use), and models trained on the dataset should not be used outside of research purposes.
If you find this repo useful for your research and applications, please cite using this BibTeX:
@article{tu2023sight,
  title={Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics},
  author={Tu, Haoqin and Zhao, Bingchen and Wei, Chen and Xie, Cihang},
  journal={arXiv preprint arXiv:2309.07120},
  year={2023}
}
This work is partially supported by a gift from Open Philanthropy. We thank the Center for AI Safety for supporting our computing needs. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors.
- Our training code is largely borrowed from LLaVA, which is truly an amazing resource.