Jinheng Xie1*
Weijia Mao1*
Zechen Bai1*
David Junhao Zhang1*
Weihao Wang2
Kevin Qinghong Lin1
Yuchao Gu1
Zhijie Chen2
Zhenheng Yang2
Mike Zheng Shou1
1 Show Lab, National University of Singapore 2 ByteDance
An overview of Show-o. The input data, regardless of modality, is tokenized and formatted into an input sequence. Show-o processes text tokens autoregressively with causal attention and image tokens via (discrete) denoising diffusion with full attention, then generates the desired output. Specifically, Show-o is capable of image captioning, visual question answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed-modality generation.
Characteristics comparison among understanding-only, generation-only, and unified (understanding & generation) models. Vision and Language denote the representations from the corresponding input modalities. In this context, Diffusion covers both continuous and discrete diffusion.
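The overview above describes Show-o's key design: one transformer whose text tokens attend causally while image tokens attend to each other bidirectionally. A minimal sketch of such a mixed attention mask, assuming a single contiguous image-token block; the helper name and construction are illustrative, not the repository's actual implementation:

```python
import torch

def omni_attention_mask(seq_len, img_start, img_end):
    """Boolean attention mask (True = may attend): text tokens attend
    causally, while image tokens in [img_start, img_end) attend to one
    another with full (bidirectional) attention. Illustrative sketch only."""
    # Start from a standard causal (lower-triangular) mask.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Allow full attention within the image-token block.
    mask[img_start:img_end, img_start:img_end] = True
    return mask

# Example: 4 text tokens followed by 3 image tokens.
mask = omni_attention_mask(7, 4, 7)
```

Note that image tokens still attend to all preceding text tokens via the causal part of the mask, which is what conditions generation on the prompt.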
- [2024-08-23] We release the inference code of Show-o (1.3B) for multimodal understanding and generation, including image captioning, visual question answering (VQA), text-to-image generation, and text-guided inpainting and extrapolation.
- Release the inference code.
- Release the training code (in the coming weeks).
- Scale up the model size (based on LLaMA3) and the amount of training data.
First, set up the environment:
pip3 install -r requirements.txt
Download the model weights of a pre-trained LLM (Phi-1.5):
git lfs install
git clone https://huggingface.co/microsoft/phi-1_5
Download the model weights of Show-o and put them in a directory with the structure below:
├── checkpoints/
| ├── magvitv2.pth
| ├── showo.bin
| ├── showo_w_clip_vit.bin
| ├── phi-1_5
Log in to your wandb account on your machine or server.
wandb login <your wandb key>
Inference demo for Multimodal Understanding. You can view the results on wandb.
python3 inference_mmu.py config=configs/showo_demo_w_clip_vit.yaml \
mmu_image_root=./mmu_validation question='Please describe this image in detail. *** Do you think the image is unusual or not?' \
pretrained_model_path=./checkpoints/showo_w_clip_vit.bin
Inference demo for Text-to-Image Generation. You can view the results on wandb.
python3 inference_t2i.py config=configs/showo_demo.yaml \
batch_size=32 validation_prompts_file=validation_prompts/showoprompts.txt \
guidance_scale=1.75 generation_timesteps=18 \
mode='t2i' pretrained_model_path=./checkpoints/showo.bin
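Two parameters in the command above drive the discrete denoising sampler: `guidance_scale` weights the prompt-conditioned predictions against unconditional ones, and `generation_timesteps` sets how many iterative unmasking steps are run. A sketch of both, assuming the common classifier-free-guidance formulation and a MaskGIT-style cosine masking schedule; Show-o's exact variants may differ:

```python
import math

def cfg_logit(cond, uncond, guidance_scale):
    """Classifier-free guidance on a (scalar) token logit, as commonly
    formulated; guidance_scale=1.75 in the demo command."""
    return uncond + guidance_scale * (cond - uncond)

def tokens_still_masked(step, total_steps, num_tokens):
    """Cosine schedule (as in MaskGIT) for how many image tokens remain
    masked after each of the `generation_timesteps` steps: the sampler
    keeps its least-confident predictions masked and re-predicts them."""
    ratio = (step + 1) / total_steps
    return int(math.floor(num_tokens * math.cos(math.pi / 2 * ratio)))

schedule = [tokens_still_masked(s, 18, 1024) for s in range(18)]
```

The schedule unmasks few tokens early (when predictions are uncertain) and many at the end, finishing with zero masked tokens after the final step.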
Inference demo for Text-guided Inpainting. You can view the results on wandb.
python3 inference_t2i.py config=configs/showo_demo.yaml \
batch_size=32 \
guidance_scale=1.75 generation_timesteps=16 \
pretrained_model_path=./checkpoints/showo.bin \
mode='inpainting' prompt='A blue sports car with sleek curves and tinted windows, parked on a bustling city street.' \
image_path=./inpainting_validation/bus.jpg inpainting_mask_path=./inpainting_validation/bus_mask.webp
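Conceptually, inpainting in a discrete-token model amounts to replacing the image tokens under the user-provided mask with the [MASK] token and re-generating them conditioned on the prompt, while keeping the remaining tokens fixed. A hypothetical sketch of that setup; the function name, placeholder mask-token id, and exact mechanism are assumptions:

```python
import torch

def mask_image_tokens(image_tokens, region_mask, mask_token_id):
    """Replace tokens under the inpainting region with the [MASK] id;
    tokens outside the region are kept as-is. Illustrative sketch only."""
    filler = torch.full_like(image_tokens, mask_token_id)
    return torch.where(region_mask, filler, image_tokens)

tokens = torch.tensor([11, 22, 33, 44])
region = torch.tensor([False, True, True, False])
masked = mask_image_tokens(tokens, region, 8192)  # 8192 is a placeholder [MASK] id
```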
Inference demo for Text-guided Extrapolation. You can view the results on wandb.
python3 inference_t2i.py config=configs/showo_demo.yaml \
batch_size=32 \
guidance_scale=1.75 generation_timesteps=16 \
pretrained_model_path=./checkpoints/showo.bin \
mode='extrapolation' extra_direction='left *** left *** left *** right *** right *** right' offset=0 prompt='a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees.' \
image_path=./inpainting_validation/alpine_lake.jpg
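The commands above appear to pack multiple values (prompts, directions) into a single CLI argument using ` *** ` as a separator, so that each extrapolation step gets its own direction and prompt. A sketch of that parsing, inferred from the commands rather than from the scripts themselves:

```python
def split_cli_list(arg):
    """Split a ' *** '-separated CLI argument into a list of values.
    The separator convention is inferred from the demo commands."""
    return [part.strip() for part in arg.split('***')]

directions = split_cli_list('left *** left *** right')
```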
To cite the paper and model, please use the following BibTeX entry:
@article{xie2024showo,
title={Show-o: One Single Transformer to Unify Multimodal Understanding and Generation},
author={Xie, Jinheng and Mao, Weijia and Bai, Zechen and Zhang, David Junhao and Wang, Weihao and Lin, Kevin Qinghong and Gu, Yuchao and Chen, Zhijie and Yang, Zhenheng and Shou, Mike Zheng},
journal={arXiv preprint arXiv:2408.12528},
year={2024}
}
This work is heavily based on open-muse, Phi-1.5, muse-maskgit-pytorch, maskgit, taming-transformers, transformers, accelerate, diffusers, and webdataset. Thanks to all the authors for their great work.