Show-o

One Single Transformer to Unify Multimodal Understanding and Generation

Jinheng Xie¹* Weijia Mao¹* Zechen Bai¹* David Junhao Zhang¹*
Weihao Wang² Kevin Qinghong Lin¹ Yuchao Gu¹ Zhijie Chen² Zhenheng Yang² Mike Zheng Shou¹

¹ Show Lab, National University of Singapore ² Bytedance

An overview of Show-o. The input data, regardless of its modalities, is tokenized and then prompted into a formatted input sequence. Show-o processes text tokens autoregressively with causal attention and image tokens in (discrete) denoising diffusion modeling via full attention, and then generates the desired output. Specifically, Show-o is capable of handling image captioning, visual question answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed modality generation.
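The mixed attention pattern described above can be illustrated with a small mask-building sketch. This is not the Show-o implementation: the function name `build_mixed_attention_mask` and the `image_ids` encoding (0 for a text token, a positive integer for each image) are assumptions made for illustration. The sketch assumes that every token may attend to itself and to all earlier tokens, and that tokens belonging to the same image may additionally attend to each other in both directions.

```python
from typing import List


def build_mixed_attention_mask(image_ids: List[int]) -> List[List[bool]]:
    """Return mask[i][j] == True when token i may attend to token j.

    image_ids[k] is 0 for a text token and a positive integer identifying
    the image that token k belongs to (hypothetical encoding).
    """
    n = len(image_ids)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Causal part: every token sees itself and earlier tokens.
            causal = j <= i
            # Full-attention part: tokens of the same image see each other.
            same_image = image_ids[i] != 0 and image_ids[i] == image_ids[j]
            mask[i][j] = causal or same_image
    return mask


if __name__ == "__main__":
    # Two text tokens, three image tokens, one trailing text token.
    layout = [0, 0, 1, 1, 1, 0]
    for row in build_mixed_attention_mask(layout):
        print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the text rows form a lower triangle, while the three image rows share one fully connected block on top of the causal prefix.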

Characteristics comparison among understanding only, generation only, and unified (understanding & generation) models. Vision and Language indicate the representations from specific input modalities. In this context, Diffusion represents both continuous and discrete diffusion.

News

[2024-08-23] We release the inference code of Show-o (1.3B) for multimodal understanding and generation, including image captioning, visual question answering (VQA), text-to-image generation, and text-guided inpainting and extrapolation.

TODO

Release the inference code.
Release the training code (in the coming weeks).
Scale up the model size (based on LLaMA3) and increase the amount of training data.

Getting Started

First, set up the environment:

pip3 install -r requirements.txt

Download the weights of the pre-trained LLM (Phi-1.5):

git lfs install
git clone https://huggingface.co/microsoft/phi-1_5

Download the Show-o model weights and place them in a directory with the structure below:

checkpoints/
├── magvitv2.pth
├── showo.bin
├── showo_w_clip_vit.bin
└── phi-1_5
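Before running the demos, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the helper name `missing_checkpoint_entries` is hypothetical, and the entry names are taken from the directory structure shown above.

```python
from pathlib import Path
from typing import List

# File and directory names taken from the layout above.
EXPECTED_ENTRIES = ["magvitv2.pth", "showo.bin", "showo_w_clip_vit.bin", "phi-1_5"]


def missing_checkpoint_entries(root: str = "checkpoints") -> List[str]:
    """Return the expected entries that are absent under `root`."""
    base = Path(root)
    return [name for name in EXPECTED_ENTRIES if not (base / name).exists()]


if __name__ == "__main__":
    missing = missing_checkpoint_entries()
    if missing:
        print("Missing from checkpoints/:", ", ".join(missing))
    else:
        print("All expected checkpoint entries are present.")
```

The check only looks at names; it does not verify that the files are complete or loadable.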

Log in to your wandb account on your machine or server:

wandb login <your wandb keys>

Run the inference demo for multimodal understanding; the results can be viewed on wandb:

python3 inference_mmu.py config=configs/showo_demo_w_clip_vit.yaml \
mmu_image_root=./mmu_validation question='Please describe this image in detail. *** Do you think the image is unusual or not?' \
pretrained_model_path=./checkpoints/showo_w_clip_vit.bin

Run the inference demo for text-to-image generation; the results can be viewed on wandb:

python3 inference_t2i.py config=configs/showo_demo.yaml \
batch_size=32 validation_prompts_file=validation_prompts/showoprompts.txt \
guidance_scale=1.75 generation_timesteps=18 \
mode='t2i' pretrained_model_path=./checkpoints/showo.bin

Run the inference demo for text-guided inpainting; the results can be viewed on wandb:

python3 inference_t2i.py config=configs/showo_demo.yaml \
batch_size=32 \
guidance_scale=1.75 generation_timesteps=16 \
pretrained_model_path=./checkpoints/showo.bin \
mode='inpainting' prompt='A blue sports car with sleek curves and tinted windows, parked on a bustling city street.' \
image_path=./inpainting_validation/bus.jpg inpainting_mask_path=./inpainting_validation/bus_mask.webp

Run the inference demo for text-guided extrapolation; the results can be viewed on wandb:

python3 inference_t2i.py config=configs/showo_demo.yaml \
batch_size=32 \
guidance_scale=1.75 generation_timesteps=16 \
pretrained_model_path=./checkpoints/showo.bin \
mode='extrapolation' extra_direction='left *** left *** left *** right *** right *** right' offset=0 prompt='a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees.' \
image_path=./inpainting_validation/alpine_lake.jpg
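In the extrapolation command above, the prompt is repeated once per direction and the entries are joined with ` *** `. Writing those long strings by hand is error-prone, so here is a minimal sketch of a helper that builds both values. The name `build_extrapolation_args` is hypothetical, and the sketch assumes that the script expects exactly one prompt per direction, which is what the example command suggests but is not confirmed here.

```python
from typing import List, Tuple

SEPARATOR = " *** "  # delimiter used in the command above


def build_extrapolation_args(prompt: str, directions: List[str]) -> Tuple[str, str]:
    """Return (extra_direction, prompt) strings with one prompt per direction."""
    if not directions:
        raise ValueError("at least one direction is required")
    extra_direction = SEPARATOR.join(directions)
    # Repeat the same prompt once for every extrapolation step.
    repeated_prompt = SEPARATOR.join([prompt] * len(directions))
    return extra_direction, repeated_prompt


if __name__ == "__main__":
    directions = ["left"] * 3 + ["right"] * 3
    extra_direction, prompt = build_extrapolation_args(
        "a serene natural landscape featuring a clear, blue lake "
        "surrounded by lush green trees.",
        directions,
    )
    print(f"extra_direction='{extra_direction}'")
    print(f"prompt='{prompt}'")
```

The two printed values can be pasted into the `extra_direction=` and `prompt=` overrides of the command.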

Citation

To cite the paper and model, please use the following BibTeX entry:

@article{xie2024showo,
  title={Show-o: One Single Transformer to Unify Multimodal Understanding and Generation},
  author={Xie, Jinheng and Mao, Weijia and Bai, Zechen and Zhang, David Junhao and Wang, Weihao and Lin, Kevin Qinghong and Gu, Yuchao and Chen, Zhijie and Yang, Zhenheng and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2408.12528},
  year={2024}
}

Acknowledgments

This work is heavily based on open-muse, Phi-1.5, muse-maskgit-pytorch, maskgit, taming-transformers, transformers, accelerate, diffusers, and webdataset. Thanks to all the authors for their great work.
