We propose a novel hybrid vision-transformer-based encoder-decoder framework, named Query Outpainting TRansformer (QueryOTR), for extrapolating visual context on all sides of a given image. The global modeling capacity of the patch-wise attention mechanism allows us to formulate image extrapolation from the query standpoint. A novel Query Expansion Module (QEM) is designed to integrate information from the encoder's output into the predicted queries, thereby accelerating the convergence of the pure transformer even on relatively small datasets. To further enhance connectivity between patches, the proposed Patch Smoothing Module (PSM) re-allocates and averages the overlapped regions, producing seamless predicted images. Experiments show that QueryOTR generates visually appealing, smooth, and realistic results compared with state-of-the-art image outpainting approaches.
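To illustrate the overlap-averaging idea behind PSM, here is a minimal PyTorch sketch. The patch size, stride, and image resolution are illustrative assumptions, not the repository's actual module:

```python
# Minimal sketch of overlap averaging: overlapping patch predictions are folded
# back onto the image grid and divided by the per-pixel overlap count.
import torch
import torch.nn.functional as F

def average_overlapping_patches(patches, output_size, kernel_size, stride):
    """patches: (B, C*kernel_size*kernel_size, L), as produced by F.unfold."""
    ones = torch.ones_like(patches)
    summed = F.fold(patches, output_size, kernel_size=kernel_size, stride=stride)
    counts = F.fold(ones, output_size, kernel_size=kernel_size, stride=stride)
    return summed / counts.clamp(min=1.0)

# Example: a 3x192x192 prediction assembled from overlapping 32x32 patches (stride 16).
x = torch.randn(1, 3, 192, 192)
patches = F.unfold(x, kernel_size=32, stride=16)
smoothed = average_overlapping_patches(patches, (192, 192), kernel_size=32, stride=16)
```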
PyTorch >= 1.10.1; Python >= 3.7; CUDA >= 11.3; torchvision
NOTE: The code was tested to work on Linux with torch 1.7 and 1.9, and on Windows 10 with torch 1.10.1. However, there is a potential "Inplace Operation Error" bug with PyTorch < 1.10, which is quite subtle and not yet fixed. If you find out why the bug occurs, please let us know.
[2022/11/7] We updated the code. We found that the official MAE code may degrade performance for an unknown reason (by about 0.5-1 FID), so we reverted to the unofficial MAE implementation. Meanwhile, we uploaded a trained checkpoint on Scenery (Google Drive), which reaches FID 20.38 and IS 3.959. Note that results may vary due to randomness in the code, e.g., one of the inputs to QEM is noise.
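If you want to reduce this run-to-run variation, one option is to pin the random seeds before training or evaluation. This is a generic sketch, not a flag the repository necessarily exposes:

```python
# Generic seeding sketch to reduce run-to-run variation from the noise input to
# QEM; results may still differ slightly across hardware and CUDA/cuDNN versions.
import random
import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)
```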
Scenery consists of about 6,000 images, from which we randomly select 1,000 images for evaluation. The training and test sets can be downloaded here.
We also provide our split of the Scenery dataset here (baidu_pan).
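The links above already provide our split. If you need to create a similar random hold-out split for your own data, a minimal sketch follows; folder names and the file extension are placeholders:

```python
# Randomly hold out 1,000 images for evaluation, mirroring the Scenery split
# described above. Paths below are hypothetical placeholders.
import random
import shutil
from pathlib import Path

src = Path("scenery/all_images")
train_dir = Path("scenery/train")
test_dir = Path("scenery/test")
train_dir.mkdir(parents=True, exist_ok=True)
test_dir.mkdir(parents=True, exist_ok=True)

images = sorted(src.glob("*.jpg"))
random.seed(0)
random.shuffle(images)

for img in images[:1000]:
    shutil.copy(img, test_dir / img.name)
for img in images[1000:]:
    shutil.copy(img, train_dir / img.name)
```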
Building contains about 16,000 training images and 1,500 testing images, which can be found here.
The WikiArt dataset can be downloaded here. We use the genre-based split, which contains 45,503 training images and 19,492 testing images.
Before reproducing our results, download the ViT pretrained checkpoint here and use it to initialize the encoder weights.
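A hedged sketch of what this initialization looks like, assuming a timm-style ViT-Base encoder. The repository's actual encoder class, checkpoint filename, and state-dict keys may differ, and main.py may handle this step itself:

```python
# Load a ViT checkpoint into an encoder with non-strict matching, so extra or
# missing keys (e.g., decoder weights in an MAE checkpoint) are simply reported.
import torch
import timm  # assumption: a timm ViT stands in for the repo's encoder class

encoder = timm.create_model("vit_base_patch16_224", pretrained=False)
checkpoint = torch.load("vit_pretrain.pth", map_location="cpu")  # hypothetical filename
state_dict = checkpoint.get("model", checkpoint)  # MAE-style checkpoints nest weights under "model"

missing, unexpected = encoder.load_state_dict(state_dict, strict=False)
print(f"{len(missing)} missing keys, {len(unexpected)} unexpected keys")
```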
To train on your dataset, run:
```
CUDA_VISIBLE_DEVICES=<GPUs> python main.py --name=EXPERIMENT_NAME --data_root=YOUR_TRAIN_PATH --patch_mean=YOUR_PATCH_MEAN --patch_std=YOUR_PATCH_STD
```
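A hedged helper for estimating `--patch_mean` and `--patch_std` on your own data. Whether the flags expect pixel-level or patch-level statistics is an assumption; this sketch computes per-channel pixel mean/std over a flat image folder at an assumed 192x192 resolution:

```python
# Estimate per-channel mean/std over a folder of images as candidate values for
# --patch_mean / --patch_std. Path and file extension are placeholders.
from pathlib import Path
import torch
from PIL import Image
from torchvision import transforms

to_tensor = transforms.Compose([transforms.Resize((192, 192)), transforms.ToTensor()])

channel_sum = torch.zeros(3)
channel_sq_sum = torch.zeros(3)
n_pixels = 0
for path in Path("YOUR_TRAIN_PATH").glob("*.jpg"):
    img = to_tensor(Image.open(path).convert("RGB"))
    channel_sum += img.sum(dim=(1, 2))
    channel_sq_sum += (img ** 2).sum(dim=(1, 2))
    n_pixels += img.shape[1] * img.shape[2]

mean = channel_sum / n_pixels
std = (channel_sq_sum / n_pixels - mean ** 2).sqrt()
print("patch_mean:", [round(v, 4) for v in mean.tolist()])
print("patch_std:", [round(v, 4) for v in std.tolist()])
```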
To evaluate on your dataset, run:
```
CUDA_VISIBLE_DEVICES=<GPUs> python evaluate.py --r=EXPERIMENT_NAME --data_root=YOUR_TEST_PATH --patch_mean=YOUR_PATCH_MEAN --patch_std=YOUR_PATCH_STD
```
Our code is built upon MAE, pytorch-fid, and inception-score.
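For a standalone FID check independent of evaluate.py, the pytorch-fid package can be called from Python. This is a sketch; the two folder paths are placeholders for real and generated images:

```python
# Compute FID between a folder of ground-truth images and a folder of outpainted
# results using the pytorch-fid package.
import torch
from pytorch_fid.fid_score import calculate_fid_given_paths

device = "cuda" if torch.cuda.is_available() else "cpu"
fid = calculate_fid_given_paths(
    ["YOUR_TEST_PATH", "YOUR_RESULTS_PATH"],  # real images vs. generated images
    batch_size=50,
    device=device,
    dims=2048,  # pool3 features of InceptionV3, the standard choice
)
print("FID:", fid)
```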
```bibtex
@inproceedings{yao2022qotr,
  title={Outpainting by Queries},
  author={Yao, Kai and Gao, Penglei and Yang, Xi and Sun, Jie and Zhang, Rui and Huang, Kaizhu},
  booktitle={European Conference on Computer Vision},
  pages={153--169},
  year={2022},
  organization={Springer}
}
```