Junshu Tang, Tengfei Wang, Bo Zhang, Ting Zhang, Ran Yi, Lizhuang Ma, and Dong Chen.
In this work, we investigate the problem of creating high-fidelity 3D content from only a single image. This is inherently challenging: it essentially involves estimating the underlying 3D geometry while simultaneously hallucinating unseen textures. To address this challenge, we leverage prior knowledge from a well-trained 2D diffusion model to act as 3D-aware supervision for 3D creation. Our approach, Make-It-3D, employs a two-stage optimization pipeline: the first stage optimizes a neural radiance field by incorporating constraints from the reference image at the frontal view and diffusion prior at novel views; the second stage transforms the coarse model into textured point clouds and further elevates the realism with diffusion prior while leveraging the high-quality textures from the reference image. Extensive experiments demonstrate that our method outperforms prior works by a large margin, resulting in faithful reconstructions and impressive visual quality. Our method presents the first attempt to achieve high-quality 3D creation from a single image for general objects and enables various applications such as text-to-3D creation and texture editing.
- Release coarse stage training code
- Release all training code (coarse + refine stage)
- Release the test alpha data for all results in the paper
- Release more applications
Install with pip:
pip install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio===0.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
pip install git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch
pip install git+https://github.com/openai/CLIP.git
pip install git+https://github.com/huggingface/diffusers.git
pip install git+https://github.com/huggingface/huggingface_hub.git
pip install git+https://github.com/facebookresearch/pytorch3d.git
pip install git+https://github.com/CarolusSolis/contextual_loss_pytorch.git
Other dependencies:
pip install -r requirements.txt
pip install ./raymarching
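After installation, a quick check that the packages can be located may save a failed training run. The snippet below is a minimal sketch, not part of the repository; the import names in REQUIRED_MODULES are assumptions inferred from the packages listed above and may need adjusting for your environment.

```python
import importlib.util

# Import names are assumptions inferred from the install commands above.
REQUIRED_MODULES = [
    "torch", "torchvision", "torchaudio", "tinycudann", "clip",
    "diffusers", "huggingface_hub", "pytorch3d", "contextual_loss",
]


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    missing = []
    for name in modules:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Treat lookup errors the same as "not found".
            spec = None
        if spec is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules were found.")
```

The check only locates the modules without importing them, so it does not verify CUDA builds or version compatibility.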
Training requirements
- DPT. We use the off-the-shelf single-view depth estimator DPT to predict the depth of the reference image. Clone the repository and create a weights directory:
git clone https://github.com/isl-org/DPT.git
mkdir dpt_weights
Download the pretrained model dpt_hybrid and put it in dpt_weights.
- SAM. We use the Segment Anything Model (SAM) to obtain the foreground object mask.
- BLIP2. We use BLIP2 to generate a caption for the reference image. You can instead supply the conditioning text yourself with --text "{TEXT}", which will greatly reduce the running time.
- Stable Diffusion. We use the diffusion prior from a pretrained 2D Stable Diffusion 2.0 model. To start, you may need a Hugging Face token to access the model, or you can log in with the huggingface-cli login command.
We use a progressive training strategy to generate full 360° 3D geometry. In the commands below, set the workspace name NAME and the path of the reference image IMGPATH. We first optimize the scene under frontal camera views.
python main.py --workspace ${NAME} --ref_path "${IMGPATH}" --phi_range 135 225 --iters 2000
Then we spread the camera view samples to the full 360°. If you need the prompt condition "back view", add the --need_back option.
python main.py --workspace ${NAME} --ref_path "${IMGPATH}" --phi_range 0 360 --albedo_iters 3500 --iters 5000 --final
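The two coarse-stage runs above can also be assembled from a small script. This is a minimal sketch rather than a tool shipped with the repository: the helper name coarse_stage_commands is hypothetical, the flags are copied from the commands above, and attaching --need_back to the 360° run is an assumption based on where the option is described.

```python
import shlex


def coarse_stage_commands(name, img_path, need_back=False):
    """Build the two coarse-stage command lines as argument lists."""
    common = ["python", "main.py", "--workspace", name, "--ref_path", img_path]
    # Step 1: optimize under frontal camera views only.
    frontal = common + ["--phi_range", "135", "225", "--iters", "2000"]
    # Step 2: spread the camera samples to the full 360 degrees.
    full = common + ["--phi_range", "0", "360",
                     "--albedo_iters", "3500", "--iters", "5000", "--final"]
    if need_back:
        # Assumption: the "back view" prompt condition belongs to the 360 run.
        full.append("--need_back")
    return [frontal, full]


if __name__ == "__main__":
    # "demo" and "my img.png" are placeholder values.
    for cmd in coarse_stage_commands("demo", "my img.png"):
        print(shlex.join(cmd))
```

Each returned list can be passed to subprocess.run(cmd, check=True) from the repository root, so the second run starts only if the first one succeeds.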
If you encounter the "long geometry" issue, try increasing the reference fov and adjusting the related settings. For example:
python main.py --workspace ${NAME} --ref_path "${IMGPATH}" --phi_range 135 225 --iters 2000 --fov 60 --fovy_range 50 70 --blob_radius 0.2
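For intuition about the --fov setting, the standard pinhole-camera relation links field of view and focal length: focal = size / (2 * tan(fov / 2)). The snippet below is a general illustration of that relation, not code from this repository; the function name focal_length_px and the 512-pixel image size are assumptions chosen for the example.

```python
import math


def focal_length_px(fov_deg, image_size_px):
    """Pinhole focal length in pixels for a field of view given in degrees."""
    return 0.5 * image_size_px / math.tan(math.radians(fov_deg) / 2.0)


if __name__ == "__main__":
    # Compare a few field-of-view values for a hypothetical 512-pixel image.
    for fov in (40, 50, 60, 70):
        print(fov, round(focal_length_px(fov, 512), 1))
```

For a fixed image size, a larger field of view corresponds to a shorter focal length, which is the camera parameter that changes when the reference fov is increased.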
After the coarse stage training, add the --refine option to run the refine stage. We optimize the scene under frontal camera views.
python main.py --workspace ${NAME} --ref_path "${IMGPATH}" --phi_range 135 225 --refine
You can change the number of refine-stage training iterations with the --refine_iters option.
python main.py --workspace ${NAME} --ref_path "${IMGPATH}" --phi_range 135 225 --refine_iters 3000 --refine
Note: We additionally use a contextual loss in the refine stage, which we find helps sharpen the texture. You may need to install contextual_loss_pytorch before training.
pip install git+https://github.com/S-aiueo32/contextual_loss_pytorch.git
Hallucinating 3D geometry and generating novel views from a single image of a general object is a challenging task. While our method demonstrates strong capability in creating 3D content from most images with a single centered object, it may still have difficulty reconstructing solid geometry in complex cases. If you encounter any bugs, please feel free to contact us.
If you find this code helpful for your research, please cite:
@InProceedings{Tang_2023_ICCV,
author = {Tang, Junshu and Wang, Tengfei and Zhang, Bo and Zhang, Ting and Yi, Ran and Ma, Lizhuang and Chen, Dong},
title = {Make-It-3D: High-fidelity 3D Creation from A Single Image with Diffusion Prior},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2023},
pages = {22819-22829}
}
This code borrows heavily from Stable-Dreamfusion.