RenderDiffusion: Image Diffusion for 3D Reconstruction, Inpainting and Generation

1 University of Edinburgh    2 Adobe Research    3 University of Glasgow    4 UCL
CVPR 2023
RenderDiffusion teaser showing 3D reconstruction, inpainting and generation.

Abstract

Diffusion models currently achieve state-of-the-art performance for both conditional and unconditional image generation. However, so far, image diffusion models do not support tasks required for 3D understanding, such as view-consistent 3D generation or single-view object reconstruction. In this paper, we present RenderDiffusion, the first diffusion model for 3D generation and inference that can be trained using only monocular 2D supervision.

At the heart of our method is a novel image denoising architecture that generates and renders an intermediate three-dimensional representation of a scene in each denoising step. This imposes a strong inductive bias on the diffusion process, yielding a 3D-consistent representation while requiring only 2D supervision. The resulting 3D representation can be rendered from any viewpoint.

We evaluate RenderDiffusion on the ShapeNet and CLEVR datasets and show competitive performance for the generation of 3D scenes and the inference of 3D scenes from 2D images. Additionally, our diffusion-based approach allows us to use 2D inpainting to edit 3D scenes.

Method

RenderDiffusion architecture

Our method builds on the successful training and generation setup of 2D image diffusion models, which are trained to denoise input images that have varying amounts of added noise. At test time, novel images are generated by applying the model over multiple steps, progressively recovering an image from a pure noise sample.
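
As a rough illustration (not the paper's exact configuration), the sketch below shows this standard training objective in PyTorch. It assumes a denoiser that directly predicts the clean image (x0-parameterisation) and also returns its intermediate 3D representation; this interface is sketched in the next block. The noise schedule, tensor shapes, and names such as denoiser and alphas_cumprod are illustrative.

import torch

# Illustrative linear noise schedule; not the paper's exact values.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def diffusion_training_loss(denoiser, x0, camera):
    """One training step: noise a clean image, ask the denoiser to recover it."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)
    eps = torch.randn_like(x0)
    a = alphas_cumprod.to(x0.device)[t].view(b, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * eps      # forward noising
    x0_pred, _ = denoiser(x_t, t, camera)             # denoise in a single shot
    return torch.nn.functional.mse_loss(x0_pred, x0)  # 2D photometric loss only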

We keep this training and generation setup, but modify the architecture of the main denoiser to encode the noisy input image into a 3D representation of the scene that is volumetrically rendered to obtain the denoised output image. This introduces an inductive bias that favors 3D scene consistency, and allows us to render the 3D representation from novel viewpoints.
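
The structural sketch below illustrates such a denoiser. The image-to-triplane encoder and the volumetric renderer are replaced by tiny stand-in convolutions so the example runs end to end; a faithful implementation would condition on the timestep, cast rays from the camera pose, and integrate triplane features along them.

import torch
import torch.nn as nn

class RenderDenoiser(nn.Module):
    """Structural sketch of a 3D-aware denoiser (all names are illustrative)."""
    def __init__(self, feat_dim=32):
        super().__init__()
        # Stand-in for the image -> triplane encoder (3 axis-aligned feature planes).
        self.to_triplane = nn.Conv2d(3, 3 * feat_dim, kernel_size=3, padding=1)
        # Stand-in for the volumetric renderer (3D representation -> RGB image).
        self.render = nn.Conv2d(3 * feat_dim, 3, kernel_size=3, padding=1)

    def forward(self, x_t, t, camera):
        # `t` and `camera` would condition the encoder and renderer in a real
        # model; this stand-in ignores them for brevity.
        planes = self.to_triplane(x_t)   # latent 3D scene representation
        x0_pred = self.render(planes)    # differentiable render -> denoised view
        return x0_pred, planes           # planes can be re-rendered from new views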

3D Reconstruction

Unlike existing 2D diffusion models, RenderDiffusion can be used to reconstruct 3D scenes from 2D images. The choice of reconstruction noise level introduces an interesting control that allows us to trade off between reconstruction fidelity and generalization to out-of-distribution input images.
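
A hedged sketch of this reconstruction procedure, reusing the schedule and denoiser interface from the sketches above: the input image is noised to a chosen level t_start and then denoised step by step. The DDIM-style deterministic update and the parameter names are assumptions for illustration, not the exact sampler used in the paper.

@torch.no_grad()
def reconstruct(denoiser, image, camera, t_start=250):
    """Noise `image` to level `t_start`, then denoise it step by step.

    Larger `t_start` discards more of the input and lets the model fill in
    generated content; smaller values stay closer to the observation.
    """
    a = alphas_cumprod.to(image.device)   # schedule from the sketch above
    x = a[t_start].sqrt() * image + (1 - a[t_start]).sqrt() * torch.randn_like(image)
    x0_pred, planes = None, None
    for t in range(t_start, -1, -1):
        t_batch = torch.full((image.shape[0],), t, device=image.device, dtype=torch.long)
        x0_pred, planes = denoiser(x, t_batch, camera)
        if t > 0:
            # DDIM-style deterministic step toward the predicted clean image.
            eps = (x - a[t].sqrt() * x0_pred) / (1 - a[t]).sqrt()
            x = a[t - 1].sqrt() * x0_pred + (1 - a[t - 1]).sqrt() * eps
    return x0_pred, planes                # denoised view and latent 3D scene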

3D reconstruction results

Out-of-Distribution Reconstruction

Using a 3D-aware denoiser allows us to reconstruct a 3D scene from noisy images, where information that is lost to the noise is filled in with generated content. By adding more noise, we can generalize to input images that are increasingly out-of-distribution, at the cost of reconstruction fidelity.
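
In terms of the reconstruction sketch above, this trade-off is simply the choice of t_start; the inputs (image, ood_image, camera) and the specific noise levels below are placeholders for illustration.

# In-distribution input: a low noise level preserves detail from the image.
view, planes = reconstruct(denoiser, image, camera, t_start=100)

# Out-of-distribution input: more noise lets the model regenerate content it
# cannot explain, at the cost of fidelity to the original pixels.
view_ood, planes_ood = reconstruct(denoiser, ood_image, camera, t_start=600)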

Out-of-distribution reconstruction

Unconditional Generation

Our method can also generate novel 3D scenes unconditionally, producing diverse and realistic outputs.
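
In the same illustrative setup, unconditional generation amounts to running the reconstruction sketch from (almost) pure noise; the image resolution and camera below are placeholder values.

# With t_start near T the input is drowned out entirely, so starting from a
# blank image amounts to sampling a brand-new scene from pure noise.
blank = torch.zeros(1, 3, 64, 64)
view, planes = reconstruct(denoiser, blank, camera, t_start=T - 1)
# `planes` holds the generated 3D scene; re-rendering it with other camera
# poses yields view-consistent images of the same scene.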

Generation results

3D-Aware Inpainting

We apply our trained model to the task of inpainting masked 2D regions of an image while simultaneously reconstructing the 3D shape it shows. The model performs 3D-aware inpainting, finding a latent 3D structure that is consistent with the observed part of the image, and also plausible in the masked part.
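
The sketch below illustrates one common way to realize this with a diffusion model: at every denoising step, the observed pixels are re-noised to the current level and re-imposed on the sample, so only the masked region is generated while the denoiser keeps fitting a single latent 3D scene to the whole image. This mask-blending scheme is a standard diffusion-inpainting strategy and may differ in detail from the exact procedure used in the paper; it reuses the schedule and denoiser interface from the earlier sketches.

@torch.no_grad()
def inpaint(denoiser, image, mask, camera):
    """Inpaint the region where mask == 0, keeping pixels where mask == 1."""
    a = alphas_cumprod.to(image.device)   # schedule from the sketch above
    x = torch.randn_like(image)
    x0_pred, planes = None, None
    for t in range(T - 1, -1, -1):
        # Re-noise the known pixels to the current level and keep them fixed.
        known = a[t].sqrt() * image + (1 - a[t]).sqrt() * torch.randn_like(image)
        x = mask * known + (1 - mask) * x
        t_batch = torch.full((image.shape[0],), t, device=image.device, dtype=torch.long)
        x0_pred, planes = denoiser(x, t_batch, camera)
        if t > 0:
            eps = (x - a[t].sqrt() * x0_pred) / (1 - a[t]).sqrt()
            x = a[t - 1].sqrt() * x0_pred + (1 - a[t - 1]).sqrt() * eps
    # Paste back the observed pixels; `planes` is the latent 3D scene that is
    # consistent with both the observed and the inpainted regions.
    return mask * image + (1 - mask) * x0_pred, planes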

Inpainting results

Citation

@inproceedings{anciukevicius2023renderdiffusion,
    title={RenderDiffusion: Image Diffusion for 3D Reconstruction, Inpainting and Generation},
    author={Titas Anciukevi{\v{c}}ius and Zexiang Xu and Matthew Fisher and Paul Henderson and Hakan Bilen and Niloy J. Mitra and Paul Guerrero},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    year={2023}
}