We represent a 3D scene using 6D videos (three dimensions for RGB color and three for XYZ coordinates). We use a diffusion model to learn the joint distribution of multi-view RGB and XYZ frames, conditioned on a single image.
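For concreteness, here is a minimal sketch (the tensor shapes and function name are illustrative assumptions, not the model's actual layout) of how RGB and XYZ frames combine into a single 6-channel clip:

```python
import numpy as np

def make_6d_clip(rgb_frames: np.ndarray, xyz_frames: np.ndarray) -> np.ndarray:
    """Stack RGB and XYZ frames into a single 6-channel video.

    rgb_frames: (num_views, H, W, 3) colors in [0, 1]
    xyz_frames: (num_views, H, W, 3) world-space coordinates per pixel
    returns:    (num_views, H, W, 6) joint RGB-XYZ frames
    """
    assert rgb_frames.shape == xyz_frames.shape
    return np.concatenate([rgb_frames, xyz_frames], axis=-1)

# Example: an 8-view clip at 64x64 resolution (shapes are illustrative only).
clip = make_6d_clip(np.random.rand(8, 64, 64, 3), np.random.randn(8, 64, 64, 3))
print(clip.shape)  # (8, 64, 64, 6)
```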
An interactive viewer for Image-to-3D: select a conditioning image to view the synthesized RGB frames, the synthesized XYZ frames, and the resulting 3D point cloud.
During training, the model learns the joint distribution of RGB and XYZ frames, P(RGB, XYZ). At inference time, however, we can estimate conditional distributions via inpainting, for example P(RGB | XYZ) or P(XYZ | RGB).
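As a concrete illustration, below is a minimal sketch of inpainting-style conditional sampling on the joint RGB-XYZ frames. The plain DDPM schedule, the `eps_model` interface, and the function name are our assumptions, not the paper's exact sampler:

```python
import torch

@torch.no_grad()
def sample_conditional(eps_model, known, known_mask, num_steps=1000):
    """Sketch of conditional sampling (e.g. P(XYZ | RGB)) by inpainting: channels
    marked in `known_mask` are clamped to noised observations at every reverse
    step, while the remaining channels are synthesized by the diffusion model.

    eps_model:  callable (x_t, t) -> predicted noise, same shape as x_t (assumed interface)
    known:      (B, T, 6, H, W) observed frames; unknown entries can be arbitrary
    known_mask: float tensor of the same shape, 1.0 where observed, 0.0 elsewhere
    """
    betas = torch.linspace(1e-4, 0.02, num_steps)      # linear DDPM schedule (assumed)
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn_like(known)
    for t in reversed(range(num_steps)):
        # Noise the observed channels to diffusion time t and overwrite them.
        known_t = alphas_bar[t].sqrt() * known + (1 - alphas_bar[t]).sqrt() * torch.randn_like(known)
        x = known_mask * known_t + (1 - known_mask) * x

        # Standard DDPM reverse update on the full 6-channel tensor.
        eps = eps_model(x, t)
        mean = (x - betas[t] / (1 - alphas_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise

    # Return clean observations in the known channels, samples elsewhere.
    return known_mask * known + (1 - known_mask) * x
```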
This enables the diffusion model to be adapted for a variety of downstream applications in a zero-shot manner:
We estimate the joint distribution of RGB and XYZ, P(RGB, XYZ), given the single conditioning view. Then, we convert the predicted XYZ map of the input view \(\hat{\mathbf{x}}^\textrm{XYZ}_{u, v}\) into a depth map \(\mathbf{d}\) by optimizing: \[ \min_{P, K, \mathbf{d}} \sum_{u, v}\| \tilde{\mathbf{x}}^\textrm{XYZ}_{u, v} - \hat{\mathbf{x}}^\textrm{XYZ}_{u, v} \|^2_2, \] where \(\tilde{\mathbf{x}}^\textrm{XYZ}_{u, v} = P^{-1}K^{-1}\mathbf{d}_{u, v}[u, v, 1]^\text{T}\), and \(P, K\) denote the camera extrinsic and intrinsic matrices.
We estimate the conditional distribution of XYZ, P(XYZ | RGB), given all RGB frames. Then, we convert the predicted XYZ map of the input view \(\hat{\mathbf{x}}^\textrm{XYZ}_{u, v}\) into a depth map \(\mathbf{d}\) by optimizing: \[ \min_{P, K, \mathbf{d}} \sum_{u, v}\| \tilde{\mathbf{x}}^\textrm{XYZ}_{u, v} - \hat{\mathbf{x}}^\textrm{XYZ}_{u, v} \|^2_2, \] where \(\tilde{\mathbf{x}}^\textrm{XYZ}_{u, v} = P^{-1}K^{-1}\mathbf{d}_{u, v}[u, v, 1]^\text{T}\), and \(P, K\) denote the camera extrinsic and intrinsic matrices.
We estimate the conditional distribution of XYZ, P(XYZ | RGB), given all RGB frames. Then, we convert the predicted XYZ map of the input view \(\hat{\mathbf{x}}^\textrm{XYZ}_{u, v}\) into the camera parameters \(P, K\) by optimizing: \[ \min_{P, K, \mathbf{d}} \sum_{u, v}\| \tilde{\mathbf{x}}^\textrm{XYZ}_{u, v} - \hat{\mathbf{x}}^\textrm{XYZ}_{u, v} \|^2_2, \] where \(\tilde{\mathbf{x}}^\textrm{XYZ}_{u, v} = P^{-1}K^{-1}\mathbf{d}_{u, v}[u, v, 1]^\text{T}\), and \(P, K\) denote the camera extrinsic and intrinsic matrices.
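The same least-squares objective appears in all of the applications above. Below is a minimal gradient-descent sketch for a single view; the parameterization of \(P\) (axis-angle rotation plus translation), the shared-focal-length pinhole \(K\) with a centered principal point, the log-scale depth variables, and the use of Adam are our own assumptions, not the paper's implementation:

```python
import torch

def fit_depth_and_camera(xyz_pred, iters=2000, lr=1e-2):
    """Fit a depth map and pinhole camera to one predicted XYZ map by minimizing
    sum_{u,v} || P^{-1} K^{-1} d_{uv} [u, v, 1]^T - xhat^{XYZ}_{uv} ||^2.

    xyz_pred: (H, W, 3) predicted world-space coordinates for one view.
    Returns (depth, focal_length, R, t), where the extrinsic P maps world -> camera.
    """
    H, W, _ = xyz_pred.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")

    log_f = torch.zeros(1, requires_grad=True)        # focal length, log-scale
    omega = torch.zeros(3, requires_grad=True)        # rotation as axis-angle
    trans = torch.zeros(3, requires_grad=True)        # translation
    log_d = torch.zeros(H, W, requires_grad=True)     # per-pixel depth, log-scale

    opt = torch.optim.Adam([log_f, omega, trans, log_d], lr=lr)
    zero = torch.zeros(1)
    for _ in range(iters):
        f = log_f.exp() * max(H, W)
        d = log_d.exp()
        # K^{-1} [u, v, 1]^T scaled by depth: camera-space points.
        x_cam = torch.stack([(u - W / 2) / f * d, (v - H / 2) / f * d, d], dim=-1)
        # P^{-1}: world = R^T (camera - t), with R = exp([omega]_x).
        skew = torch.stack([torch.cat([zero, -omega[2:3], omega[1:2]]),
                            torch.cat([omega[2:3], zero, -omega[0:1]]),
                            torch.cat([-omega[1:2], omega[0:1], zero])])
        R = torch.matrix_exp(skew)
        x_world = (x_cam - trans) @ R                 # row-vector form of R^T (x - t)
        loss = ((x_world - xyz_pred) ** 2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return log_d.exp().detach(), log_f.exp().detach() * max(H, W), R.detach(), trans.detach()
```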
To realize camera-controlled video generation, we follow the steps below:
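As a rough, hedged sketch of one way the pieces above can combine for camera control (the z-buffer point rendering, the function names, and the overall recipe here are our assumptions, not the paper's exact procedure): render XYZ maps of the reconstructed point cloud along a user-chosen camera trajectory, then sample P(RGB | XYZ) by inpainting with `sample_conditional` above, marking the XYZ channels (and the input view's RGB) as known.

```python
import torch

def render_xyz_frames(points, K, Rs, ts, H, W):
    """Rasterize a world-space point cloud into per-view XYZ maps with a simple z-buffer.

    points: (N, 3) world coordinates (e.g. the reconstructed point cloud)
    K:      (3, 3) intrinsics; Rs, ts: (V, 3, 3) and (V, 3) world-to-camera poses
    Returns (V, H, W, 3) XYZ maps (zeros where no point projects).
    """
    V = Rs.shape[0]
    frames = torch.zeros(V, H, W, 3)
    for i in range(V):
        cam = points @ Rs[i].T + ts[i]                     # world -> camera
        z = cam[:, 2]
        keep = z > 1e-6                                    # points in front of the camera
        uv = (cam[keep] / z[keep, None]) @ K.T             # perspective projection
        u = uv[:, 0].round().long()
        v = uv[:, 1].round().long()
        inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
        u, v = u[inside], v[inside]
        depth, xyz = z[keep][inside], points[keep][inside]
        # Simple painter's-style fill: write far-to-near so, in sequential (CPU)
        # execution, the nearest point ends up stored at each pixel.
        order = torch.argsort(depth, descending=True)
        frames[i, v[order], u[order]] = xyz[order]
    return frames
```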
@inproceedings{zhang2024world,
author = {Qihang Zhang and Shuangfei Zhai and Miguel Ángel Bautista and Kevin Miao and Alexander Toshev and Joshua Susskind and Jiatao Gu},
title = {World-consistent Video Diffusion with Explicit 3D Modeling},
booktitle = {arXiv},
year = {2024}
}
We borrow the source code of this website from GeNVS.