We represent a 3D scene using 6D videos (three dimensions for RGB color and three for XYZ coordinates). We use a diffusion model to learn the joint distribution of multi-view RGB and XYZ frames, conditioned on a single image.
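For concreteness, here is a minimal sketch (the tensor shapes and function name are illustrative assumptions, not the model's actual layout) of how RGB and XYZ frames combine into a single 6-channel clip:

```python
import numpy as np

def make_6d_clip(rgb_frames: np.ndarray, xyz_frames: np.ndarray) -> np.ndarray:
    """Stack RGB and XYZ frames into a single 6-channel video.

    rgb_frames: (num_views, H, W, 3) colors in [0, 1]
    xyz_frames: (num_views, H, W, 3) world-space coordinates per pixel
    returns:    (num_views, H, W, 6) joint RGB-XYZ frames
    """
    assert rgb_frames.shape == xyz_frames.shape
    return np.concatenate([rgb_frames, xyz_frames], axis=-1)

# Example: an 8-view clip at 64x64 resolution (shapes are illustrative only).
clip = make_6d_clip(np.random.rand(8, 64, 64, 3), np.random.randn(8, 64, 64, 3))
print(clip.shape)  # (8, 64, 64, 6)
```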
An interactive viewer for Image-to-3D: select a conditioning image to view the synthesized RGB frames, the synthesized XYZ frames, and the resulting 3D point cloud.
During training, the model learns the joint distribution of RGB and XYZ frames, P(RGB, XYZ). At inference time, however, we can estimate conditional distributions via inpainting, for example P(RGB | XYZ) or P(XYZ | RGB).
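As a concrete illustration, below is a minimal sketch of inpainting-style conditional sampling on the joint RGB-XYZ frames. The plain DDPM schedule, the `eps_model` interface, and the function name are our assumptions, not the paper's exact sampler:

```python
import torch

@torch.no_grad()
def sample_conditional(eps_model, known, known_mask, num_steps=1000):
    """Sketch of conditional sampling (e.g. P(XYZ | RGB)) by inpainting: channels
    marked in `known_mask` are clamped to noised observations at every reverse
    step, while the remaining channels are synthesized by the diffusion model.

    eps_model:  callable (x_t, t) -> predicted noise, same shape as x_t (assumed interface)
    known:      (B, T, 6, H, W) observed frames; unknown entries can be arbitrary
    known_mask: float tensor of the same shape, 1.0 where observed, 0.0 elsewhere
    """
    betas = torch.linspace(1e-4, 0.02, num_steps)      # linear DDPM schedule (assumed)
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn_like(known)
    for t in reversed(range(num_steps)):
        # Noise the observed channels to diffusion time t and overwrite them.
        known_t = alphas_bar[t].sqrt() * known + (1 - alphas_bar[t]).sqrt() * torch.randn_like(known)
        x = known_mask * known_t + (1 - known_mask) * x

        # Standard DDPM reverse update on the full 6-channel tensor.
        eps = eps_model(x, t)
        mean = (x - betas[t] / (1 - alphas_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise

    # Return clean observations in the known channels, samples elsewhere.
    return known_mask * known + (1 - known_mask) * x
```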
This enables the diffusion model to be adapted for a variety of downstream applications in a zero-shot manner:
We estimate the joint distribution of RGB and XYZ, P(RGB, XYZ), given the single conditioning view. Then, we convert the predicted XYZ map of the input view \(\hat{\mathbf{x}}^\textrm{XYZ}_{u, v}\) into a depth map \(\mathbf{d}\) by optimizing: \[ \min_{P, K, \mathbf{d}} \sum_{u, v}\| \tilde{\mathbf{x}}^\textrm{XYZ}_{u, v} - \hat{\mathbf{x}}^\textrm{XYZ}_{u, v} \|^2_2, \] where \(\tilde{\mathbf{x}}^\textrm{XYZ}_{u, v} = P^{-1}K^{-1}\mathbf{d}_{u, v}[u, v, 1]^\text{T}\), and \(P, K\) denote the camera extrinsic and intrinsic matrices.
We estimate the conditional distribution of XYZ, P(XYZ | RGB), given all RGB frames. Then, we convert the predicted XYZ map of the input view \(\hat{\mathbf{x}}^\textrm{XYZ}_{u, v}\) into a depth map \(\mathbf{d}\) by optimizing: \[ \min_{P, K, \mathbf{d}} \sum_{u, v}\| \tilde{\mathbf{x}}^\textrm{XYZ}_{u, v} - \hat{\mathbf{x}}^\textrm{XYZ}_{u, v} \|^2_2, \] where \(\tilde{\mathbf{x}}^\textrm{XYZ}_{u, v} = P^{-1}K^{-1}\mathbf{d}_{u, v}[u, v, 1]^\text{T}\), and \(P, K\) denote the camera extrinsic and intrinsic matrices.
We estimate the conditional distribution of XYZ, P(XYZ | RGB), given all RGB frames. Then, we convert the predicted XYZ map of the input view \(\hat{\mathbf{x}}^\textrm{XYZ}_{u, v}\) into the camera parameters \(P, K\) by optimizing: \[ \min_{P, K, \mathbf{d}} \sum_{u, v}\| \tilde{\mathbf{x}}^\textrm{XYZ}_{u, v} - \hat{\mathbf{x}}^\textrm{XYZ}_{u, v} \|^2_2, \] where \(\tilde{\mathbf{x}}^\textrm{XYZ}_{u, v} = P^{-1}K^{-1}\mathbf{d}_{u, v}[u, v, 1]^\text{T}\), and \(P, K\) denote the camera extrinsic and intrinsic matrices.
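The same least-squares objective appears in all of the applications above. Below is a minimal gradient-descent sketch for a single view; the parameterization of \(P\) (axis-angle rotation plus translation), the shared-focal-length pinhole \(K\) with a centered principal point, the log-scale depth variables, and the use of Adam are our own assumptions, not the paper's implementation:

```python
import torch

def fit_depth_and_camera(xyz_pred, iters=2000, lr=1e-2):
    """Fit a depth map and pinhole camera to one predicted XYZ map by minimizing
    sum_{u,v} || P^{-1} K^{-1} d_{uv} [u, v, 1]^T - xhat^{XYZ}_{uv} ||^2.

    xyz_pred: (H, W, 3) predicted world-space coordinates for one view.
    Returns (depth, focal_length, R, t), where the extrinsic P maps world -> camera.
    """
    H, W, _ = xyz_pred.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")

    log_f = torch.zeros(1, requires_grad=True)        # focal length, log-scale
    omega = torch.zeros(3, requires_grad=True)        # rotation as axis-angle
    trans = torch.zeros(3, requires_grad=True)        # translation
    log_d = torch.zeros(H, W, requires_grad=True)     # per-pixel depth, log-scale

    opt = torch.optim.Adam([log_f, omega, trans, log_d], lr=lr)
    zero = torch.zeros(1)
    for _ in range(iters):
        f = log_f.exp() * max(H, W)
        d = log_d.exp()
        # K^{-1} [u, v, 1]^T scaled by depth: camera-space points.
        x_cam = torch.stack([(u - W / 2) / f * d, (v - H / 2) / f * d, d], dim=-1)
        # P^{-1}: world = R^T (camera - t), with R = exp([omega]_x).
        skew = torch.stack([torch.cat([zero, -omega[2:3], omega[1:2]]),
                            torch.cat([omega[2:3], zero, -omega[0:1]]),
                            torch.cat([-omega[1:2], omega[0:1], zero])])
        R = torch.matrix_exp(skew)
        x_world = (x_cam - trans) @ R                 # row-vector form of R^T (x - t)
        loss = ((x_world - xyz_pred) ** 2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return log_d.exp().detach(), log_f.exp().detach() * max(H, W), R.detach(), trans.detach()
```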
To realize camera-controlled video generation, we follow the steps below:
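As a rough, hedged sketch of one way the pieces above can combine for camera control (the z-buffer point rendering, the function names, and the overall recipe here are our assumptions, not the paper's exact procedure): render XYZ maps of the reconstructed point cloud along a user-chosen camera trajectory, then sample P(RGB | XYZ) by inpainting with `sample_conditional` above, marking the XYZ channels (and the input view's RGB) as known.

```python
import torch

def render_xyz_frames(points, K, Rs, ts, H, W):
    """Rasterize a world-space point cloud into per-view XYZ maps with a simple z-buffer.

    points: (N, 3) world coordinates (e.g. the reconstructed point cloud)
    K:      (3, 3) intrinsics; Rs, ts: (V, 3, 3) and (V, 3) world-to-camera poses
    Returns (V, H, W, 3) XYZ maps (zeros where no point projects).
    """
    V = Rs.shape[0]
    frames = torch.zeros(V, H, W, 3)
    for i in range(V):
        cam = points @ Rs[i].T + ts[i]                     # world -> camera
        z = cam[:, 2]
        keep = z > 1e-6                                    # points in front of the camera
        uv = (cam[keep] / z[keep, None]) @ K.T             # perspective projection
        u = uv[:, 0].round().long()
        v = uv[:, 1].round().long()
        inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
        u, v = u[inside], v[inside]
        depth, xyz = z[keep][inside], points[keep][inside]
        # Simple painter's-style fill: write far-to-near so, in sequential (CPU)
        # execution, the nearest point ends up stored at each pixel.
        order = torch.argsort(depth, descending=True)
        frames[i, v[order], u[order]] = xyz[order]
    return frames
```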
@inproceedings{zhang2024world,
author = {Qihang Zhang and Shuangfei Zhai and Miguel Ángel Bautista and Kevin Miao and Alexander Toshev and Joshua Susskind and Jiatao Gu},
title = {World-consistent Video Diffusion with Explicit 3D Modeling},
booktitle = {arXiv},
year = {2024}
}
We borrow the source code of this website from GeNVS.