---
license: mit
tags:
  - novel-view-synthesis
  - image-to-image
  - computer-vision
  - pytorch
  - model_hub_mixin
  - pytorch_model_hub_mixin
pipeline_tag: image-to-image
arxiv: 2603.23488
---

# OVIE — One View Is Enough!

**Monocular Training for In-the-Wild Novel View Generation**

[![Project Page](https://img.shields.io/badge/Project_Page-green?logo=googlechrome&logoColor=white)](https://kyutai.org/blog/2026-04-14-ovie)
[![Paper](https://img.shields.io/badge/arXiv-2603.23488-red?logo=arxiv)](https://arxiv.org/abs/2603.23488)
[![GitHub](https://img.shields.io/badge/GitHub-AdrienRR%2Fovie-black?logo=github)](https://github.com/AdrienRR/ovie)
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/AdrienRR/ovie/blob/main/LICENSE)

OVIE is a novel view synthesis model that generates a new viewpoint of a scene from a **single image** and a **target camera pose**. Unlike most prior work, it is trained entirely on **unpaired in-the-wild images**, with no multi-view supervision.

![OVIE teaser](https://raw.githubusercontent.com/AdrienRR/ovie/main/assets/teaser.jpeg)

---

## Model architecture

OVIE is a convolutional encoder–decoder with a Vision Transformer (ViT) bottleneck conditioned on camera parameters via adaptive layer normalisation (AdaLN):

- **Encoder**: cascaded downsampling ConvBlocks (3 scales)
- **Bottleneck**: 12-layer ViT (hidden size 768, 12 heads) with AdaLN camera conditioning
- **Decoder**: cascaded upsampling ConvBlocks (3 scales)
- **Camera conditioning**: a 7-dimensional pose encoding (rotation + translation) projected into the ViT hidden space
- **Parameters**: ~143M
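
To make the conditioning mechanism concrete, here is a minimal sketch of a single AdaLN-modulated transformer block using the sizes listed above (hidden size 768, 12 heads, 7-dimensional camera input). The class name `AdaLNBlock`, the layer layout, and the shift/scale parameterisation are illustrative assumptions, not the actual OVIE implementation; see the repository for the real code.

```python
import torch
import torch.nn as nn


class AdaLNBlock(nn.Module):
    """Hypothetical transformer block whose LayerNorm shift/scale come from the camera."""

    def __init__(self, hidden_size: int = 768, num_heads: int = 12, cam_dim: int = 7):
        super().__init__()
        # Project the 7-D pose encoding into the ViT hidden space.
        self.cam_proj = nn.Linear(cam_dim, hidden_size)
        # Predict a per-channel shift and scale for each of the two sub-layers.
        self.modulation = nn.Sequential(nn.SiLU(), nn.Linear(hidden_size, 4 * hidden_size))
        # The norms carry no learned affine; the camera supplies it instead.
        self.norm1 = nn.LayerNorm(hidden_size, elementwise_affine=False)
        self.norm2 = nn.LayerNorm(hidden_size, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )

    def forward(self, tokens: torch.Tensor, cam_params: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, hidden_size); cam_params: (B, cam_dim)
        cond = self.cam_proj(cam_params)
        shift1, scale1, shift2, scale2 = self.modulation(cond).unsqueeze(1).chunk(4, dim=-1)
        h = self.norm1(tokens) * (1 + scale1) + shift1
        tokens = tokens + self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(tokens) * (1 + scale2) + shift2
        return tokens + self.mlp(h)


block = AdaLNBlock()
tokens = torch.randn(2, 16, 768)  # 2 images, 16 bottleneck tokens each
cam = torch.randn(2, 7)           # one pose encoding per image
out = block(tokens, cam)          # same shape as `tokens`
```

Because the camera only enters through the normalisation statistics, the same image tokens produce different outputs for different target poses without adding extra tokens to the sequence.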

---

## Usage

```python
import torch
from models.models import OVIEModel
from utils.pose_enc import extri_intri_to_pose_encoding
from torchvision.transforms import ToTensor
from PIL import Image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model
model = OVIEModel.from_pretrained("kyutai/ovie", revision="v1.0").to(device)
model.eval()
image_size = model.image_size  # 256

# Prepare input image
img_pil = Image.open("image.jpg").convert("RGB").resize((image_size, image_size))
img_tensor = ToTensor()(img_pil).unsqueeze(0).to(device)

# Define the target camera pose as 3x4 extrinsics [R | t], shape (1, 3, 4)
extrinsics = torch.tensor([[[1.0, 0.0, 0.0, -1.25],
                            [0.0, 1.0, 0.0,  0.5],
                            [0.0, 0.0, 1.0, -2.0]]], device=device)
# Placeholder intrinsics: only the first 7 pose-encoding entries are kept below
dummy_intrinsics = torch.zeros(1, 1, 3, 3, device=device)

camera = extri_intri_to_pose_encoding(
    extrinsics=extrinsics.unsqueeze(0),  # (1, 1, 3, 4)
    intrinsics=dummy_intrinsics,
    image_size_hw=(image_size, image_size),
)
cam_token = camera[..., :7].squeeze(0)

# Generate novel view
with torch.no_grad():
    pred = model(x=img_tensor, cam_params=cam_token)
# pred: (1, 3, 256, 256) tensor in [0, 1]
```
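
The example above uses an identity rotation with a pure translation. If you also want to rotate the camera, the following sketch builds the 3x4 `[R | t]` rows for a rotation about the y-axis. The helper `yaw_translation_extrinsics` is hypothetical and not part of the repository. Whether the model expects world-to-camera or camera-to-world matrices, and which axis convention applies, is not stated on this page, so treat those as assumptions and check the repository notebooks.

```python
import math


def yaw_translation_extrinsics(yaw_deg, translation):
    """Return 3x4 [R | t] rows for a rotation about the y-axis plus a translation."""
    yaw = math.radians(yaw_deg)
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    return [
        [c, 0.0, s, tx],
        [0.0, 1.0, 0.0, ty],
        [-s, 0.0, c, tz],
    ]


# Zero yaw with the translation from the usage example gives the same matrix.
rows = yaw_translation_extrinsics(0.0, (-1.25, 0.5, -2.0))

# A 15-degree yaw with a small sideways shift (values chosen arbitrarily).
rotated = yaw_translation_extrinsics(15.0, (-0.5, 0.0, 0.0))
```

The result can replace the hard-coded matrix via `torch.tensor([rotated], device=device)`, which has the same `(1, 3, 4)` shape.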

See the repository for full installation instructions and example notebooks:
- `inference_huggingface.ipynb` — loads directly from this Hub page
- `inference_local.ipynb` — loads from a local checkpoint

---

## Training

OVIE is trained on a diverse mix of in-the-wild internet images (ImageNet, Places365, OSV5M, OpenImages) with **no multi-view pairs**. Training uses a combination of L2 reconstruction loss, LPIPS perceptual loss, and an adversarial loss with a DINO-based discriminator. Camera poses are sampled synthetically from a distribution of plausible viewpoint changes.
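
As a rough illustration of how the three terms could be combined on the generator side, consider the sketch below. The loss weights, the non-saturating form of the adversarial term, and the function name `generator_loss` are all assumptions made for illustration; this page does not specify them, nor how the supervision image `target` is constructed. The perceptual function and discriminator are passed in as callables, and toy stand-ins replace LPIPS and the DINO-based discriminator so the snippet runs without extra dependencies.

```python
import torch
import torch.nn.functional as F


def generator_loss(pred, target, perceptual_fn, discriminator,
                   w_l2=1.0, w_perc=1.0, w_adv=0.1):
    """Weighted sum of L2, perceptual, and adversarial terms (weights are placeholders)."""
    l2 = F.mse_loss(pred, target)
    perc = perceptual_fn(pred, target).mean()
    # Assumed non-saturating GAN loss: push discriminator logits on generated images up.
    adv = F.softplus(-discriminator(pred)).mean()
    total = w_l2 * l2 + w_perc * perc + w_adv * adv
    return total, {"l2": l2, "perceptual": perc, "adversarial": adv}


def toy_perceptual(a, b):
    # Stand-in for LPIPS: per-image mean absolute difference.
    return (a - b).abs().mean(dim=(1, 2, 3))


# Stand-in for the DINO-based discriminator: one logit per image.
toy_disc = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 1))

pred = torch.rand(2, 3, 8, 8, requires_grad=True)
target = torch.rand(2, 3, 8, 8)
total, parts = generator_loss(pred, target, toy_perceptual, toy_disc)
total.backward()  # gradients flow back to `pred` through all three terms
```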

---

## Evaluation

The model is evaluated on DL3DV and Real Estate 10K (RE10K) using PSNR, SSIM, and LPIPS. See the [paper](https://arxiv.org/abs/2603.23488) for full quantitative results.
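
For reference, PSNR on outputs in `[0, 1]` (the range of `pred` in the usage example) can be computed as in this minimal sketch. The exact evaluation protocol (resolution, cropping, averaging) is defined in the paper and is not reproduced here; the function name `psnr` is illustrative.

```python
import torch


def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Per-image PSNR in dB for (B, C, H, W) tensors with values in [0, max_val]."""
    mse = ((pred - target) ** 2).mean(dim=(1, 2, 3))
    return 10.0 * torch.log10(max_val ** 2 / mse)


target = torch.full((1, 3, 4, 4), 0.5)
pred = target + 0.1  # constant error of 0.1, so MSE is about 0.01
score = psnr(pred, target)  # approximately 20 dB
```

Identical images give an MSE of zero and therefore an infinite PSNR, so evaluation code usually needs to handle that case explicitly.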

---

## Citation

```bibtex
@misc{ovie2026,
      title={One View Is Enough! Monocular Training for In-the-Wild Novel View Generation},
      author={Adrien Ramanana Rahary and Nicolas Dufour and Patrick Perez and David Picard},
      year={2026},
      eprint={2603.23488},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.23488},
}
```