---
license: mit
tags:
- novel-view-synthesis
- image-to-image
- computer-vision
- pytorch
- model_hub_mixin
- pytorch_model_hub_mixin
pipeline_tag: image-to-image
arxiv: 2603.23488
---
# OVIE — One View Is Enough!
**Monocular Training for In-the-Wild Novel View Generation**
[Blog post](https://kyutai.org/blog/2026-04-14-ovie)
[Paper (arXiv)](https://arxiv.org/abs/2603.23488)
[Code](https://github.com/AdrienRR/ovie)
[License](https://github.com/AdrienRR/ovie/blob/main/LICENSE)
OVIE is a novel view synthesis model that generates a new viewpoint of a scene from a **single image** and a **target camera pose**. Unlike most prior work, it is trained entirely on **unpaired in-the-wild images** — no multi-view supervision required.

---
## Model architecture
OVIE is a convolutional encoder–decoder with a Vision Transformer (ViT) bottleneck conditioned on camera parameters via adaptive layer normalisation (AdaLN):
- **Encoder**: cascaded downsampling ConvBlocks (3 scales)
- **Bottleneck**: 12-layer ViT (hidden size 768, 12 heads) with AdaLN camera conditioning
- **Decoder**: cascaded upsampling ConvBlocks (3 scales)
- **Camera conditioning**: a 7-dimensional pose encoding (rotation + translation) projected into the ViT hidden space
- **Parameters**: ~143M
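
To make the AdaLN camera conditioning listed above more concrete, here is a minimal PyTorch sketch of one conditioned attention block. It is not the OVIE implementation: the class name `AdaLNBlock`, the single linear layer that produces scale and shift, and the placement of the modulation before self-attention are assumptions for illustration. Only the hidden size (768), the head count (12), and the 7-dimensional pose encoding come from the description above.

```python
import torch
import torch.nn as nn


class AdaLNBlock(nn.Module):
    """Hypothetical transformer block with AdaLN camera conditioning."""

    def __init__(self, hidden_size: int = 768, num_heads: int = 12, cam_dim: int = 7):
        super().__init__()
        # Project the 7-D pose encoding into the ViT hidden space.
        self.cam_proj = nn.Linear(cam_dim, hidden_size)
        # LayerNorm without learned affine parameters: scale and shift
        # are predicted from the camera embedding instead.
        self.norm = nn.LayerNorm(hidden_size, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(hidden_size, 2 * hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor, cam: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, hidden_size), cam: (B, cam_dim)
        cond = self.cam_proj(cam)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        # Modulate the normalised tokens with the per-sample scale and shift.
        h = self.norm(tokens) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        return tokens + attn_out  # residual connection


block = AdaLNBlock()
tokens = torch.randn(2, 16, 768)  # 2 images, 16 tokens each
cam = torch.randn(2, 7)           # one pose encoding per image
out = block(tokens, cam)
print(out.shape)  # torch.Size([2, 16, 768])
```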
---
## Usage
```python
import torch
from models.models import OVIEModel
from utils.pose_enc import extri_intri_to_pose_encoding
from torchvision.transforms import ToTensor
from PIL import Image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model
model = OVIEModel.from_pretrained("kyutai/ovie", revision="v1.0").to(device)
model.eval()
image_size = model.image_size  # 256

# Prepare input image
img_pil = Image.open("image.jpg").convert("RGB").resize((image_size, image_size))
img_tensor = ToTensor()(img_pil).unsqueeze(0).to(device)

# Define target camera pose (3x4 extrinsics)
extrinsics = torch.tensor([[[1.0, 0.0, 0.0, -1.25],
                            [0.0, 1.0, 0.0, 0.5],
                            [0.0, 0.0, 1.0, -2.0]]], device=device)
# Placeholder intrinsics; only the first 7 values of the encoding are kept below
dummy_intrinsics = torch.zeros(1, 1, 3, 3, device=device)
camera = extri_intri_to_pose_encoding(
    extrinsics=extrinsics.unsqueeze(0),
    intrinsics=dummy_intrinsics,
    image_size_hw=(image_size, image_size),
)
cam_token = camera[..., :7].squeeze(0)

# Generate novel view
with torch.no_grad():
    pred = model(x=img_tensor, cam_params=cam_token)
# pred: (1, 3, 256, 256) tensor in [0, 1]
```
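
To try other viewpoints, the hard-coded `extrinsics` matrix can be built from a rotation and a translation instead. The helper below is a sketch: `make_extrinsics` is a hypothetical name, and the choice of a yaw rotation about the y axis (as well as the coordinate convention it implies) is an assumption, so check the repository for the convention the model expects. With a yaw of 0 it reproduces the matrix used in the example above.

```python
import math

import torch


def make_extrinsics(yaw_deg: float, translation: tuple[float, float, float]) -> torch.Tensor:
    """Hypothetical helper: build a (1, 3, 4) [R | t] matrix from a yaw angle and a translation."""
    yaw = math.radians(yaw_deg)
    c, s = math.cos(yaw), math.sin(yaw)
    # Rotation about the y axis (assumed here to be the vertical axis).
    rotation = torch.tensor([[c, 0.0, s],
                             [0.0, 1.0, 0.0],
                             [-s, 0.0, c]])
    t = torch.tensor(translation).unsqueeze(1)            # (3, 1)
    return torch.cat([rotation, t], dim=1).unsqueeze(0)   # (1, 3, 4)


# Same pose as the usage example: identity rotation, translation (-1.25, 0.5, -2.0).
extrinsics = make_extrinsics(yaw_deg=0.0, translation=(-1.25, 0.5, -2.0))
```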
See the repository for full installation instructions and example notebooks:
- `inference_huggingface.ipynb` — loads directly from this Hub page
- `inference_local.ipynb` — loads from a local checkpoint
---
## Training
OVIE is trained on a diverse mix of in-the-wild internet images (ImageNet, Places365, OSV5M, OpenImages) with **no multi-view pairs**. Training uses a combination of L2 reconstruction loss, LPIPS perceptual loss, and an adversarial loss with a DINO-based discriminator. Camera poses are sampled synthetically from a distribution of plausible viewpoint changes.
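
The three loss terms can be pictured as a weighted sum. The sketch below only illustrates that structure and is not the training code: the function name `generator_loss`, the weights `w_lpips` and `w_adv`, and the non-saturating form of the adversarial term are all assumptions, and `lpips_fn` stands for any callable that returns a per-image perceptual distance. How the reconstruction target is obtained without multi-view pairs is covered in the paper and is outside this sketch.

```python
import torch
import torch.nn.functional as F


def generator_loss(pred, target, lpips_fn, disc_logits_fake, w_lpips=1.0, w_adv=0.1):
    """Hypothetical weighted sum of L2, perceptual, and adversarial terms."""
    l2 = F.mse_loss(pred, target)
    perceptual = lpips_fn(pred, target).mean()
    # Non-saturating generator loss (an assumption): reward high
    # discriminator logits on generated images.
    adversarial = F.softplus(-disc_logits_fake).mean()
    return l2 + w_lpips * perceptual + w_adv * adversarial


pred = torch.rand(2, 3, 64, 64)
target = torch.rand(2, 3, 64, 64)
# Placeholder distance so the example runs without extra dependencies; not real LPIPS.
stand_in_lpips = lambda a, b: (a - b).abs().mean(dim=(1, 2, 3))
loss = generator_loss(pred, target, stand_in_lpips, disc_logits_fake=torch.zeros(2))
```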
---
## Evaluation
The model is evaluated on DL3DV and Real Estate 10K (RE10K) using PSNR, SSIM, and LPIPS. See the [paper](https://arxiv.org/abs/2603.23488) for full quantitative results.
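
Of the three metrics, PSNR is simple enough to compute directly on the model output: for images in [0, 1] it is 10 · log10(1 / MSE). The function below is a minimal sketch with a hypothetical name (`psnr`), not the paper's evaluation script. SSIM and LPIPS need dedicated implementations and are not shown. Identical images give an MSE of zero and therefore an infinite PSNR.

```python
import torch


def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Per-image PSNR in dB for (B, C, H, W) tensors with values in [0, max_val]."""
    mse = ((pred - target) ** 2).mean(dim=(1, 2, 3))
    return 10.0 * torch.log10(max_val ** 2 / mse)


target = torch.full((1, 3, 8, 8), 0.5)
pred = target + 0.1  # constant error of 0.1, so MSE = 0.01
print(psnr(pred, target))  # approximately 20 dB
```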
---
## Citation
```bibtex
@misc{ovie2026,
title={One View Is Enough! Monocular Training for In-the-Wild Novel View Generation},
author={Adrien Ramanana Rahary and Nicolas Dufour and Patrick Perez and David Picard},
year={2026},
eprint={2603.23488},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.23488},
}
```