| --- |
| license: mit |
| tags: |
| - novel-view-synthesis |
| - image-to-image |
| - computer-vision |
| - pytorch |
| - model_hub_mixin |
| - pytorch_model_hub_mixin |
| pipeline_tag: image-to-image |
| arxiv: 2603.23488 |
| --- |
| |
| # OVIE — One View Is Enough! |
|
|
| **Monocular Training for In-the-Wild Novel View Generation** |
|
|
| [](https://kyutai.org/blog/2026-04-14-ovie) |
| [](https://arxiv.org/abs/2603.23488) |
| [](https://github.com/AdrienRR/ovie) |
| [](https://github.com/AdrienRR/ovie/blob/main/LICENSE) |
|
|
| OVIE is a novel view synthesis model that generates a new viewpoint of a scene from a **single image** and a **target camera pose**. Unlike most prior work, it is trained entirely on **unpaired in-the-wild images** — no multi-view supervision required. |
|
|
|  |
|
|
| --- |
|
|
| ## Model architecture |
|
|
| OVIE is a convolutional encoder–decoder with a Vision Transformer (ViT) bottleneck conditioned on camera parameters via adaptive layer normalisation (AdaLN): |
|
|
| - **Encoder**: cascaded downsampling ConvBlocks (3 scales) |
| - **Bottleneck**: 12-layer ViT (hidden size 768, 12 heads) with AdaLN camera conditioning |
| - **Decoder**: cascaded upsampling ConvBlocks (3 scales) |
| - **Camera conditioning**: a 7-dimensional pose encoding (rotation + translation) projected into the ViT hidden space |
| - **Parameters**: ~143M |
|
|
| --- |
|
|
| ## Usage |
|
|
| ```python |
| import torch |
| from models.models import OVIEModel |
| from utils.pose_enc import extri_intri_to_pose_encoding |
| from torchvision.transforms import ToTensor |
| from PIL import Image |
| |
| device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
| |
| # Load model |
| model = OVIEModel.from_pretrained("kyutai/ovie", revision="v1.0").to(device) |
| model.eval() |
| image_size = model.image_size # 256 |
| |
| # Prepare input image |
| img_pil = Image.open("image.jpg").convert("RGB").resize((image_size, image_size)) |
| img_tensor = ToTensor()(img_pil).unsqueeze(0).to(device) |
| |
| # Define target camera pose (3x4 extrinsics) |
| extrinsics = torch.tensor([[[1.0, 0.0, 0.0, -1.25], |
| [0.0, 1.0, 0.0, 0.5], |
| [0.0, 0.0, 1.0, -2.0]]], device=device) |
| dummy_intrinsics = torch.zeros(1, 1, 3, 3, device=device) |
| |
| camera = extri_intri_to_pose_encoding( |
| extrinsics=extrinsics.unsqueeze(0), |
| intrinsics=dummy_intrinsics, |
| image_size_hw=(image_size, image_size), |
| ) |
| cam_token = camera[..., :7].squeeze(0) |
| |
| # Generate novel view |
| with torch.no_grad(): |
| pred = model(x=img_tensor, cam_params=cam_token) |
| # pred: (1, 3, 256, 256) tensor in [0, 1] |
| ``` |
|
|
| See the repository for full installation instructions and example notebooks: |
| - `inference_huggingface.ipynb` — loads directly from this Hub page |
| - `inference_local.ipynb` — loads from a local checkpoint |
|
|
| --- |
|
|
| ## Training |
|
|
| OVIE is trained on a diverse mix of in-the-wild internet images (ImageNet, Places365, OSV5M, OpenImages) with **no multi-view pairs**. Training uses a combination of L2 reconstruction loss, LPIPS perceptual loss, and an adversarial loss with a DINO-based discriminator. Camera poses are sampled synthetically from a distribution of plausible viewpoint changes. |
|
|
| --- |
|
|
| ## Evaluation |
|
|
| The model is evaluated on DL3DV and Real Estate 10K (RE10K) using PSNR, SSIM, and LPIPS. See the [paper](https://arxiv.org/abs/2603.23488) for full quantitative results. |
|
|
| --- |
|
|
| ## Citation |
|
|
| ```bibtex |
| @misc{ovie2026, |
| title={One View Is Enough! Monocular Training for In-the-Wild Novel View Generation}, |
| author={Adrien Ramanana Rahary and Nicolas Dufour and Patrick Perez and David Picard}, |
| year={2026}, |
| eprint={2603.23488}, |
| archivePrefix={arXiv}, |
| primaryClass={cs.CV}, |
| url={https://arxiv.org/abs/2603.23488}, |
| } |
| ``` |
|
|