---
license: mit
tags:
- novel-view-synthesis
- image-to-image
- computer-vision
- pytorch
- model_hub_mixin
- pytorch_model_hub_mixin
pipeline_tag: image-to-image
arxiv: 2603.23488
---

# OVIE: One View Is Enough!

**Monocular Training for In-the-Wild Novel View Generation**

[![Project Page](https://img.shields.io/badge/Project_Page-green?logo=googlechrome&logoColor=white)](https://kyutai.org/blog/2026-04-14-ovie)
[![Paper](https://img.shields.io/badge/arXiv-2603.23488-red?logo=arxiv)](https://arxiv.org/abs/2603.23488)
[![GitHub](https://img.shields.io/badge/GitHub-AdrienRR%2Fovie-black?logo=github)](https://github.com/AdrienRR/ovie)
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/AdrienRR/ovie/blob/main/LICENSE)

OVIE is a novel view synthesis model that generates a new viewpoint of a scene from a **single image** and a **target camera pose**. Unlike most prior work, it is trained entirely on **unpaired in-the-wild images**, with no multi-view supervision required.

![OVIE teaser](https://raw.githubusercontent.com/AdrienRR/ovie/main/assets/teaser.jpeg)

---

## Model architecture

OVIE is a convolutional encoder–decoder with a Vision Transformer (ViT) bottleneck conditioned on camera parameters via adaptive layer normalisation (AdaLN):

- **Encoder**: cascaded downsampling ConvBlocks (3 scales)
- **Bottleneck**: 12-layer ViT (hidden size 768, 12 heads) with AdaLN camera conditioning
- **Decoder**: cascaded upsampling ConvBlocks (3 scales)
- **Camera conditioning**: a 7-dimensional pose encoding (rotation + translation) projected into the ViT hidden space
- **Parameters**: ~143M

---

## Usage

```python
import torch
from models.models import OVIEModel
from utils.pose_enc import extri_intri_to_pose_encoding
from torchvision.transforms import ToTensor
from PIL import Image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model
model = OVIEModel.from_pretrained("kyutai/ovie", revision="v1.0").to(device)
model.eval()
image_size = model.image_size  # 256

# Prepare input image
img_pil = Image.open("image.jpg").convert("RGB").resize((image_size, image_size))
img_tensor = ToTensor()(img_pil).unsqueeze(0).to(device)

# Define target camera pose (3x4 extrinsics)
extrinsics = torch.tensor([[[1.0, 0.0, 0.0, -1.25],
                            [0.0, 1.0, 0.0,  0.5],
                            [0.0, 0.0, 1.0, -2.0]]], device=device)
dummy_intrinsics = torch.zeros(1, 1, 3, 3, device=device)

camera = extri_intri_to_pose_encoding(
    extrinsics=extrinsics.unsqueeze(0),
    intrinsics=dummy_intrinsics,
    image_size_hw=(image_size, image_size),
)
cam_token = camera[..., :7].squeeze(0)

# Generate novel view
with torch.no_grad():
    pred = model(x=img_tensor, cam_params=cam_token)
# pred: (1, 3, 256, 256) tensor in [0, 1]
```

See the repository for full installation instructions and example notebooks:

- `inference_huggingface.ipynb`: loads directly from this Hub page
- `inference_local.ipynb`: loads from a local checkpoint

---

## Training

OVIE is trained on a diverse mix of in-the-wild
internet images (ImageNet, Places365, OSV5M, OpenImages) with **no multi-view pairs**. Training uses a combination of L2 reconstruction loss, LPIPS perceptual loss, and an adversarial loss with a DINO-based discriminator. Camera poses are sampled synthetically from a distribution of plausible viewpoint changes.

---

## Evaluation

The model is evaluated on DL3DV and Real Estate 10K (RE10K) using PSNR, SSIM, and LPIPS. See the [paper](https://arxiv.org/abs/2603.23488) for full quantitative results.

---

## Citation

```bibtex
@misc{ovie2026,
  title={One View Is Enough! Monocular Training for In-the-Wild Novel View Generation},
  author={Adrien Ramanana Rahary and Nicolas Dufour and Patrick Perez and David Picard},
  year={2026},
  eprint={2603.23488},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.23488},
}
```
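
---

As a companion to the Evaluation section, here is a minimal sketch of how PSNR can be computed for images with values in [0, 1], the range the usage example reports for `pred`. It uses the standard definition, `10 * log10(max_val^2 / MSE)`, and is not taken from the OVIE codebase: the `psnr` helper and the synthetic arrays are hypothetical, and the paper's actual evaluation protocol (resolution, cropping, averaging over scenes) may differ.

```python
import numpy as np


def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two images of identical shape."""
    if pred.shape != target.shape:
        raise ValueError(f"shape mismatch: {pred.shape} vs {target.shape}")
    # Compute the mean squared error in float64 to limit rounding error.
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))


# Toy check: synthetic arrays stand in for a predicted and a ground-truth view.
target = np.full((3, 256, 256), 0.5)
pred = target + 0.1  # constant error of 0.1, so MSE is 0.01
print(f"{psnr(pred, target):.2f} dB")  # prints "20.00 dB"
```

To score an actual model output, the tensor would first need converting to an array, for example `pred.squeeze(0).cpu().numpy()`, and comparing against a ground-truth view of the same shape and value range.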