	---
	license: mit
	tags:
	- novel-view-synthesis
	- image-to-image
	- computer-vision
	- pytorch
	- model_hub_mixin
	- pytorch_model_hub_mixin
	pipeline_tag: image-to-image
	arxiv: 2603.23488
	---

	# OVIE — One View Is Enough!

	Monocular Training for In-the-Wild Novel View Generation

	[![Project Page](https://img.shields.io/badge/Project_Page-green?logo=googlechrome&logoColor=white)](https://kyutai.org/blog/2026-04-14-ovie)
	[![Paper](https://img.shields.io/badge/arXiv-2603.23488-red?logo=arxiv)](https://arxiv.org/abs/2603.23488)
	[![GitHub](https://img.shields.io/badge/GitHub-AdrienRR%2Fovie-black?logo=github)](https://github.com/AdrienRR/ovie)
	[![License](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/AdrienRR/ovie/blob/main/LICENSE)

	OVIE is a novel view synthesis model: given a single image and a target camera pose, it generates a view of the scene from that new viewpoint. Unlike most prior work, it is trained entirely on unpaired in-the-wild images, so no multi-view supervision is required.

	![OVIE teaser](https://raw.githubusercontent.com/AdrienRR/ovie/main/assets/teaser.jpeg)

	---

	## Model architecture

	OVIE is a convolutional encoder–decoder with a Vision Transformer (ViT) bottleneck conditioned on camera parameters via adaptive layer normalisation (AdaLN):

	- Encoder: cascaded downsampling ConvBlocks (3 scales)
	- Bottleneck: 12-layer ViT (hidden size 768, 12 heads) with AdaLN camera conditioning
	- Decoder: cascaded upsampling ConvBlocks (3 scales)
	- Camera conditioning: a 7-dimensional pose encoding (rotation + translation) projected into the ViT hidden space
	- Parameters: ~143M
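
To make the AdaLN conditioning concrete, here is a minimal PyTorch sketch of one way a camera token can modulate a transformer block's normalisation. It is an illustration under stated assumptions, not the OVIE implementation: the class name `AdaLNSketch`, the SiLU plus linear projection, and the zero initialisation are hypothetical choices borrowed from common AdaLN designs. Only the hidden size (768) and the 7-dimensional pose encoding come from the list above.

```python
import torch
import torch.nn as nn


class AdaLNSketch(nn.Module):
    """Layer norm whose scale and shift are predicted from a conditioning vector."""

    def __init__(self, hidden_size: int = 768, cond_dim: int = 7):
        super().__init__()
        # Hypothetical: embed the 7-D pose encoding into the ViT hidden space.
        self.embed = nn.Linear(cond_dim, hidden_size)
        # No learned affine here; scale and shift come from the condition instead.
        self.norm = nn.LayerNorm(hidden_size, elementwise_affine=False)
        self.to_scale_shift = nn.Sequential(
            nn.SiLU(), nn.Linear(hidden_size, 2 * hidden_size)
        )
        # Zero init so the layer starts as a plain LayerNorm (assumed, common choice).
        nn.init.zeros_(self.to_scale_shift[1].weight)
        nn.init.zeros_(self.to_scale_shift[1].bias)

    def forward(self, tokens: torch.Tensor, cam: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, hidden_size), cam: (B, cond_dim)
        scale, shift = self.to_scale_shift(self.embed(cam)).chunk(2, dim=-1)
        return self.norm(tokens) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)


layer = AdaLNSketch()
tokens = torch.randn(2, 256, 768)  # token count of 256 is illustrative only
cam = torch.randn(2, 7)
out = layer(tokens, cam)
print(out.shape)
```

The output keeps the shape of `tokens`; the camera vector only changes the per-channel scale and shift applied after normalisation.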

	---

	## Usage

	```python
	import torch
	from models.models import OVIEModel
	from utils.pose_enc import extri_intri_to_pose_encoding
	from torchvision.transforms import ToTensor
	from PIL import Image

	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

	# Load model
	model = OVIEModel.from_pretrained("kyutai/ovie", revision="v1.0").to(device)
	model.eval()
	image_size = model.image_size  # 256

	# Prepare input image
	img_pil = Image.open("image.jpg").convert("RGB").resize((image_size, image_size))
	img_tensor = ToTensor()(img_pil).unsqueeze(0).to(device)

	# Define target camera pose (3x4 extrinsics)
	extrinsics = torch.tensor([[[1.0, 0.0, 0.0, -1.25],
	                            [0.0, 1.0, 0.0, 0.5],
	                            [0.0, 0.0, 1.0, -2.0]]], device=device)
	# Placeholder intrinsics: only the first 7 pose dimensions are kept below
	dummy_intrinsics = torch.zeros(1, 1, 3, 3, device=device)

	camera = extri_intri_to_pose_encoding(
	    extrinsics=extrinsics.unsqueeze(0),
	    intrinsics=dummy_intrinsics,
	    image_size_hw=(image_size, image_size),
	)
	cam_token = camera[..., :7].squeeze(0)

	# Generate novel view
	with torch.no_grad():
	    pred = model(x=img_tensor, cam_params=cam_token)
	# pred: (1, 3, 256, 256) tensor in [0, 1]
	```
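
The hard-coded matrix in the snippet above is an identity rotation with a translation. For other viewpoints, a small helper that assembles the 3x4 `[R | t]` matrix can be convenient. The sketch below uses only the standard library. `make_extrinsics` is a hypothetical helper, not part of the repository, and both the axis convention (yaw about the y axis) and the direction of the transform (world-to-camera or camera-to-world) are assumptions that should be checked against the repository's notebooks.

```python
import math


def make_extrinsics(yaw_deg, translation):
    """Return a 3x4 [R | t] matrix as nested lists.

    R is a rotation of `yaw_deg` degrees about the y axis (assumed convention);
    `translation` is an (x, y, z) triple.
    """
    a = math.radians(yaw_deg)
    c, s = math.cos(a), math.sin(a)
    rotation = [
        [c, 0.0, s],
        [0.0, 1.0, 0.0],
        [-s, 0.0, c],
    ]
    # Append the translation component to each rotation row.
    return [row + [float(t)] for row, t in zip(rotation, translation)]


# Zero yaw gives the same values as the matrix in the usage snippet.
pose = make_extrinsics(0.0, (-1.25, 0.5, -2.0))
for row in pose:
    print(row)
```

Wrapping the result as `torch.tensor([pose], device=device)` yields a `(1, 3, 4)` tensor with the same layout as `extrinsics` in the usage snippet.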

	See the [repository](https://github.com/AdrienRR/ovie) for full installation instructions and example notebooks:
	- `inference_huggingface.ipynb` — loads directly from this Hub page
	- `inference_local.ipynb` — loads from a local checkpoint

	---

	## Training

	OVIE is trained on a diverse mix of in-the-wild internet images (ImageNet, Places365, OSV5M, OpenImages) with no multi-view pairs. Training uses a combination of L2 reconstruction loss, LPIPS perceptual loss, and an adversarial loss with a DINO-based discriminator. Camera poses are sampled synthetically from a distribution of plausible viewpoint changes.
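
As a rough illustration of how the three loss terms might be combined, the sketch below forms a weighted sum in PyTorch. Everything specific here is an assumption: the function name `generator_loss`, the weights, and the non-saturating form of the adversarial term are placeholders, and the way the target view is obtained during monocular training is not modelled. The paper is the reference for the actual objective.

```python
import torch
import torch.nn.functional as F


def generator_loss(pred, target, lpips_fn, disc_fn,
                   w_l2=1.0, w_lpips=1.0, w_adv=0.1):
    """Weighted sum of reconstruction, perceptual and adversarial terms.

    lpips_fn(pred, target) and disc_fn(pred) are caller-supplied callables;
    the weights are placeholders, not the values used to train OVIE.
    """
    l2 = F.mse_loss(pred, target)
    perceptual = lpips_fn(pred, target).mean()
    # Assumed non-saturating generator term on the discriminator logits.
    adversarial = F.softplus(-disc_fn(pred)).mean()
    return w_l2 * l2 + w_lpips * perceptual + w_adv * adversarial


# Stand-ins so the sketch runs without extra dependencies.
dummy_lpips = lambda a, b: (a - b).abs().mean(dim=(1, 2, 3))
dummy_disc = lambda x: x.mean(dim=(1, 2, 3))

pred = torch.rand(2, 3, 256, 256, requires_grad=True)
target = torch.rand(2, 3, 256, 256)
loss = generator_loss(pred, target, dummy_lpips, dummy_disc)
loss.backward()
print(loss.item())
```

In practice `dummy_lpips` would be replaced by a real LPIPS network and `dummy_disc` by the DINO-based discriminator mentioned above.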

	---

	## Evaluation

	The model is evaluated on DL3DV and RealEstate10K (RE10K) using PSNR, SSIM, and LPIPS. See the [paper](https://arxiv.org/abs/2603.23488) for full quantitative results.
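
For a quick sanity check on your own outputs, PSNR can be computed directly from a prediction and a ground-truth view. The sketch below uses the standard definition, 10 · log10(MAX² / MSE), for tensors in [0, 1]. `psnr` is a hypothetical helper, and the exact evaluation protocol (resolution, cropping, SSIM and LPIPS implementations) is assumed to follow the paper rather than this snippet.

```python
import torch


def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Per-image PSNR in dB for batches shaped (B, C, H, W) with values in [0, max_val]."""
    mse = ((pred - target) ** 2).mean(dim=(1, 2, 3))
    return 10.0 * torch.log10(max_val ** 2 / mse)


target = torch.full((1, 3, 256, 256), 0.5)
pred = target + 0.1  # uniform error of 0.1, so MSE is about 0.01
print(psnr(pred, target))  # approximately 20 dB
```

Identical images give an MSE of zero and therefore an infinite PSNR, so a small epsilon or an explicit check may be needed in a real evaluation loop.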

	---

	## Citation

	```bibtex
	@misc{ovie2026,
	  title={One View Is Enough! Monocular Training for In-the-Wild Novel View Generation},
	  author={Adrien Ramanana Rahary and Nicolas Dufour and Patrick Perez and David Picard},
	  year={2026},
	  eprint={2603.23488},
	  archivePrefix={arXiv},
	  primaryClass={cs.CV},
	  url={https://arxiv.org/abs/2603.23488},
	}
	```