Duoduo CLIP: Efficient 3D Understanding with Multi-View Images

ICLR 2025

Han-Hung Lee¹*, Yiming Zhang¹*, Angel Xuan Chang¹,²
* Equal contribution. ¹ Simon Fraser University. ² Alberta Machine Intelligence Institute (Amii).

This repository provides the Hugging Face transformers version of Duoduo CLIP. The model learns 3D object representations from multi-view rendered images instead of point clouds, while keeping the familiar CLIP-style text and image embedding interface.

The model supports text prompts, single-view images, and multi-view image sets. Multi-view inputs are aggregated into one feature vector per object, making the model useful for text-to-3D retrieval, image-to-3D retrieval, and multi-view 3D representation extraction.

Quick Start

import torch
import torch.nn.functional as F
from huggingface_hub import hf_hub_download
from PIL import Image
from transformers import AutoModel, AutoProcessor

repo_id = "3dlg-hcvc/DuoduoCLIP-B-32"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModel.from_pretrained(
    repo_id,
    trust_remote_code=True,
    attn_implementation="sdpa",
    dtype=torch.float32,
    device_map=device,
)
processor = AutoProcessor.from_pretrained(repo_id)
model.eval()

def load_demo_shape():
    # Download the three demo views stored in the model repo and load them as RGB images.
    view_paths = [
        hf_hub_download(repo_id=repo_id, filename=f"assets/demo/view_{view_id:03d}.png")
        for view_id in range(3)
    ]
    return [Image.open(path).convert("RGB") for path in view_paths]

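@torch.inference_mode()
def encode_shape(view_images):
    # The processor stacks the views into pixel_values of shape (num_views, 3, H, W).
    image_inputs = processor(images=view_images, return_tensors="pt").to(device)
    # Add a batch dimension so all views belong to one object: (1, num_views, 3, H, W).
    pixel_values = image_inputs["pixel_values"].unsqueeze(0)
    return F.normalize(model.get_image_features(pixel_values=pixel_values), dim=-1)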

@torch.inference_mode()
def encode_text(texts):
    text_inputs = processor(text=texts, return_tensors="pt", padding=True).to(device)
    return F.normalize(model.get_text_features(**text_inputs), dim=-1)

Text-Shape Retrieval

texts = ["a 3D model of a telephone", "a 3D model of a chair", "a 3D model of a sofa"]
text_features = encode_text(texts)

shape_features = encode_shape(load_demo_shape())
text_shape_scores = shape_features @ text_features.T
print(text_shape_scores.softmax(dim=-1))
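Because the scores above are raw cosine similarities in [-1, 1], the softmax over them tends to be close to uniform. CLIP-style models usually multiply similarities by a learned logit scale before the softmax. The following is a minimal sketch of that effect with made-up similarity values and an assumed scale of 100 (the upper bound used when training the original CLIP); the scale actually learned by DuoduoCLIP is not stated here and may differ.

```python
import torch

# Hypothetical cosine similarities between one shape and three prompts.
scores = torch.tensor([[0.30, 0.22, 0.20]])

# Assumption: a CLIP-like temperature. The original CLIP caps its learned
# logit scale at 100; the value used by DuoduoCLIP may be different.
assumed_logit_scale = 100.0

flat_probs = scores.softmax(dim=-1)                           # nearly uniform
sharp_probs = (assumed_logit_scale * scores).softmax(dim=-1)  # clearly peaked

print(flat_probs)
print(sharp_probs)
```

The ranking is the same in both cases; only the confidence of the distribution changes, so the scaling matters when the probabilities themselves are reported or thresholded.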

Shape-Shape Retrieval

# Query with only the first two views of the demo shape.
query_shape = load_demo_shape()[:2]
gallery_names = ["telephone"]
gallery_shapes = [load_demo_shape()]

query_features = encode_shape(query_shape)
gallery_features = torch.cat([encode_shape(shape) for shape in gallery_shapes], dim=0)
shape_shape_scores = query_features @ gallery_features.T
# .item() works here because there is exactly one query; index per row for batched queries.
print(gallery_names[shape_shape_scores.argmax(dim=-1).item()])
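With a gallery of one shape the argmax is trivial. The sketch below shows how the same score matrix could be ranked for several queries against a larger gallery. It uses random stand-in features rather than the model; the gallery labels and the 512-dimensional feature size are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins for encode_shape outputs; labels and feature size are hypothetical.
gallery_names = ["telephone", "chair", "sofa", "lamp"]
gallery_features = F.normalize(torch.randn(4, 512), dim=-1)

# Two queries built as slightly perturbed copies of gallery entries 2 and 0.
noisy = gallery_features[[2, 0]] + 0.01 * torch.randn(2, 512)
query_features = F.normalize(noisy, dim=-1)

scores = query_features @ gallery_features.T        # (num_queries, num_gallery)
top_scores, top_indices = scores.topk(k=2, dim=-1)  # two best matches per query

for query_id, (row_scores, row_indices) in enumerate(zip(top_scores, top_indices)):
    ranked = [
        (gallery_names[i], round(s, 3))
        for i, s in zip(row_indices.tolist(), row_scores.tolist())
    ]
    print(query_id, ranked)
```

Using topk along the last dimension keeps one ranked list per query row, so no .item() call is needed and the code works for any number of queries.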

Input Formats

DuoduoCLIP accepts the same processor outputs as CLIP for text and images. Image features returned by get_image_features are unnormalized projected features, so normalize them before retrieval.
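To illustrate why normalization matters for retrieval, the toy sketch below compares raw dot products with cosine similarities. The vectors are made up for the example and are not model outputs.

```python
import torch
import torch.nn.functional as F

query = torch.tensor([[1.0, 1.0]])
gallery = torch.tensor([
    [10.0, 0.0],  # large norm, different direction from the query
    [1.0, 1.0],   # same direction as the query
])

# Raw dot products are dominated by vector length.
raw_scores = query @ gallery.T

# Dot products of L2-normalized vectors are cosine similarities.
cosine_scores = F.normalize(query, dim=-1) @ F.normalize(gallery, dim=-1).T

print(raw_scores.argmax(dim=-1).item())     # picks index 0, the longest vector
print(cosine_scores.argmax(dim=-1).item())  # picks index 1, the matching direction
```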

Multi-View Image Set

Pass the view images through the processor, then add a leading batch dimension so that pixel_values has shape (B, F, 3, H, W), where B is the number of objects and F is the number of views per object:

image_inputs = processor(images=view_images, return_tensors="pt").to(device)
pixel_values = image_inputs["pixel_values"].unsqueeze(0)
image_features = F.normalize(model.get_image_features(pixel_values=pixel_values), dim=-1)

A flattened (B * F, 3, H, W) tensor is also supported when num_views=F is provided.
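The two layouts hold the same data and differ only in shape. The sketch below converts between them on a random tensor. It assumes that the flattened layout stores the views of each object contiguously (object-major order) and uses a 224 x 224 resolution as a placeholder; neither detail is confirmed here, and the model call is left out.

```python
import torch

batch_size, num_views, height, width = 2, 3, 224, 224  # placeholder sizes

# Batched layout: one group of views per object.
batched = torch.randn(batch_size, num_views, 3, height, width)

# Flattened layout: all views stacked along the first axis, object by object
# (assumed ordering).
flattened = batched.reshape(batch_size * num_views, 3, height, width)

# Recover the batched layout from the flattened one.
restored = flattened.reshape(batch_size, num_views, 3, height, width)

print(flattened.shape)  # torch.Size([6, 3, 224, 224])
print(torch.equal(batched, restored))

# Under this ordering, view f of object b sits at row b * num_views + f.
print(torch.equal(flattened[1 * num_views + 2], batched[1, 2]))
```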

Single-View Image

image_inputs = processor(images=view_images[0], return_tensors="pt").to(device)
image_features = F.normalize(model.get_image_features(**image_inputs), dim=-1)

Notes

get_text_features and get_image_features follow Hugging Face CLIP semantics and return unnormalized projected features. Use torch.nn.functional.normalize before computing retrieval similarities. Calling the full model forward returns normalized text_embeds, normalized image_embeds, and CLIP-style logits.

This repository includes the CLIP processor and tokenizer files, so both the model and the processor can be loaded directly from the same model repo.

Citation

@inproceedings{lee2025duoduo,
    title={Duoduo CLIP: Efficient 3D understanding with multi-view images},
    author={Lee, Han-Hung and Zhang, Yiming and Chang, Angel Xuan},
    booktitle={International Conference on Learning Representations},
    volume={2025},
    pages={48070--48091},
    year={2025}
}