SAE Trait Annotation for Organismal Images

Sparse Autoencoder (SAE) checkpoint from the ICLR 2026 paper: Automatic Image-Level Morphological Trait Annotation for Organismal Images

Model Details

Model Description

Developed by: Vardaan Pahuja, Samuel Stevens, Alyson East, Sydne Record, Yu Su
Model type: Sparse Autoencoder (SAE) trained on DINOv2 ViT-B/14 patch activations
Language(s): English (natural language trait output via downstream MLLM)
License: MIT
Fine-tuned from model: N/A — trained from scratch on DINOv2 activations (DINOv2 backbone is frozen)

This SAE is trained on penultimate-layer activations of a DINOv2 ViT-B/14 model applied to insect images from BIOSCAN-5M. Its latents capture interpretable visual features that correspond to species-level morphological traits (e.g., wing venation, body coloration, antennal structure). These latents are used to steer a multimodal LLM (Qwen2.5-VL-72B) into generating natural-language trait annotations.

Architecture:

Base encoder: DINOv2 ViT-B/14 (frozen), activations from layer -2
SAE input dimension (d-vit): 768
Expansion factor: 32 → 24,576 latent dimensions
Training data: patch-level activations from BIOSCAN-5M
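
The dimensions listed above follow from simple arithmetic. The sketch below is illustrative only: it assumes the 224×224 input size stated under "Image preprocessing dependence", and it counts only the encoder and decoder weight matrices, since bias terms are not specified in this card.

```python
# Derive the SAE shapes from the backbone settings quoted in this card.
IMAGE_SIZE = 224        # input resolution in pixels per side
PATCH_SIZE = 14         # patch size of ViT-B/14
D_VIT = 768             # hidden width of DINOv2 ViT-B
EXPANSION_FACTOR = 32   # SAE expansion factor

patches_per_side = IMAGE_SIZE // PATCH_SIZE   # patches along one image side
n_patches = patches_per_side ** 2             # patches per image, CLS excluded
d_sae = D_VIT * EXPANSION_FACTOR              # number of SAE latents
n_weights = 2 * D_VIT * d_sae                 # encoder + decoder matrices only

print(n_patches, d_sae, n_weights)
```

The patch count and latent count match the 256 patches and 24,576 latents quoted elsewhere in this card.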

Model Sources

Repository: OSU-NLP-Group/sae-trait-annotation
Paper: Automatic Image-Level Morphological Trait Annotation for Organismal Images
Project website: osu-nlp-group.github.io/sae-trait-annotation
Dataset: osunlp/bioscan-traits

Uses

Direct Use

Encoding insect specimen images to obtain sparse, interpretable feature representations.
Identifying species-level salient latent dimensions (body-part detectors) for morphological analysis.
Generating part-level localization maps (wings, legs, antennae, etc.) without pixel-level supervision.
Steering a multimodal LLM to produce structured morphological trait descriptions (as in the BIOSCAN-Traits pipeline).

Downstream Use

Fine-grained insect species classification using trait-level supervision.
Biodiversity informatics and ecological monitoring applications requiring interpretable visual features.
Interpretability research on vision foundation models (e.g., analyzing what DINOv2 learns about morphological traits).

Out-of-Scope Use

Non-insect imagery without retraining: The SAE was trained on insect photographs from BIOSCAN-5M; latent dimensions reflect insect-specific morphology and may not be meaningful for other taxa or image domains without retraining.
High-stakes identification without expert review: SAE-derived trait annotations are automatically generated and have not been validated for regulatory, conservation-management, or legal purposes.
Direct classification without downstream adaptation: The SAE produces feature representations, not species labels. Thus, it is not a standalone classifier.

Bias, Risks, and Limitations

Taxonomic bias: The SAE was trained exclusively on insect images from BIOSCAN-5M, which has uneven taxonomic and geographic coverage. Latent dimensions are optimized for species represented in that dataset and may underperform on underrepresented taxa.

Patch-level localization artifacts: Part-level localization is inferred from patch activations and is not pixel-precise. Overlapping or small body parts may not be reliably localized.
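
To make the granularity concrete, the sketch below maps a patch index to the pixel region it covers. It assumes a 224×224 input, 14-pixel patches, and row-major patch ordering after the CLS token is dropped. `patch_to_box` is a hypothetical helper, not part of saev.

```python
def patch_to_box(patch_idx, image_size=224, patch_size=14):
    """Return (left, top, right, bottom) pixel bounds of one ViT patch.

    Assumes patches are ordered row by row, left to right, with the CLS
    token already removed (so index 0 is the top-left patch).
    """
    per_side = image_size // patch_size
    if not 0 <= patch_idx < per_side ** 2:
        raise IndexError(f"patch index {patch_idx} out of range")
    row, col = divmod(patch_idx, per_side)
    left, top = col * patch_size, row * patch_size
    return (left, top, left + patch_size, top + patch_size)


first_box = patch_to_box(0)     # top-left patch
last_box = patch_to_box(255)    # bottom-right patch
```

Each patch covers a 14×14 pixel block of the preprocessed image, so any structure smaller than that cannot be separated from its neighbors by patch activations alone.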

MLLM hallucination: In the full trait-annotation pipeline, the MLLM can produce hallucinated or generic descriptions, and these errors propagate into the final trait annotations.

Image preprocessing dependence: The model expects images preprocessed to match the DINOv2 ViT-B/14 input format (224×224, standard ImageNet normalization). Out-of-distribution preprocessing will degrade activation quality.

Recommendations

Re-calibrate the activation threshold on a held-out set when applying to new image domains.
Use the model as a tool to assist expert annotation rather than as a replacement for trained entomologists.
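
This card does not prescribe a calibration procedure. One possible approach, shown purely as an illustration, is to set the threshold at a chosen percentile of the positive SAE activations observed on held-out images from the new domain. `calibrate_threshold` and its `percentile` parameter are hypothetical names, and the toy activations are made up.

```python
def calibrate_threshold(held_out_activations, percentile=90):
    """Return the value at `percentile` of the positive activations.

    This is an assumed heuristic, not the procedure used in the paper.
    """
    if not 0 <= percentile < 100:
        raise ValueError("percentile must be in [0, 100)")
    positive = sorted(a for a in held_out_activations if a > 0)
    if not positive:
        raise ValueError("no positive activations to calibrate on")
    cut = len(positive) * percentile // 100
    return positive[cut]


# Toy held-out activations: 0.0, 0.1, ..., 1.0
held_out = [i / 10 for i in range(11)]
threshold = calibrate_threshold(held_out, percentile=80)

# Same strict comparison as the `f_x > thresh` test in Getting Started.
n_active = sum(a > threshold for a in held_out)
```

In practice the activations would be flattened from the SAE output over many held-out images, and the chosen percentile would itself need validation against expert judgment.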

Getting Started

Clone the code repository (which vendors the saev library), then load and run the SAE as follows:

import torch
import saev.nn
import saev.activations
from torchvision import datasets
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"

# Build the image transform and DINOv2 ViT-B/14 backbone
img_transform = saev.activations.make_img_transform("dinov2", "dinov2_vitb14")
vit = saev.activations.make_vit("dinov2", "dinov2_vitb14")

# Wrap the ViT to record activations from layer 10 (penultimate), 256 patches
recorded_vit = saev.activations.RecordedVisionTransformer(
    vit, n_patches=256, cls_token=True, layers=[10]
).to(device)

# Load the SAE checkpoint
sae = saev.nn.load("sae.pt").to(device)
sae.eval()

# --- Encode a batch of images ---
dataset = datasets.ImageFolder(root="/path/to/images/train")

def collate_fn(batch):
    images, labels = zip(*batch)
    return list(images), torch.tensor(labels)

loader = DataLoader(dataset, batch_size=32, shuffle=False, collate_fn=collate_fn)

with torch.no_grad():
    for images, labels in loader:
        # Apply the transform per image, then stack into one batch tensor
        images_t = torch.stack([img_transform(img) for img in images]).to(device)

        # vit_acts: (batch, n_layers, n_patches+1, d_vit)
        _, vit_acts = recorded_vit(images_t)

        # Select layer 0 of the recorded layers, drop the CLS token
        vit_acts = vit_acts[:, 0, 1:, :]   # (batch, 256, 768)

        # SAE forward: returns (reconstruction, features, aux)
        _, f_x, _ = sae(vit_acts)           # f_x: (batch, 256, 24576)

        # Threshold activations to find active latents (default thresh=0.9)
        active = (f_x > 0.9)               # (batch, 256, 24576) bool

The active latent indices per patch identify which SAE dimensions fire on each image region. These are used downstream to find species-prominent latents and generate trait annotations via an MLLM. See create_trait_dataset_mllm_sae.py for the full pipeline.
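
As a rough illustration of the species-prominence step, the sketch below ranks latents by how much more often they fire on images of one species than on all other images. This is an assumed scoring rule for illustration only; the paper's species-specificity score may be defined differently. The per-image latent sets stand in for `active` reduced over the patch axis, and `prominent_latents` is a hypothetical helper.

```python
from collections import Counter


def prominent_latents(records, species, top_k=3):
    """Rank latents by (firing rate in `species`) - (firing rate elsewhere).

    `records` is a list of (species_label, set_of_active_latent_ids),
    one entry per image.
    """
    in_counts, out_counts = Counter(), Counter()
    n_in = n_out = 0
    for label, latents in records:
        if label == species:
            n_in += 1
            in_counts.update(latents)
        else:
            n_out += 1
            out_counts.update(latents)
    if n_in == 0:
        raise ValueError(f"no images for species {species!r}")

    scores = {}
    for latent, count in in_counts.items():
        rate_in = count / n_in
        rate_out = out_counts[latent] / n_out if n_out else 0.0
        scores[latent] = rate_in - rate_out
    # Highest score first; ties broken by latent id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Toy data: latent 1 fires only on species "A".
records = [("A", {1, 2}), ("A", {1, 3}), ("B", {2, 3}), ("B", {3})]
ranking = prominent_latents(records, "A")
```

Here latent 1 ranks first for species "A" because it fires on every "A" image and on no "B" image.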

Training Details

Training Data

The SAE was trained on patch-level activations extracted from BIOSCAN-5M insect images. Images were preprocessed into ImageFolder layout and activations were dumped from layer -2 of a frozen DINOv2 ViT-B/14 backbone (256 patches per image, CLS token excluded).

Training Procedure

Preprocessing

Raw BIOSCAN-5M images were resized and normalized to the DINOv2 ViT-B/14 input format (224×224, ImageNet mean/std). Patch activations were extracted in shards and stored on disk before SAE training.
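
For reference, the normalization half of this step can be sketched per pixel as below, using the standard ImageNet channel statistics. The resize step is omitted, and `normalize_pixel` is a hypothetical helper; in practice the transform returned by the saev library should be used.

```python
# Standard ImageNet channel statistics (RGB order, for inputs scaled to [0, 1]).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Normalize one 8-bit RGB pixel: scale to [0, 1], then standardize."""
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


black = normalize_pixel((0, 0, 0))
white = normalize_pixel((255, 255, 255))
```

Images normalized with different statistics would shift the backbone's inputs away from what the SAE saw during training.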

Training Hyperparameters

Hyperparameter	Value
Learning rate	1e-3
Sparsity coefficient (α)	4e-4
SAE input dimension	768
Expansion factor	32 (→ 24,576 latents)
Patch mode	patch-level (256 patches/image)
Scale mean	False
Scale norm	False

Speeds, Sizes, Times

All experiments were run on 2× NVIDIA H100 80GB GPUs at the Ohio Supercomputer Center (OSC). The table below reports end-to-end pipeline runtime measured on the BIOSCAN-Traits workload (Section 4.5 / Table 4 of the paper):

Task	Time	Notes
Activation computation (1 image)	2.74 ms	DINOv2 ViT-B/14 backbone
SAE forward pass (1 image)	4.53 ms	Sparse Autoencoder
Total preprocessing (1 image)	7.26 ms	Feature extraction + SAE
MLLM inference (3 images / annotation)	4.62 s	Qwen2.5-VL-72B
Throughput	208.9 annotations / h / GPU	2× H100 80GB

The SAE preprocessing step (7.26 ms/image) is negligible relative to MLLM inference (4.62 s/annotation), which dominates the pipeline cost.
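
The comparison can be made explicit with the figures from the table. The sketch assumes that each of the 3 images in an annotation incurs the full 7.26 ms preprocessing cost, which the table does not state outright.

```python
# Figures taken from the runtime table above.
PREPROCESS_MS_PER_IMAGE = 7.26
IMAGES_PER_ANNOTATION = 3
MLLM_S_PER_ANNOTATION = 4.62

# Preprocessing time per annotation, converted to seconds.
preprocess_s = IMAGES_PER_ANNOTATION * PREPROCESS_MS_PER_IMAGE / 1000.0

# Fraction of per-annotation time spent on feature extraction + SAE.
share = preprocess_s / (preprocess_s + MLLM_S_PER_ANNOTATION)

print(f"preprocessing: {preprocess_s:.5f} s, share of total: {share:.2%}")
```

Under that assumption, preprocessing accounts for well under 1% of the time per annotation.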

Evaluation

Testing Data

Species-specificity scores were computed across the BIOSCAN-5M training distribution to identify latent dimensions that activate strongly and selectively for individual species. Downstream evaluation of trait annotation quality was performed via human expert review.

Metrics

Trait annotation quality: Assessed qualitatively by human annotators on a five-point Likert scale (1-5); see the paper for details.

Results

Please refer to the paper (arxiv.org/abs/2604.01619) for full evaluation results, including species-level trait annotation quality, downstream classification performance with BioCLIP, and ablation studies.

Environmental Impact

All experiments were performed on 2× NVIDIA H100 80GB GPUs at the Ohio Supercomputer Center (OSC). The pipeline achieves a throughput of 208.9 annotations/h/GPU; generating the full 80,806-sample BIOSCAN-Traits dataset required roughly 194 GPU-hours of MLLM inference (Qwen2.5-VL-72B at 4.62 s/annotation). SAE training and DINOv2 activation extraction added a comparatively small overhead (7.26 ms preprocessing per image). Carbon emission estimates were not tracked; OSC uses a mix of energy sources typical of Midwestern US grid infrastructure.

Technical Specifications

Model Architecture

The SAE follows the standard sparse dictionary learning architecture:

Encoder: Linear projection from d_vit=768 → d_sae=24576 followed by ReLU activation
Decoder: Linear projection from d_sae=24576 → d_vit=768 (columns normalized to unit norm)
Sparsity: L1 penalty on encoder activations weighted by coefficient α=4e-4
Input: Patch-level DINOv2 ViT-B/14 activations from layer -2 (penultimate transformer block)
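
The components above can be sketched on a toy example in plain Python. The bias terms, the mean-squared-error reconstruction loss, and all function names here are assumptions for illustration; the real model uses d_vit=768 and d_sae=24576 and is implemented in saev.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * a for w, a in zip(row, x)) for row in W]


def normalize_columns(W):
    """Scale each column of W to unit L2 norm."""
    n_rows, n_cols = len(W), len(W[0])
    norms = [math.sqrt(sum(W[r][c] ** 2 for r in range(n_rows)))
             for c in range(n_cols)]
    return [[W[r][c] / norms[c] for c in range(n_cols)] for r in range(n_rows)]


def sae_forward(x, W_enc, b_enc, W_dec, b_dec, alpha):
    """Return (reconstruction, features, loss) for one input vector."""
    # Encoder: linear projection followed by ReLU.
    f = [max(0.0, h + b) for h, b in zip(matvec(W_enc, x), b_enc)]
    # Decoder: linear projection back to the input space.
    x_hat = [h + b for h, b in zip(matvec(W_dec, f), b_dec)]
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    l1 = sum(abs(a) for a in f)          # L1 sparsity penalty on features
    return x_hat, f, mse + alpha * l1


# Toy sizes: d_vit=2, d_sae=3.
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]                     # (d_sae, d_vit)
W_dec = normalize_columns([[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]])   # (d_vit, d_sae)
x_hat, f, loss = sae_forward([2.0, -1.0], W_enc, [0.0] * 3,
                             W_dec, [0.0] * 2, alpha=4e-4)
```

Only the first latent fires on this input, which is the kind of sparse response the L1 penalty encourages.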

Compute Infrastructure

Hardware: 2× NVIDIA H100 80GB GPUs at the Ohio Supercomputer Center
Framework: PyTorch, via the SAEV library (vendored in the code repository)
DINOv2 backbone: dinov2_vitb14 (frozen, loaded via torch.hub)

Citation

@inproceedings{
  pahuja2026automatic,
  title={Automatic Image-Level Morphological Trait Annotation for Organismal Images},
  author={Vardaan Pahuja and Samuel Stevens and Alyson East and Sydne Record and Yu Su},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=oFRbiaib5Q}
}

Please also cite the source dataset:

@inproceedings{gharaee2024bioscan5m,
    title={{BIOSCAN-5M}: A Multimodal Dataset for Insect Biodiversity},
    booktitle={Advances in Neural Information Processing Systems},
    author={Zahra Gharaee and Scott C. Lowe and ZeMing Gong and Pablo Millan Arias and Nicholas Pellegrino and Austin T. Wang and Joakim Bruslund Haurum and Iuliia Zarubiieva and Lila Kari and Dirk Steinke and Graham W. Taylor and Paul Fieguth and Angel X. Chang},
    editor={A. Globerson and L. Mackey and D. Belgrave and A. Fan and U. Paquet and J. Tomczak and C. Zhang},
    pages={36285--36313},
    publisher={Curran Associates, Inc.},
    year={2024},
    volume={37},
    url={https://proceedings.neurips.cc/paper_files/paper/2024/file/3fdbb472813041c9ecef04c20c2b1e5a-Paper-Datasets_and_Benchmarks_Track.pdf},
}

Glossary

SAE (Sparse Autoencoder): A neural network trained to decompose dense activations into sparse, interpretable latent dimensions; here used to identify body-part detectors in DINOv2 features.
DINOv2: A self-supervised vision transformer trained via self-distillation; used as a frozen feature extractor.
Latent dimension / feature: A single neuron in the SAE's expanded representation; high-activation latents correspond to interpretable visual concepts (e.g., wing venation, leg color).
Species-specificity score: A metric quantifying how selectively a latent activates for one species vs. the full population; used to identify morphologically diagnostic features.
MLLM (Multimodal Large Language Model): A large language model capable of processing both images and text; here Qwen2.5-VL-72B is used to verbalize SAE-identified part activations into natural language trait descriptions.
Morphological trait: An observable characteristic of an organism's physical form (e.g., wing shape, antenna length, body coloration).
BIOSCAN-5M: The large-scale source dataset of ~5 million insect specimen images used to train the SAE.

Acknowledgments

Code

SAEV for sparse autoencoder training infrastructure.
BioCLIP for downstream training/evaluation tooling.

Funding

This research was supported in part by NSF CAREER #2443149, NSF OAC 2118240, and an Alfred P. Sloan Foundation Fellowship. Computational resources were provided by the Ohio Supercomputer Center.

S. Record and A. East were additionally supported by NSF Award No. 242918 (EPSCOR Research Fellows: Advancing NEON-Enabled Science and Workforce Development at the University of Maine with AI) and Hatch project Award #MEO-022425 from the USDA National Institute of Food and Agriculture. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or the US Department of Agriculture.

People

We thank colleagues in the OSU NLP group for valuable feedback. This work was in part conceived at Funcapalooza.

Model Card Authors

Vardaan Pahuja

Model Card Contact

Vardaan Pahuja (vardaanpahuja@gmail.com)
