SAE Trait Annotation for Organismal Images
Sparse Autoencoder (SAE) checkpoint from the ICLR 2026 paper: Automatic Image-Level Morphological Trait Annotation for Organismal Images
Model Details
Model Description
- Developed by: Vardaan Pahuja, Samuel Stevens, Alyson East, Sydne Record, Yu Su
- Model type: Sparse Autoencoder (SAE) trained on DINOv2 ViT-B/14 patch activations
- Language(s): English (natural language trait output via downstream MLLM)
- License: MIT
- Fine-tuned from model: N/A — trained from scratch on DINOv2 activations (DINOv2 backbone is frozen)
This SAE is trained on penultimate-layer activations of a DINOv2 ViT-B/14 model applied to insect images from BIOSCAN-5M. Its latents capture interpretable visual features that correspond to species-level morphological traits (e.g., wing venation, body coloration, antennal structure). These latents are used to steer a multimodal LLM (Qwen2.5-VL-72B) into generating natural-language trait annotations.
Architecture:
- Base encoder: DINOv2 ViT-B/14 (frozen), activations from layer -2
- SAE input dimension (d_vit): 768
- Expansion factor: 32 → 24,576 latent dimensions
- Training data: patch-level activations from BIOSCAN-5M
Model Sources
- Repository: OSU-NLP-Group/sae-trait-annotation
- Paper: Automatic Image-Level Morphological Trait Annotation for Organismal Images
- Project website: osu-nlp-group.github.io/sae-trait-annotation
- Dataset: osunlp/bioscan-traits
Uses
Direct Use
- Encoding insect specimen images to obtain sparse, interpretable feature representations.
- Identifying species-level salient latent dimensions (body-part detectors) for morphological analysis.
- Generating part-level localization maps (wings, legs, antennae, etc.) without pixel-level supervision.
- Steering a multimodal LLM to produce structured morphological trait descriptions (as in the BIOSCAN-Traits pipeline).
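As an illustration of how a patch-level localization map can be formed, the sketch below reshapes one latent's 256 per-patch activations into the 16×16 patch grid of a 224×224 input (224 / 14 = 16) and upsamples it to pixel resolution by repeating each value over its patch. The function name `latent_heatmap`, the row-major patch ordering, and the random stand-in activations are assumptions made for illustration. NumPy is used only for brevity, and the repository's own visualization code may differ.

```python
import numpy as np

PATCH_SIZE = 14           # DINOv2 ViT-B/14 patch size in pixels
GRID = 224 // PATCH_SIZE  # 16 patches per side -> 256 patches per image


def latent_heatmap(patch_acts: np.ndarray) -> np.ndarray:
    """Turn (256,) per-patch activations of one latent into a (224, 224) map."""
    if patch_acts.shape != (GRID * GRID,):
        raise ValueError(f"expected {GRID * GRID} patch activations")
    grid = patch_acts.reshape(GRID, GRID)  # row-major patch order (assumed)
    # Repeat each patch value over its 14x14 pixel footprint.
    return np.kron(grid, np.ones((PATCH_SIZE, PATCH_SIZE), dtype=grid.dtype))


# Stand-in for f_x[image_idx, :, latent_idx] from the SAE forward pass.
rng = np.random.default_rng(0)
acts = np.maximum(rng.normal(size=GRID * GRID), 0.0)
heatmap = latent_heatmap(acts)
```

Because every pixel inside a patch receives the same value, the resulting map is only as precise as the 14×14 patch grid.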
Downstream Use
- Fine-grained insect species classification using trait-level supervision.
- Biodiversity informatics and ecological monitoring applications requiring interpretable visual features.
- Interpretability research on vision foundation models (e.g., analyzing what DINOv2 learns about morphological traits).
Out-of-Scope Use
- Non-insect imagery without retraining: The SAE was trained on insect photographs from BIOSCAN-5M; latent dimensions reflect insect-specific morphology and may not be meaningful for other taxa or image domains without retraining.
- High-stakes identification without expert review: SAE-derived trait annotations are automatically generated and have not been validated for regulatory, conservation-management, or legal purposes.
- Direct classification without downstream adaptation: The SAE produces feature representations, not species labels, so it is not a standalone classifier.
Bias, Risks, and Limitations
Taxonomic bias: The SAE was trained exclusively on insect images from BIOSCAN-5M, which has uneven taxonomic and geographic coverage. Latent dimensions are optimized for species represented in that dataset and may underperform on underrepresented taxa.
Patch-level localization artifacts: Part-level localization is inferred from patch activations and is not pixel-precise. Overlapping or small body parts may not be reliably localized.
MLLM hallucination: When the SAE is used in the full trait-annotation pipeline, MLLM errors (hallucinated or generic descriptions) propagate into the final annotations.
Image preprocessing dependence: The model expects images preprocessed to match the DINOv2 ViT-B/14 input format (224×224, standard ImageNet normalization). Out-of-distribution preprocessing will degrade activation quality.
Recommendations
- Re-calibrate the activation threshold on a held-out set when applying to new image domains.
- Use the model as a tool to assist expert annotation rather than as a replacement for trained entomologists.
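One possible way to act on the first recommendation is to pick the threshold so that the fraction of active (patch, latent) entries on a held-out set matches a chosen target rate. The sketch below is a hypothetical strategy, not the authors' calibration procedure: `calibrate_threshold` and `target_rate` are names introduced here, and the toy activations stand in for real SAE outputs.

```python
import numpy as np


def calibrate_threshold(heldout_acts: np.ndarray, target_rate: float) -> float:
    """Return the threshold at which a `target_rate` fraction of entries is active.

    heldout_acts: SAE activations on held-out images, any shape
                  (e.g. (n_images, n_patches, n_latents)).
    target_rate:  desired fraction of (patch, latent) entries above threshold,
                  for example the rate observed in-domain at thresh=0.9.
    """
    if not 0.0 < target_rate < 1.0:
        raise ValueError("target_rate must be in (0, 1)")
    return float(np.quantile(heldout_acts.ravel(), 1.0 - target_rate))


# Toy stand-in for held-out activations (ReLU outputs are non-negative).
rng = np.random.default_rng(0)
heldout = np.maximum(rng.normal(size=(8, 256, 64)), 0.0)
thresh = calibrate_threshold(heldout, target_rate=0.05)
active_rate = float((heldout > thresh).mean())
```

For real activation dumps the quantile would likely need to be estimated on a subsample, since the full (images × patches × 24,576) tensor is large.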
Getting Started
Clone the code repository (which vendors the saev library), then load and run the SAE as follows:
import torch
import saev.nn
import saev.activations
from torchvision import datasets
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"

# Build the image transform and DINOv2 ViT-B/14 backbone
# (both take the model family and the backbone checkpoint name)
img_transform = saev.activations.make_img_transform("dinov2", "dinov2_vitb14")
vit = saev.activations.make_vit("dinov2", "dinov2_vitb14")

# Wrap the ViT to record activations from layer 10 (penultimate), 256 patches
recorded_vit = saev.activations.RecordedVisionTransformer(
    vit, n_patches=256, cls_token=True, layers=[10]
).to(device)

# Load the SAE checkpoint
sae = saev.nn.load("sae.pt").to(device)
sae.eval()

# --- Encode a batch of images ---
dataset = datasets.ImageFolder(root="/path/to/images/train")


def collate_fn(batch):
    # Keep PIL images as a list; the transform is applied inside the loop
    images, labels = zip(*batch)
    return list(images), torch.tensor(labels)


loader = DataLoader(dataset, batch_size=32, shuffle=False, collate_fn=collate_fn)

with torch.no_grad():
    for images, labels in loader:
        # Transform each image, then stack into a (batch, 3, 224, 224) tensor
        images_t = torch.stack([img_transform(img) for img in images]).to(device)
        # vit_acts: (batch, n_layers, n_patches+1, d_vit)
        _, vit_acts = recorded_vit(images_t)
        # Select layer 0 of the recorded layers, drop the CLS token
        vit_acts = vit_acts[:, 0, 1:, :]  # (batch, 256, 768)
        # SAE forward: returns (reconstruction, features, aux)
        _, f_x, _ = sae(vit_acts)  # f_x: (batch, 256, 24576)
        # Threshold activations to find active latents (default thresh=0.9)
        active = f_x > 0.9  # (batch, 256, 24576) bool
The active latent indices per patch identify which SAE dimensions fire on each image region. These are used downstream to find species-prominent latents and generate trait annotations via an MLLM. See create_trait_dataset_mllm_sae.py for the full pipeline.
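To give a sense of how the thresholded tensor can be aggregated, the sketch below collapses patch-level activity to image level and ranks latents by how much more often they fire on one species than on the rest. This is a simplified, hypothetical score for illustration only: it is not the species-specificity score defined in the paper, and `species_prominent_latents` is not a function from the repository.

```python
import numpy as np


def species_prominent_latents(active, labels, species, top_k=5):
    """Rank latents by how much more often they fire on `species` than elsewhere.

    active: (n_images, n_patches, n_latents) boolean array, as produced by
            thresholding SAE activations.
    labels: (n_images,) integer species labels.
    """
    fires = active.any(axis=1)                # (n_images, n_latents): fires on any patch
    in_species = labels == species
    freq_in = fires[in_species].mean(axis=0)  # firing frequency within the species
    freq_out = fires[~in_species].mean(axis=0)
    score = freq_in - freq_out                # illustrative score, not the paper's
    order = np.argsort(-score)[:top_k]
    return order, score[order]


# Toy data: 6 images, 4 patches, 8 latents; latent 3 fires only on species 1.
active = np.zeros((6, 4, 8), dtype=bool)
labels = np.array([0, 0, 0, 1, 1, 1])
active[3:, 2, 3] = True   # species-1 images activate latent 3 on patch 2
active[:, 0, 5] = True    # latent 5 fires everywhere, so it is not specific
top, scores = species_prominent_latents(active, labels, species=1, top_k=2)
```

In the toy data, latent 3 ranks first because it fires on every species-1 image and on no others, while the always-on latent 5 scores zero.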
Training Details
Training Data
The SAE was trained on patch-level activations extracted from BIOSCAN-5M insect images. Images were preprocessed into ImageFolder layout and activations were dumped from layer -2 of a frozen DINOv2 ViT-B/14 backbone (256 patches per image, CLS token excluded).
Training Procedure
Preprocessing
Raw BIOSCAN-5M images were resized and normalized to the DINOv2 ViT-B/14 input format (224×224, ImageNet mean/std). Patch activations were extracted in shards and stored on disk before SAE training.
Training Hyperparameters
| Hyperparameter | Value |
|---|---|
| Learning rate | 1e-3 |
| Sparsity coefficient (α) | 4e-4 |
| SAE input dimension | 768 |
| Expansion factor | 32 (→ 24,576 latents) |
| Patch mode | patch-level (256 patches/image) |
| Scale mean | False |
| Scale norm | False |
Speeds, Sizes, Times
All experiments were run on 2× NVIDIA H100 80GB GPUs at the Ohio Supercomputer Center (OSC). The table below reports end-to-end pipeline runtime measured on the BIOSCAN-Traits workload (Section 4.5 / Table 4 of the paper):
| Task | Time | Notes |
|---|---|---|
| Activation computation (1 image) | 2.74 ms | DINOv2 ViT-B/14 backbone |
| SAE forward pass (1 image) | 4.53 ms | Sparse Autoencoder |
| Total preprocessing (1 image) | 7.26 ms | Feature extraction + SAE |
| MLLM inference (3 images / annotation) | 4.62 s | Qwen2.5-VL-72B |
| Throughput | 208.9 annotations / h / GPU | 2× H100 80GB |
The SAE preprocessing step (7.26 ms/image) is negligible relative to MLLM inference (4.62 s/annotation), which dominates the pipeline cost.
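To make the comparison concrete, the short calculation below combines the figures from the table. It assumes that each annotation preprocesses its three images one after another and that the timings simply add, which the table does not state explicitly.

```python
# Figures reported in the runtime table above.
PREPROCESS_MS_PER_IMAGE = 7.26   # DINOv2 activations + SAE forward pass
MLLM_S_PER_ANNOTATION = 4.62     # Qwen2.5-VL-72B inference
IMAGES_PER_ANNOTATION = 3

# Assumption: the three images are preprocessed sequentially.
preprocess_s = IMAGES_PER_ANNOTATION * PREPROCESS_MS_PER_IMAGE / 1000.0
total_s = preprocess_s + MLLM_S_PER_ANNOTATION
preprocess_share = preprocess_s / total_s

print(f"preprocessing per annotation: {preprocess_s * 1000:.2f} ms")
print(f"share of per-annotation time: {preprocess_share:.2%}")
```

Under these assumptions, preprocessing accounts for less than half a percent of the time spent on each annotation.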
Evaluation
Testing Data
Species-specificity scores were computed across the BIOSCAN-5M training distribution to identify latent dimensions that activate strongly and selectively for individual species. Downstream evaluation of trait annotation quality was performed via human expert review.
Metrics
- Trait annotation quality: Rated by human annotators on a Likert scale (1-5); see the paper for details.
Results
Please refer to the paper (arxiv.org/abs/2604.01619) for full evaluation results, including species-level trait annotation quality, downstream classification performance with BioCLIP, and ablation studies.
Environmental Impact
All experiments were performed on 2× NVIDIA H100 80GB GPUs at the Ohio Supercomputer Center (OSC). The pipeline achieves a throughput of 208.9 annotations/h/GPU; generating the full 80,806-sample BIOSCAN-Traits dataset required roughly 194 GPU-hours of MLLM inference (Qwen2.5-VL-72B at 4.62 s/annotation). SAE training and DINOv2 activation extraction added a comparatively small overhead (7.26 ms preprocessing per image). Carbon emission estimates were not tracked; OSC uses a mix of energy sources typical of Midwestern US grid infrastructure.
Technical Specifications
Model Architecture
The SAE follows the standard sparse dictionary learning architecture:
- Encoder: Linear projection from d_vit=768 → d_sae=24576 followed by ReLU activation
- Decoder: Linear projection from d_sae=24576 → d_vit=768 (columns normalized to unit norm)
- Sparsity: L1 penalty on encoder activations weighted by coefficient α=4e-4
- Input: Patch-level DINOv2 ViT-B/14 activations from layer -2 (penultimate transformer block)
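The components listed above can be sketched in a few lines. The version below is a minimal NumPy illustration, not the SAEV implementation: the bias terms, the initialization, and the exact loss form (squared reconstruction error plus α times the L1 norm of the latents) are assumptions, and small demo dimensions replace the real 768 → 24,576 sizes.

```python
import numpy as np

D_VIT, EXPANSION = 768, 32
D_SAE = D_VIT * EXPANSION   # 24,576 latents
ALPHA = 4e-4                # sparsity coefficient from the hyperparameter table


def init_params(d_in, d_sae, rng):
    w_enc = rng.normal(scale=d_in ** -0.5, size=(d_in, d_sae))
    w_dec = rng.normal(size=(d_sae, d_in))
    # One dictionary vector per latent, stored as rows here (these are the
    # decoder "columns" in column-vector convention); scale each to unit norm.
    w_dec /= np.linalg.norm(w_dec, axis=1, keepdims=True)
    return w_enc, np.zeros(d_sae), w_dec, np.zeros(d_in)


def sae_forward(x, w_enc, b_enc, w_dec, b_dec, alpha=ALPHA):
    f_x = np.maximum(x @ w_enc + b_enc, 0.0)   # encoder: linear + ReLU
    x_hat = f_x @ w_dec + b_dec                # decoder: linear
    mse = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    l1 = np.mean(np.sum(np.abs(f_x), axis=-1))
    return x_hat, f_x, mse + alpha * l1        # assumed loss: MSE + alpha * L1


# Small demo dimensions keep the example fast; the released SAE uses
# d_in=768 and d_sae=24,576.
rng = np.random.default_rng(0)
params = init_params(d_in=16, d_sae=16 * EXPANSION, rng=rng)
x = rng.normal(size=(4, 16))           # 4 stand-in patch activations
x_hat, f_x, loss = sae_forward(x, *params)
```

The return order mirrors the (reconstruction, features, aux) triple returned by the SAE forward call in Getting Started, with the scalar loss standing in for the auxiliary output.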
Compute Infrastructure
- Hardware: 2× NVIDIA H100 80GB GPUs at the Ohio Supercomputer Center
- Framework: PyTorch, via the SAEV library (vendored in the code repository)
- DINOv2 backbone: dinov2_vitb14 (frozen, loaded via torch.hub)
Citation
@inproceedings{
pahuja2026automatic,
title={Automatic Image-Level Morphological Trait Annotation for Organismal Images},
author={Vardaan Pahuja and Samuel Stevens and Alyson East and Sydne Record and Yu Su},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=oFRbiaib5Q}
}
Please also cite the source dataset:
@inproceedings{gharaee2024bioscan5m,
title={{BIOSCAN-5M}: A Multimodal Dataset for Insect Biodiversity},
booktitle={Advances in Neural Information Processing Systems},
author={Zahra Gharaee and Scott C. Lowe and ZeMing Gong and Pablo Millan Arias and Nicholas Pellegrino and Austin T. Wang and Joakim Bruslund Haurum and Iuliia Zarubiieva and Lila Kari and Dirk Steinke and Graham W. Taylor and Paul Fieguth and Angel X. Chang},
editor={A. Globerson and L. Mackey and D. Belgrave and A. Fan and U. Paquet and J. Tomczak and C. Zhang},
pages={36285--36313},
publisher={Curran Associates, Inc.},
year={2024},
volume={37},
url={https://proceedings.neurips.cc/paper_files/paper/2024/file/3fdbb472813041c9ecef04c20c2b1e5a-Paper-Datasets_and_Benchmarks_Track.pdf},
}
Glossary
- SAE (Sparse Autoencoder): A neural network trained to decompose dense activations into sparse, interpretable latent dimensions; here used to identify body-part detectors in DINOv2 features.
- DINOv2: A self-supervised vision transformer trained via self-distillation; used as a frozen feature extractor.
- Latent dimension / feature: A single neuron in the SAE's expanded representation; high-activation latents correspond to interpretable visual concepts (e.g., wing venation, leg color).
- Species-specificity score: A metric quantifying how selectively a latent activates for one species vs. the full population; used to identify morphologically diagnostic features.
- MLLM (Multimodal Large Language Model): A large language model capable of processing both images and text; here Qwen2.5-VL-72B is used to verbalize SAE-identified part activations into natural language trait descriptions.
- Morphological trait: An observable characteristic of an organism's physical form (e.g., wing shape, antenna length, body coloration).
- BIOSCAN-5M: The large-scale source dataset of ~5 million insect specimen images used to train the SAE.
Acknowledgments
Code
- SAEV for sparse autoencoder training infrastructure.
- BioCLIP for downstream training/evaluation tooling.
Funding
This research was supported in part by NSF CAREER #2443149, NSF OAC 2118240, and an Alfred P. Sloan Foundation Fellowship. Computational resources were provided by the Ohio Supercomputer Center.
S. Record and A. East were additionally supported by NSF Award No. 242918 (EPSCOR Research Fellows: Advancing NEON-Enabled Science and Workforce Development at the University of Maine with AI) and Hatch project Award #MEO-022425 from the USDA National Institute of Food and Agriculture. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or the US Department of Agriculture.
People
We thank colleagues in the OSU NLP group for valuable feedback. This work was in part conceived at Funcapalooza.
Model Card Authors
Vardaan Pahuja
Model Card Contact
Vardaan Pahuja (vardaanpahuja@gmail.com)