---
license: llama3
datasets:
- ibm-esa-geospatial/Llama3-SSL4EO-S12-v1.1-captions
- embed2scale/SSL4EO-S12-v1.1
language:
- en
base_model:
- meta-llama/Meta-Llama-3-8B
- facebook/dinov3-vit7b16-pretrain-sat493m
- McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-unsup-simcse
tags:
- earth-observation
- satellite-imagery
- remote-sensing
---
# SATtxt: Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery

Minh Kha Do, Wei Xiang, Kang Han, Di Wu, Khoa Phan, Yi-Ping Phoebe Chen, Gaowen Liu, Ramana Rao Kompella

La Trobe University, Cisco Research
## Overview
SATtxt is a vision-language foundation model for satellite imagery. We train only the projection heads, keeping both encoders frozen.
| Component | Backbone | Parameters |
|---|---|---|
| Vision Encoder | DINOv3 ViT-L/16 | Frozen |
| Text Encoder | LLM2Vec Llama-3-8B | Frozen |
| Vision Head | Transformer Projection | Trained |
| Text Head | Linear Projection | Trained |
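The table above can be sketched as a training setup in PyTorch. This is a hypothetical illustration of the "frozen encoders, trainable heads" pattern, not the actual SATtxt code: the `nn.Linear` encoders, the dimensions, and the head definitions are placeholders standing in for DINOv3, LLM2Vec, and the real projection modules.

```python
import torch
import torch.nn as nn

# Placeholder encoders: in SATtxt these are DINOv3 ViT-L/16 (vision) and
# LLM2Vec Llama-3-8B (text); the dimensions here are illustrative only.
vision_encoder = nn.Linear(1024, 1024)
text_encoder = nn.Linear(4096, 4096)

# Freeze both encoders so they receive no gradient updates.
for encoder in (vision_encoder, text_encoder):
    for p in encoder.parameters():
        p.requires_grad = False

# Trainable heads project both modalities into a shared embedding space.
embed_dim = 512
vision_head = nn.Sequential(nn.Linear(1024, embed_dim))  # stands in for the Transformer projection
text_head = nn.Linear(4096, embed_dim)                   # linear projection

# Only the head parameters are handed to the optimizer.
trainable = [p for m in (vision_head, text_head) for p in m.parameters()]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```

Freezing the encoders keeps the trainable parameter count small relative to the 8B-parameter text backbone, which is the point of training only the projection heads.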
## Installation

```bash
git clone https://github.com/ikhado/sattxt.git
cd sattxt
pip install -r requirements.txt
pip install flash-attn --no-build-isolation  # required for LLM2Vec
```
## Model Weights

Download the required weights:
| Component | Source |
|---|---|
| DINOv3 ViT-L/16 | facebookresearch/dinov3 → dinov3_vitl16_pretrain_sat493m.pth |
| LLM2Vec | McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-unsup-simcse |
| Vision Head | sattxt_vision_head.pt |
| Text Head | sattxt_text_head.pt |
Clone DINOv3 into the `thirdparty` folder:

```bash
cd thirdparty && git clone https://github.com/facebookresearch/dinov3.git
```
## Quick Start

```python
import sys
from pathlib import Path

import torch

# Make the vendored DINOv3 repo importable.
sys.path.insert(0, str(Path(__file__).resolve().parent / "thirdparty" / "dinov3"))

from sattxt.model import SATtxt
from sattxt.utils import image_loader, get_preprocess, zero_shot_classify

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = SATtxt(
    dinov3_weights_path="/PATH/TO/dinov3_vitl16_pretrain_sat493m-eadcf0ff.pth",
    sattxt_vision_head_pretrain_weights="/PATH/TO/sattxt_vision_head.pt",
    text_encoder_id="McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp",
    sattxt_text_head_pretrain_weights="/PATH/TO/sattxt_text_head.pt",
).to(device).eval()

# EuroSAT land-cover classes used as the zero-shot label set.
categories = [
    "AnnualCrop", "Forest", "HerbaceousVegetation", "Highway", "Industrial",
    "Pasture", "PermanentCrop", "Residential", "River", "SeaLake",
]

image = image_loader("./asset/Residential_167.jpg")
image_tensor = get_preprocess(is_ms=False, all_bands=False)(image).unsqueeze(0).to(device)

logits, pred_idx = zero_shot_classify(model, image_tensor, categories)
print("Prediction:", categories[pred_idx.item()])
```

See `demo.py` for more details.
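To clarify what a zero-shot classifier of this kind computes, the sketch below reproduces the core logic with random tensors standing in for real SATtxt embeddings: the image embedding and the per-category text embeddings are L2-normalized, and the predicted class is the one with the highest cosine similarity. This is an illustration of the general technique, not the actual `zero_shot_classify` implementation, and the embedding dimension is a hypothetical choice.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
embed_dim = 512  # hypothetical; stands in for the real shared embedding size

# Random stand-ins for one image embedding and ten category-name embeddings.
image_emb = F.normalize(torch.randn(1, embed_dim), dim=-1)
text_embs = F.normalize(torch.randn(10, embed_dim), dim=-1)

# With unit-norm vectors, the dot product is the cosine similarity.
logits = image_emb @ text_embs.T   # shape (1, 10): one score per category
pred_idx = logits.argmax(dim=-1)   # index of the best-matching category
```

The returned `pred_idx` plays the same role as in the Quick Start snippet: it indexes into the `categories` list to name the predicted class.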
## Citation

```bibtex
@misc{do2026sattxt,
  title={Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery},
  author={Minh Kha Do and Wei Xiang and Kang Han and Di Wu and Khoa Phan and Yi-Ping Phoebe Chen and Gaowen Liu and Ramana Rao Kompella},
  year={2026},
  eprint={2602.22613},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.22613},
}
```
## Acknowledgements

- We pretrained the model with the Lightning-Hydra-Template.
- We use evaluation scripts from MS-CLIP and Pangaea-Bench.
- We used LLMs (such as ChatGPT and Claude) for code refactoring.

We welcome contributions and issues to further improve SATtxt.