# ProtSent ESM-2 150M
Contrastively fine-tuned ESM-2 150M protein language model, producing fixed-length embeddings where biological similarity maps to embedding proximity.
- Paper: ProtSent: Protein Sentence Transformers
- Code: github.com/oriel9p/ProtSent
- 35M model: oriel9p/protsent-esm2-35M
## Training
ProtSent applies contrastive fine-tuning to ESM-2 backbones using the SentenceTransformers framework with MultipleNegativesRankingLoss (MNRL) and CoSENT losses.
This model was trained on five complementary data sources with round-robin sampling:
| Dataset | Size | Loss |
|---|---|---|
| Pfam families (linclust@70%) | 32.9M domains | MNRL |
| Pfam hard negatives (HMM-derived) | 1.8M anchors | MNRL |
| AlphaFold DB structural pairs (Foldseek-grouped) | 133.9M sequences | MNRL |
| STRING-DB v12 PPI (score >= 400) | 36.5M pairs | MNRL |
| ProteinGym DMS / clinical | 2.2M pairs | CoSENT |
Key hyperparameters: AdamW optimizer, cosine LR schedule, batch size 1024, temperature 0.05, dropout 0.1. Trained on a single NVIDIA RTX 6000 Ada 48GB in ~1.3 days.
## Quick Start
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("oriel9p/protsent-esm2-150M")

sequences = [
    "MKTLLLTLVVVTIVCLDLGYT",
    "MKTLLLTLVVVTIVCLDLGYN",  # similar to the first
    "AGWYRSPQEGLKPVDTFKDIV",  # unrelated
]

embeddings = model.encode(sequences)  # one fixed-length vector per sequence
```
Compute similarity:
```python
from sentence_transformers.util import cos_sim

# Cosine similarity of the first sequence against the other two
similarities = cos_sim(embeddings[0], embeddings[1:])
print(similarities)
```
## Results
Evaluation uses a KNN probe (k=3, Euclidean distance) over frozen embeddings on 23 downstream tasks. ProtSent 150M improves over the baseline ESM-2 150M on 15 of the 23 tasks.
Selected highlights vs. baseline ESM-2 150M:
| Task | Metric | Baseline | ProtSent | Change |
|---|---|---|---|---|
| Remote Homology (Fold) | F1 Macro | 0.190 | 0.390 | +105.0% |
| Variant Effect (GB1) | Spearman | 0.670 | 0.785 | +17.3% |
| EC Classification | F1 Macro | 0.408 | 0.473 | +15.9% |
| Fluorescence (TAPE) | Spearman | 0.504 | 0.569 | +12.7% |
| PPI (Bernett) | AUC | 0.556 | 0.592 | +6.4% |
### SCOPe-40 Structural Retrieval
| Metric | Baseline | ProtSent | Change |
|---|---|---|---|
| Recall@1 | 0.423 | 0.507 | +19.9% |
| Recall@10 | 0.589 | 0.685 | +16.4% |
| Recall@30 | 0.644 | 0.724 | +12.3% |
## Intended Use
General-purpose protein embeddings for downstream tasks including classification, regression, retrieval, clustering, and similarity search. The embeddings capture evolutionary, structural, and functional relationships. The 150M model offers the strongest retrieval and structural classification performance.
## Citation
```bibtex
@article{ofer2026protsent,
  title={ProtSent: Protein Sentence Transformers},
  author={Ofer, Dan and Perets, Oriel and Linial, Michal and Rappoport, Nadav},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}
```