ProtSent ESM-2 150M

Contrastively fine-tuned ESM-2 150M protein language model, producing fixed-length embeddings where biological similarity maps to embedding proximity.

Paper: ProtSent: Protein Sentence Transformers
Code: github.com/oriel9p/ProtSent
35M model: oriel9p/protsent-esm2-35M

Training

ProtSent applies contrastive fine-tuning using the SentenceTransformers framework with MultipleNegativesRankingLoss (MNRL) and CoSENT on ESM-2 backbones.

This model was trained on five complementary data sources with round-robin sampling:

| Dataset | Rows/Pairs | Loss |
|---|---|---|
| Pfam families (linclust@70%) | 32.9M domains | MNRL |
| Pfam hard negatives (HMM-derived) | 1.8M anchors | MNRL |
| AlphaFold DB structural pairs (Foldseek-grouped) | 133.9M sequences | MNRL |
| STRING-DB v12 PPI (score >= 400) | 36.5M pairs | MNRL |
| ProteinGym DMS / clinical | 2.2M pairs | CoSENT |

Key hyperparameters: AdamW optimizer, cosine LR schedule, batch size 1024, temperature 0.05, dropout 0.1. Trained on a single NVIDIA RTX 6000 Ada 48GB in ~1.3 days.
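
As a rough illustration only, a minimal MNRL fine-tuning loop in the SentenceTransformers framework might look like the sketch below. The backbone identifier, the toy pair and batch size, and the mapping of temperature 0.05 onto MNRL's scale parameter (scale = 1/temperature = 20) are assumptions for illustration, not the authors' training script.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Assumption: the ESM-2 150M backbone is loaded with SentenceTransformers'
# default mean pooling; the toy pair below stands in for the datasets above.
model = SentenceTransformer("facebook/esm2_t30_150M_UR50D")

train_examples = [
    InputExample(texts=["MKTLLLTLVVVTIVCLDLGYT", "MKTLLLTLVVVTIVCLDLGYN"]),  # (anchor, positive)
]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)  # paper: 1024

# MNRL treats the other in-batch positives as negatives. Mapping the reported
# temperature 0.05 to MNRL's scale (scale = 1/temperature = 20) is an assumption.
loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

# AdamW is SentenceTransformers' default optimizer; "warmupcosine" gives a
# cosine LR schedule as described above.
model.fit(train_objectives=[(loader, loss)], epochs=1, scheduler="warmupcosine")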

Quick Start

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("oriel9p/protsent-esm2-150M")

sequences = [
    "MKTLLLTLVVVTIVCLDLGYT",
    "MKTLLLTLVVVTIVCLDLGYN",  # similar
    "AGWYRSPQEGLKPVDTFKDIV",  # different
]

embeddings = model.encode(sequences)
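# embeddings: NumPy array with one fixed-length vector per sequence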

Compute similarity:

from sentence_transformers.util import cos_sim

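# cosine similarity of the first sequence vs. the other two (1x2 tensor)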
similarities = cos_sim(embeddings[0], embeddings[1:])
print(similarities)

Results

KNN probe (k=3, Euclidean distance) evaluation on 23 downstream tasks. ProtSent 150M improves on 15 of the 23 tasks over the baseline ESM-2 150M.
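
The probe is simple to reproduce. Below is a minimal sketch with scikit-learn; the random arrays stand in for real embeddings and labels, and the 640-dimensional width (ESM-2 150M's hidden size) is shown for illustration.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Stand-ins for model.encode() outputs on a labeled train/test split.
rng = np.random.default_rng(0)
X_train, y_train = rng.standard_normal((100, 640)), rng.integers(0, 5, 100)
X_test, y_test = rng.standard_normal((20, 640)), rng.integers(0, 5, 20)

# k=3 nearest neighbors under Euclidean distance, as in the evaluation.
probe = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
probe.fit(X_train, y_train)
print(probe.score(X_test, y_test))  # probe accuracy on the test split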

Selected highlights vs. baseline ESM-2 150M:

| Task | Metric | Baseline | ProtSent | Change |
|---|---|---|---|---|
| Remote Homology (Fold) | F1 Macro | 0.190 | 0.390 | +105.0% |
| Variant Effect (GB1) | Spearman | 0.670 | 0.785 | +17.3% |
| EC Classification | F1 Macro | 0.408 | 0.473 | +15.9% |
| Fluorescence (TAPE) | Spearman | 0.504 | 0.569 | +12.7% |
| PPI (Bernett) | AUC | 0.556 | 0.592 | +6.4% |

SCOPe-40 Structural Retrieval

| Metric | Baseline | ProtSent | Change |
|---|---|---|---|
| Recall@1 | 0.423 | 0.507 | +19.9% |
| Recall@10 | 0.589 | 0.685 | +16.4% |
| Recall@30 | 0.644 | 0.724 | +12.3% |
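
Recall@k measures how often an entry from the correct structural family appears among the top k retrieved neighbors. A minimal retrieval sketch using sentence_transformers.util.semantic_search is shown below; the corpus and query sequences are placeholders, not SCOPe data.

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search

model = SentenceTransformer("oriel9p/protsent-esm2-150M")

corpus = ["MKTLLLTLVVVTIVCLDLGYT", "AGWYRSPQEGLKPVDTFKDIV"]  # placeholder database
queries = ["MKTLLLTLVVVTIVCLDLGYN"]                          # placeholder query

corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(queries, convert_to_tensor=True)

# Top-k cosine-similarity hits per query; Recall@k asks whether a
# same-family entry appears among them.
hits = semantic_search(query_emb, corpus_emb, top_k=2)
print(hits[0])  # [{'corpus_id': ..., 'score': ...}, ...]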

Intended Use

General-purpose protein embeddings for downstream tasks including classification, regression, retrieval, clustering, and similarity search. The embeddings capture evolutionary, structural, and functional relationships. Of the ProtSent models, the 150M variant offers the strongest retrieval and structural classification performance.
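
As one example, clustering a set of sequences reduces to clustering their embeddings. The sketch below uses scikit-learn's KMeans; the sequence list and cluster count are placeholders.

from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("oriel9p/protsent-esm2-150M")
sequences = [
    "MKTLLLTLVVVTIVCLDLGYT",
    "MKTLLLTLVVVTIVCLDLGYN",
    "AGWYRSPQEGLKPVDTFKDIV",
]

# Normalized embeddings so Euclidean k-means roughly tracks cosine similarity.
embeddings = model.encode(sequences, normalize_embeddings=True)
labels = KMeans(n_clusters=2, n_init="auto", random_state=0).fit_predict(embeddings)
print(labels)  # e.g. [0 0 1]: the two similar sequences share a cluster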

Citation

@article{ofer2026protsent,
  title={ProtSent: Protein Sentence Transformers},
  author={Ofer, Dan and Perets, Oriel and Linial, Michal and Rappoport, Nadav},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}