# ProtSent ESM-2 150M
Contrastively fine-tuned ESM-2 150M protein language model, producing fixed-length embeddings where biological similarity maps to embedding proximity.
- Paper: ProtSent: Protein Sentence Transformers
- Code: github.com/oriel9p/ProtSent
- 35M model: oriel9p/protsent-esm2-35M
## Training
ProtSent applies contrastive fine-tuning to ESM-2 backbones using the SentenceTransformers framework with MultipleNegativesRankingLoss (MNRL) and CoSENT losses.
This model was trained on five complementary data sources with round-robin sampling:
| Dataset | Size | Loss |
|---|---|---|
| Pfam families (linclust@70%) | 32.9M domains | MNRL |
| Pfam hard negatives (HMM-derived) | 1.8M anchors | MNRL |
| AlphaFold DB structural pairs (Foldseek-grouped) | 133.9M sequences | MNRL |
| STRING-DB v12 PPI (score >= 400) | 36.5M pairs | MNRL |
| ProteinGym DMS / clinical | 2.2M pairs | CoSENT |
Key hyperparameters: AdamW optimizer, cosine LR schedule, batch size 1024, temperature 0.05, dropout 0.1. Trained on a single NVIDIA RTX 6000 Ada 48GB in ~1.3 days.
## Quick Start
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("oriel9p/protsent-esm2-150M")

sequences = [
    "MKTLLLTLVVVTIVCLDLGYT",
    "MKTLLLTLVVVTIVCLDLGYN",  # similar to the first
    "AGWYRSPQEGLKPVDTFKDIV",  # unrelated
]

embeddings = model.encode(sequences)  # one fixed-length vector per sequence
```
Compute similarity:
```python
from sentence_transformers.util import cos_sim

# Cosine similarity of the first sequence against the other two
similarities = cos_sim(embeddings[0], embeddings[1:])
print(similarities)
```
## Results
Evaluation uses a KNN probe (k=3, Euclidean distance) over frozen embeddings on 23 downstream tasks. ProtSent 150M improves over the baseline ESM-2 150M on 15 of the 23 tasks.
Selected highlights vs. baseline ESM-2 150M:
| Task | Metric | Baseline | ProtSent | Change |
|---|---|---|---|---|
| Remote Homology (Fold) | F1 Macro | 0.190 | 0.390 | +105.0% |
| Variant Effect (GB1) | Spearman | 0.670 | 0.785 | +17.3% |
| EC Classification | F1 Macro | 0.408 | 0.473 | +15.9% |
| Fluorescence (TAPE) | Spearman | 0.504 | 0.569 | +12.7% |
| PPI (Bernett) | AUC | 0.556 | 0.592 | +6.4% |
### SCOPe-40 Structural Retrieval
| Metric | Baseline | ProtSent | Change |
|---|---|---|---|
| Recall@1 | 0.423 | 0.507 | +19.9% |
| Recall@10 | 0.589 | 0.685 | +16.4% |
| Recall@30 | 0.644 | 0.724 | +12.3% |
## Intended Use
General-purpose protein embeddings for downstream tasks including classification, regression, retrieval, clustering, and similarity search. The embeddings capture evolutionary, structural, and functional relationships. The 150M model offers the strongest retrieval and structural classification performance.
## Citation
```bibtex
@article{ofer2026protsent,
  title={ProtSent: Protein Sentence Transformers},
  author={Ofer, Dan and Perets, Oriel and Linial, Michal and Rappoport, Nadav},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}
```