deep-stylometry-modernbert-mean
ModernBERT-base with mean-pooling for contrastive authorship attribution, fine-tuned on HALvest-Contrastive.
Model description
This model is a ModernBERT-base encoder with a two-layer projection head, fine-tuned with InfoNCE contrastive loss on the HALvest-Contrastive authorship attribution benchmark.
Interaction method: Single-vector mean pooling over all ModernBERT token embeddings, followed by cosine similarity. This is the simplest baseline: each text is represented by a single 768-dimensional vector.
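The pooling and scoring step described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the code from the deep_stylometry repository: the helper names `mean_pool` and `cosine_scores` are hypothetical, and the toy tensors stand in for real ModernBERT outputs. It assumes padding positions are excluded from the average via the attention mask.

```python
import torch
import torch.nn.functional as F


def mean_pool(token_embs: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over non-padding positions.

    token_embs: (batch, seq_len, dim)
    mask:       (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = mask.unsqueeze(-1).to(token_embs.dtype)  # (batch, seq_len, 1)
    summed = (token_embs * mask).sum(dim=1)         # (batch, dim)
    counts = mask.sum(dim=1).clamp(min=1.0)         # guard against empty rows
    return summed / counts


def cosine_scores(query_vecs: torch.Tensor, key_vecs: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between every query and every key vector."""
    q = F.normalize(query_vecs, dim=-1)
    k = F.normalize(key_vecs, dim=-1)
    return q @ k.T                                  # (n_queries, n_keys)


# Toy stand-in for encoder output: 3 texts, 5 tokens each, 768 dimensions.
torch.manual_seed(0)
token_embs = torch.randn(3, 5, 768)
mask = torch.tensor([[1, 1, 1, 1, 1],
                     [1, 1, 1, 0, 0],
                     [1, 1, 0, 0, 0]])

vecs = mean_pool(token_embs, mask)          # one vector per text
scores = cosine_scores(vecs[:1], vecs[1:])  # query vs. the two candidates
print(vecs.shape, scores.shape)
```

Because the mask zeroes out padded positions before averaging, the length of the padding does not affect the resulting vector.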
Intended use
Authorship attribution and authorship verification on English text. Given a query text and a set of candidate texts, the model scores each candidate by its similarity to the query; a higher score indicates that the candidate is more likely to share the query's author.
How to use
This checkpoint is a PyTorch Lightning .ckpt file. You need the deep_stylometry codebase to load it.
Installation
git clone https://github.com/Madjakul/deep_stylometry.git
cd deep_stylometry
pip install -r requirements.txt
Minimal inference
import torch
from transformers import AutoTokenizer
from deep_stylometry.modules.modeling_deep_stylometry import DeepStylometry
from deep_stylometry.utils.configs import BaseConfig
# 1. Load model from checkpoint
cfg = BaseConfig(mode="test").from_yaml("configs/test_mean.yml")
model = DeepStylometry.load_from_checkpoint("last.ckpt", cfg=cfg)
model.eval()
# 2. Tokenize
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
texts = [
    "Query text whose authorship you want to identify.",
    "Candidate A: a text by the same author.",
    "Candidate B: a text by a different author.",
]
enc = tokenizer(
    texts, padding=True, truncation=True, max_length=512, return_tensors="pt",
)
# 3. Encode all texts through the model
with torch.no_grad():
    embs = model(enc["input_ids"], enc["attention_mask"])  # (3, seq_len, 768)
# 4. Score the query against each candidate
pool = model.contrastive_loss.pool # the interaction module
with torch.no_grad():
    scores = pool(
        query_embs=embs[:1],
        key_embs=embs[1:],
        q_mask=enc["attention_mask"][:1],
        k_mask=enc["attention_mask"][1:],
    )
# scores shape: (1, 2) -- similarity of the query to [candidate A, candidate B]
# Higher score = more likely same author.
print(scores)
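Once the scores are available, attribution amounts to ranking the candidates, and verification amounts to comparing a score against a threshold. The sketch below shows both on plain Python floats. The helper names, the score values, and the 0.5 threshold are all hypothetical; a real threshold would need to be calibrated on held-out data.

```python
def rank_candidates(scores, labels):
    """Return (label, score) pairs sorted from most to least similar."""
    return sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)


def same_author(score, threshold):
    """Verification decision: accept when the score reaches the threshold."""
    return score >= threshold


# Hypothetical similarity values, for illustration only.
example_scores = [0.71, 0.18]
labels = ["candidate A", "candidate B"]

ranking = rank_candidates(example_scores, labels)
best_label, best_score = ranking[0]

THRESHOLD = 0.5  # hypothetical; calibrate on held-out pairs before use
print(ranking)
print(best_label, same_author(best_score, THRESHOLD))
```

With the `scores` tensor from the inference snippet, `scores[0].tolist()` gives a list that can be passed in place of `example_scores`.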
Training details
The model was trained on HALvest-Contrastive (English-only contrastive triplets derived from the HAL open-access scholarly repository) using InfoNCE loss with in-batch negatives. Training used 4x H100 GPUs with DDP, mixed precision (fp16), and a cosine learning rate schedule with linear warmup.
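For readers unfamiliar with InfoNCE and in-batch negatives, the objective can be sketched as a cross-entropy over a batch similarity matrix, where the positive for each query sits on the diagonal and every other key in the batch acts as a negative. This is an illustrative sketch, not the repository's loss module: the function name `info_nce` and the temperature of 0.07 are assumptions, and the actual hyperparameters are in the training configuration.

```python
import torch
import torch.nn.functional as F


def info_nce(query_vecs: torch.Tensor, key_vecs: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE with in-batch negatives.

    Row i of key_vecs is the positive for row i of query_vecs; all other
    rows in the batch serve as negatives. The temperature is hypothetical.
    """
    q = F.normalize(query_vecs, dim=-1)
    k = F.normalize(key_vecs, dim=-1)
    logits = (q @ k.T) / temperature                      # (batch, batch)
    targets = torch.arange(q.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)


# Toy batch: each positive is a slightly perturbed copy of its query.
torch.manual_seed(0)
queries = torch.randn(4, 768)
positives = queries + 0.1 * torch.randn(4, 768)

loss_aligned = info_nce(queries, positives)
# Shifting the keys by one row misaligns every positive pair.
loss_shuffled = info_nce(queries, positives.roll(shifts=1, dims=0))
print(loss_aligned.item(), loss_shuffled.item())
```

The loss is low when each query is closest to its own positive and rises sharply when the pairs are misaligned, which is the behavior the contrastive objective rewards.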
See the training configuration for full hyperparameters.
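The cosine schedule with linear warmup mentioned above can be written as a small function of the step index. This is a generic sketch of that schedule shape, not the training code: the function name `lr_at_step` and the values used below (peak rate, warmup length, total steps) are hypothetical and do not come from the training configuration.

```python
import math


def lr_at_step(step: int, base_lr: float, warmup_steps: int,
               total_steps: int, min_lr: float = 0.0) -> float:
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Linear ramp from 0 to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine


# Hypothetical settings, for illustration only.
BASE_LR, WARMUP, TOTAL = 1e-4, 100, 1000
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, BASE_LR, WARMUP, TOTAL))
```

The rate peaks at the end of warmup, reaches half of its peak midway through the decay phase, and ends at `min_lr`.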
Citation
@misc{kulumba_halvest_2026,
title={HALvest-Contrastive: Retrieval-Like Authorship Attribution with Patch-Level Late Interaction},
author={Francis Kulumba and Wissam Antoun and Guillaume Vimont and Laurent Romary and Florian Cafiero},
year={2026},
eprint={2407.20595},
archivePrefix={arXiv},
primaryClass={cs.DL},
url={https://arxiv.org/abs/2407.20595},
}
@misc{kulumba_does_2026,
title={Where Does Authorship Signal Emerge in Encoder-Based Language Models?},
author={Francis Kulumba and Guillaume Vimont and Laurent Romary and Florian Cafiero},
year={2026},
eprint={2605.19908},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2605.19908},
}