Paper-to-Field Classifier (v3)

Transformer-based topic classifier for scientific paper abstracts using the OpenAlex taxonomy (4,516 topics → 245 subfields → 26 fields → 4 domains).
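
The four-level hierarchy can be read as a chain of parent lookups: each topic belongs to one subfield, each subfield to one field, and each field to one domain. The sketch below is illustrative only. The three lookup tables and the `roll_up` helper are hypothetical names, not part of the package, and the only IDs filled in are the ones that appear in the usage example in this README.

```python
# Hypothetical parent-lookup tables (not the package's real data structures).
# Only the IDs from this README's usage example are filled in; a full table
# would cover all 4,516 topics, 245 subfields and 26 fields.
TOPIC_TO_SUBFIELD = {10209: 1702}  # Neural Machine Translation ... -> Artificial Intelligence
SUBFIELD_TO_FIELD = {1702: 17}     # Artificial Intelligence -> Computer Science
FIELD_TO_DOMAIN = {17: 3}          # Computer Science -> Physical Sciences


def roll_up(topic_id: int) -> dict:
    """Walk topic -> subfield -> field -> domain via parent lookups."""
    subfield_id = TOPIC_TO_SUBFIELD[topic_id]
    field_id = SUBFIELD_TO_FIELD[subfield_id]
    domain_id = FIELD_TO_DOMAIN[field_id]
    return {
        "topic": topic_id,
        "subfield": subfield_id,
        "field": field_id,
        "domain": domain_id,
    }


print(roll_up(10209))
# {'topic': 10209, 'subfield': 1702, 'field': 17, 'domain': 3}
```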

Performance

Classification level	Accuracy
Field (26 classes)	86.3%
Domain (4 classes)	94.4%

Usage

from paper_classifier import PaperClassifier

classifier = PaperClassifier()
classifier.initialize()

result = classifier.classify(
    title="Attention Is All You Need",
    abstract="The dominant sequence transduction models are based on complex recurrent or convolutional neural networks..."
)

print(result)
# {
#   'topic': {'id': 10209, 'name': 'Neural Machine Translation and Sequence Models', 'score': 0.87},
#   'subfield': {'id': 1702, 'name': 'Artificial Intelligence'},
#   'field': {'id': 17, 'name': 'Computer Science', 'score': 0.95},
#   'domain': {'id': 3, 'name': 'Physical Sciences'}
# }

Model Details

Architecture: BioM-ELECTRA-Large (~335M params) fine-tuned for 26-class field classification
Fine-tuned on: ~200K paper abstracts with DeepSeek-verified field labels (domain-balanced)
Label quality: Training labels verified by DeepSeek LLM, replacing noisy OpenAlex labels (~50% error rate)
Taxonomy: OpenAlex (4,516 topics, 245 subfields, 26 fields, 4 domains)
Input: Paper title + abstract (tokenizer truncates at 384 tokens)
Field prediction: Classification head (26 classes, trained with sqrt-weighted cross-entropy to counter class imbalance)
Topic resolution: [CLS] embeddings + FAISS nearest-neighbor within predicted field
Hardware: GPU recommended for inference (CPU works but is slower)
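
The topic-resolution step listed above (nearest neighbor restricted to the predicted field) can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's implementation: it swaps FAISS for a brute-force search in plain Python, assumes cosine similarity as the metric, and uses toy 3-dimensional vectors in place of real [CLS] embeddings. The `resolve_topic` function, the `TOPICS` table, and the topic IDs 101, 102 and 201 are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def resolve_topic(query_emb, topics, predicted_field):
    """Return (topic_id, score) of the most similar topic inside predicted_field.

    Brute-force stand-in for a FAISS index lookup; the metric is an assumption.
    """
    candidates = [t for t in topics if t["field"] == predicted_field]
    if not candidates:
        raise ValueError(f"no topics registered for field {predicted_field}")
    best = max(candidates, key=lambda t: cosine(query_emb, t["emb"]))
    return best["id"], cosine(query_emb, best["emb"])


# Toy topic table: hypothetical IDs and 3-d vectors standing in for embeddings.
TOPICS = [
    {"id": 101, "field": 17, "emb": [1.0, 0.0, 0.0]},
    {"id": 102, "field": 17, "emb": [0.0, 1.0, 0.0]},
    {"id": 201, "field": 27, "emb": [0.9, 0.1, 0.0]},
]

query = [0.9, 0.1, 0.0]
topic_id, score = resolve_topic(query, TOPICS, predicted_field=17)
print(topic_id, round(score, 3))
```

In this toy example topic 201 is the closest vector overall, but it sits in a different field, so the search returns topic 101. Restricting the search to the predicted field keeps the topic consistent with the field-level prediction, so the topic can only be as good as the field prediction it builds on.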

Training

Trained on ~200K domain-balanced paper abstracts from OpenAlex bulk data. Field labels were re-annotated with the DeepSeek LLM, and only labels with confidence >= 0.7 were kept.

Hyperparameters: lr=1e-5, cosine schedule, warmup=6%, batch size 32 with 2 gradient-accumulation steps (effective batch size 64), epochs=8, label smoothing=0.1, fp16, early stopping (patience 5), sqrt inverse-frequency class weights.
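
Two of these settings are easier to follow with numbers. The sketch below is an illustration, not the training script. It assumes class weights proportional to 1/sqrt(count) and rescaled to a mean of 1 (the normalization is a guess), and a schedule with linear warmup followed by cosine decay to zero. The step count assumes all ~200K abstracts are used for training and that early stopping does not trigger: 200,000 / 64 = 3,125 steps per epoch, so 25,000 steps over 8 epochs. Function names are hypothetical.

```python
import math


def sqrt_inverse_frequency_weights(class_counts):
    """Weights proportional to 1/sqrt(count), rescaled so the mean weight is 1.

    The rescaling is an assumption; the README only names the weighting scheme.
    """
    raw = [1.0 / math.sqrt(n) for n in class_counts]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


def lr_at_step(step, total_steps, peak_lr=1e-5, warmup_frac=0.06):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = round(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# A class with 4x the examples gets half the weight, not a quarter.
weights = sqrt_inverse_frequency_weights([100, 400])
print([round(w, 3) for w in weights])

# Approximate schedule length under the assumptions stated above.
total_steps = (200_000 // 64) * 8  # 3,125 steps per epoch * 8 epochs
for step in (0, 750, 1500, 13250, 25000):
    print(step, lr_at_step(step, total_steps))
```

Taking the square root softens the correction relative to plain inverse-frequency weighting, so rare fields are upweighted without letting a handful of examples dominate the loss.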

Install

pip install paper-classifier

Or from source:

git clone https://github.com/jimnoneill/paper-to-field.git
cd paper-to-field
pip install -e .
