Paper-to-Field Classifier (v3)
Transformer-based topic classifier for scientific paper abstracts using the OpenAlex taxonomy (4,516 topics → 245 subfields → 26 fields → 4 domains).
Performance
| Taxonomy level | Accuracy |
|---|---|
| Field (26 classes) | 86.3% |
| Domain (4 classes) | 94.4% |
Usage
from paper_classifier import PaperClassifier
classifier = PaperClassifier()
classifier.initialize()
result = classifier.classify(
    title="Attention Is All You Need",
    abstract="The dominant sequence transduction models are based on complex recurrent or convolutional neural networks...",
)
print(result)
# {
#     'topic': {'id': 10209, 'name': 'Neural Machine Translation and Sequence Models', 'score': 0.87},
#     'subfield': {'id': 1702, 'name': 'Artificial Intelligence'},
#     'field': {'id': 17, 'name': 'Computer Science', 'score': 0.95},
#     'domain': {'id': 3, 'name': 'Physical Sciences'}
# }
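The returned dictionary can be consumed directly. The helper below is a minimal sketch and not part of the package: `summarize` and its `min_field_score` threshold are hypothetical, and the result layout is assumed to match the printed example above.

```python
def summarize(result, min_field_score=0.5):
    """Return a one-line 'Domain > Field > Subfield > Topic' summary.

    Assumes the dict layout shown in the example output above.
    `min_field_score` is an illustrative threshold, not a package default.
    """
    levels = ("domain", "field", "subfield", "topic")
    path = " > ".join(result[level]["name"] for level in levels)
    field_score = result["field"]["score"]
    # Flag predictions whose field score falls below the chosen threshold.
    flag = "" if field_score >= min_field_score else " (low confidence)"
    return f"{path} [field score {field_score:.2f}]{flag}"


# Same values as the example output above.
example = {
    "topic": {"id": 10209, "name": "Neural Machine Translation and Sequence Models", "score": 0.87},
    "subfield": {"id": 1702, "name": "Artificial Intelligence"},
    "field": {"id": 17, "name": "Computer Science", "score": 0.95},
    "domain": {"id": 3, "name": "Physical Sciences"},
}
print(summarize(example))
```

Only `field` and `topic` carry a `score` in the example output, so the sketch thresholds on the field score alone.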
Model Details
- Architecture: BioM-ELECTRA-Large (~335M params) fine-tuned for 26-class field classification
- Fine-tuned on: ~200K paper abstracts with DeepSeek-verified field labels (domain-balanced)
- Label quality: Training labels were verified by a DeepSeek LLM and replace the original OpenAlex labels, which were noisy (~50% error rate)
- Taxonomy: OpenAlex (4,516 topics, 245 subfields, 26 fields, 4 domains)
- Input: Paper title + abstract (tokenizer truncates at 384 tokens)
- Field prediction: Classification head (26 classes with sqrt-weighted cross-entropy for class imbalance)
- Topic resolution: [CLS] embeddings + FAISS nearest-neighbor within predicted field
- Hardware: GPU recommended for inference; CPU works but is slower
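The two-stage prediction listed above (field from the classification head, then topic by nearest-neighbor search restricted to that field) can be sketched roughly as follows. This is an illustration under assumptions, not the package's code: a brute-force cosine search in pure Python stands in for the FAISS index, 3-dimensional vectors stand in for [CLS] embeddings, and `resolve_topic`, `TOPIC_INDEX`, the two placeholder topics, and field id 99 are all hypothetical.

```python
import math

# Hypothetical toy index rows: (topic_id, topic_name, field_id, embedding).
# Real [CLS] embeddings would be much higher-dimensional.
TOPIC_INDEX = [
    (10209, "Neural Machine Translation and Sequence Models", 17, [0.9, 0.1, 0.0]),
    (90001, "Placeholder topic in field 17", 17, [0.1, 0.9, 0.0]),
    (90002, "Placeholder topic in field 99", 99, [1.0, 0.0, 0.0]),
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def resolve_topic(query_embedding, predicted_field_id, index=TOPIC_INDEX):
    """Pick the most similar topic among those belonging to the predicted field."""
    candidates = [row for row in index if row[2] == predicted_field_id]
    if not candidates:
        return None
    topic_id, name, _, emb = max(
        candidates, key=lambda row: cosine(query_embedding, row[3])
    )
    return {"id": topic_id, "name": name, "score": cosine(query_embedding, emb)}


# Topic 90002 is the closest vector overall, but it lies outside field 17,
# so restricting the search to the predicted field returns topic 10209.
print(resolve_topic([1.0, 0.0, 0.0], predicted_field_id=17))
```

Restricting the search to the predicted field keeps the topic consistent with the field label, at the cost of inheriting any field-level error.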
Training
Trained on ~200K domain-balanced paper abstracts from OpenAlex bulk data. Field labels were re-annotated with a DeepSeek LLM, and only annotations with confidence >= 0.7 were kept.
Hyperparameters: lr=1e-5, cosine schedule, batch=32 (grad accum 2 = effective 64), epochs=8, warmup=6%, label smoothing=0.1, fp16, early stopping (patience 5), sqrt inverse-frequency class weights.
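To illustrate the sqrt inverse-frequency class weights mentioned above, here is a minimal sketch. The exact formula and normalization used in training are not stated, so this assumes a weight of sqrt(N / n_c) per class, rescaled to mean 1; the function name and the toy labels are hypothetical.

```python
import math
from collections import Counter


def sqrt_inverse_frequency_weights(labels):
    """Return {class: weight} with weight proportional to sqrt(N / n_c).

    Assumption: weights are rescaled so their mean over classes is 1.
    """
    counts = Counter(labels)
    total = len(labels)
    raw = {cls: math.sqrt(total / n) for cls, n in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {cls: w / mean for cls, w in raw.items()}


# Toy, made-up label distribution: one class is 9x rarer than the other.
labels = ["cs"] * 900 + ["math"] * 100
weights = sqrt_inverse_frequency_weights(labels)

# The rarer class is up-weighted by about sqrt(9) = 3, not by 9,
# which is the damping effect of the square root.
print(weights["math"] / weights["cs"])
```

The square root softens plain inverse-frequency weighting, so rare fields are boosted without letting a few small classes dominate the loss. In a PyTorch setup such weights would typically be passed to the loss as a per-class tensor.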
Install
pip install paper-classifier
Or from source:
git clone https://github.com/jimnoneill/paper-to-field.git
cd paper-to-field
pip install -e .