Paper-to-Field Classifier (v3)

Transformer-based topic classifier for scientific paper abstracts using the OpenAlex taxonomy (4,516 topics → 245 subfields → 26 fields → 4 domains).

Performance

Metric               Accuracy
Field (26 classes)   86.3%
Domain (4 classes)   94.4%

Usage

from paper_classifier import PaperClassifier

classifier = PaperClassifier()
classifier.initialize()

result = classifier.classify(
    title="Attention Is All You Need",
    abstract="The dominant sequence transduction models are based on complex recurrent or convolutional neural networks..."
)

print(result)
# {
#   'topic': {'id': 10209, 'name': 'Neural Machine Translation and Sequence Models', 'score': 0.87},
#   'subfield': {'id': 1702, 'name': 'Artificial Intelligence'},
#   'field': {'id': 17, 'name': 'Computer Science', 'score': 0.95},
#   'domain': {'id': 3, 'name': 'Physical Sciences'}
# }
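
The result is a plain nested dictionary, so downstream code can read it directly. The helper below is a minimal sketch of one way to consume it: it joins the four taxonomy levels into a readable path and flags low-confidence field predictions. The function name `summarize` and the `min_field_score` threshold are hypothetical and not part of the library; the dictionary layout is taken from the example output above.

```python
def summarize(result, min_field_score=0.5):
    """Build a 'domain > field > subfield > topic' path from a classify() result.

    min_field_score is a hypothetical cutoff chosen for illustration only.
    """
    levels = ("domain", "field", "subfield", "topic")
    path = " > ".join(result[level]["name"] for level in levels)
    # Only 'field' and 'topic' carry a score in the example output.
    is_confident = result["field"]["score"] >= min_field_score
    return path, is_confident


# Same shape as the printed result above.
example = {
    "topic": {"id": 10209, "name": "Neural Machine Translation and Sequence Models", "score": 0.87},
    "subfield": {"id": 1702, "name": "Artificial Intelligence"},
    "field": {"id": 17, "name": "Computer Science", "score": 0.95},
    "domain": {"id": 3, "name": "Physical Sciences"},
}

path, is_confident = summarize(example)
print(path)
print(is_confident)
```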

Model Details

  • Architecture: BioM-ELECTRA-Large (~335M params) fine-tuned for 26-class field classification
  • Fine-tuned on: ~200K paper abstracts with DeepSeek-verified field labels (domain-balanced)
  • Label quality: Training labels were verified by a DeepSeek LLM and replace the original OpenAlex labels, which were noisy (~50% error rate)
  • Taxonomy: OpenAlex (4,516 topics, 245 subfields, 26 fields, 4 domains)
  • Input: Paper title + abstract (tokenizer truncates at 384 tokens)
  • Field prediction: Classification head (26 classes with sqrt-weighted cross-entropy for class imbalance)
  • Topic resolution: [CLS] embeddings + FAISS nearest-neighbor within predicted field
  • Hardware: GPU recommended for inference; CPU works but is slower
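
The topic resolution step listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the package's implementation: it swaps FAISS for a brute-force cosine-similarity search in pure Python so it runs without dependencies, and the index layout (`id`, `field_id`, `vector`), the function names, and the toy 2-dimensional vectors are all hypothetical. Whether the real model scores topics by cosine similarity is also an assumption.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def resolve_topic(query_vec, predicted_field_id, topic_index):
    """Return the nearest topic among those belonging to the predicted field.

    topic_index is a hypothetical list of dicts with keys
    'id', 'field_id' and 'vector' (one embedding per topic).
    """
    candidates = [t for t in topic_index if t["field_id"] == predicted_field_id]
    if not candidates:
        return None
    scored = [(cosine(query_vec, t["vector"]), t["id"]) for t in candidates]
    score, topic_id = max(scored)
    return {"id": topic_id, "score": score}


# Toy index: IDs 90001 and 90002 and all vectors are made up.
topic_index = [
    {"id": 10209, "field_id": 17, "vector": [0.9, 0.2]},
    {"id": 90001, "field_id": 17, "vector": [0.0, 1.0]},
    {"id": 90002, "field_id": 27, "vector": [1.0, 0.1]},  # other field, never considered
]

query = [1.0, 0.1]  # stands in for the [CLS] embedding of a paper
match = resolve_topic(query, predicted_field_id=17, topic_index=topic_index)
print(match["id"])
```

Restricting the search to the predicted field means a topic in another field is never returned, even when its embedding is closer to the query (topic 90002 in the toy index).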

Training

Trained on ~200K domain-balanced paper abstracts from OpenAlex bulk data, re-annotated with a DeepSeek LLM to obtain high-quality field labels. Only labels with confidence >= 0.7 were kept.
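
A minimal sketch of the data selection described above, assuming each record carries a `confidence` and a `domain` key (a hypothetical layout) and that "domain-balanced" means an equal number of samples per domain. The actual pipeline may balance differently.

```python
import random
from collections import defaultdict


def filter_and_balance(records, min_confidence=0.7, seed=0):
    """Keep confident labels, then downsample each domain to the smallest one."""
    by_domain = defaultdict(list)
    for record in records:
        if record["confidence"] >= min_confidence:
            by_domain[record["domain"]].append(record)
    if not by_domain:
        return []
    # Assumption: equal quota per domain, set by the rarest domain.
    quota = min(len(rows) for rows in by_domain.values())
    rng = random.Random(seed)
    balanced = []
    for domain in sorted(by_domain):
        balanced.extend(rng.sample(by_domain[domain], quota))
    return balanced


# Toy records; the values are made up for illustration.
records = (
    [{"domain": "Physical Sciences", "confidence": 0.9} for _ in range(5)]
    + [{"domain": "Health Sciences", "confidence": 0.8} for _ in range(2)]
    + [{"domain": "Health Sciences", "confidence": 0.5} for _ in range(3)]
)

balanced = filter_and_balance(records)
print(len(balanced))
```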

Hyperparameters: lr=1e-5, cosine schedule, batch=32 (gradient accumulation 2, effective batch 64), epochs=8, warmup=6%, label smoothing=0.1, fp16, early stopping (patience 5), sqrt inverse-frequency class weights.
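
To make two of these settings concrete, the sketch below computes a linear-warmup plus cosine-decay learning rate and a set of sqrt inverse-frequency class weights. It is an illustration under assumptions: the decay-to-zero shape, the normalization of the weights to mean 1, the function names, and the use of all 200K abstracts for training (no held-out split) are not confirmed by this card.

```python
import math


def lr_at(step, total_steps, base_lr=1e-5, warmup_frac=0.06):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay to 0."""
    warmup_steps = round(total_steps * warmup_frac)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def sqrt_inverse_frequency_weights(class_counts):
    """Weight each class by 1/sqrt(count), rescaled so the mean weight is 1.

    Assumes every count is positive; the rescaling is an assumption.
    """
    raw = [1.0 / math.sqrt(count) for count in class_counts]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


# Step budget implied by the hyperparameters, assuming 200K training samples.
effective_batch = 32 * 2
steps_per_epoch = math.ceil(200_000 / effective_batch)
total_steps = steps_per_epoch * 8
warmup_steps = round(total_steps * 0.06)

print(steps_per_epoch, total_steps, warmup_steps)
print(lr_at(warmup_steps, total_steps))

# A class with 4x fewer samples gets 2x the weight, not 4x.
weights = sqrt_inverse_frequency_weights([100, 400])
print(weights)
```

The square root softens the correction relative to plain inverse-frequency weighting, so rare fields are upweighted without dominating the loss.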

Install

pip install paper-classifier

Or from source:

git clone https://github.com/jimnoneill/paper-to-field.git
cd paper-to-field
pip install -e .