---
library_name: transformers
license: apache-2.0
base_model: answerdotai/ModernBERT-base
tags:
- ner
- named-entity-recognition
- token-classification
- knowledge-platform
- modernbert
- multilingual
- patents
- scientific-papers
- cross-domain
- english
- german
- generated_from_trainer
language:
- en
- de
metrics:
- precision
- recall
- f1
- accuracy
pipeline_tag: token-classification
model-index:
- name: knowledge-platform-ner
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    metrics:
    - type: f1
      value: 0.9063
      name: F1
    - type: precision
      value: 0.8951
      name: Precision
    - type: recall
      value: 0.9178
      name: Recall
    - type: accuracy
      value: 0.9811
      name: Accuracy
---

# Knowledge Platform NER

A cross-domain, multilingual Named Entity Recognition model built for the **Knowledge Platform** — a system that connects patents, scientific papers, news articles, and political documents across 13 data sources.

Fine-tuned from [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) on 256K+ multilingual documents spanning patents (USPTO, EPO), scientific papers (OpenAlex, arXiv), political documents (Bundestag, EU Parliament), and news.

## Key Results

| Metric | Score |
|---|---|
| **F1** | **90.6%** |
| Precision | 89.5% |
| Recall | 91.8% |
| Accuracy | 98.1% |

## Entity Types

The model recognizes **15 entity types** using BIO tagging (31 labels total):

| Tag | Entity Type | Example |
|---|---|---|
| `PER` | Person | *James Chen*, *Lisa Paus*, *Yann LeCun* |
| `ORG` | Organization | *Samsung Electronics*, *Bundestag*, *OpenAI* |
| `LOC` | Location | *Seoul*, *Brüssel*, *New York* |
| `ANIM` | Animal | *E. coli*, *SARS-CoV-2* |
| `BIO` | Biological | *CRISPR-Cas9*, *mRNA* |
| `CEL` | Celestial Body | *Mars*, *Jupiter* |
| `DIS` | Disease | *Alzheimer's*, *sickle cell disease* |
| `EVE` | Event | *COP28*, *World Economic Forum* |
| `FOOD` | Food | *glyphosate*, *insulin* |
| `INST` | Instrument | *LiDAR*, *mass spectrometer* |
| `MEDIA` | Media/Work | *Nature*, *The Lancet* |
| `MYTH` | Mythological | *Apollo* (program context) |
| `PLANT` | Plant | *Arabidopsis*, *Cannabis sativa* |
| `TIME` | Time | *Q3 2025*, *fiscal year 2024* |
| `VEHI` | Vehicle | *Falcon 9*, *Boeing 787* |

## Use Cases

This model is designed for **knowledge graph construction** from heterogeneous document collections:

- **Patent Analysis**: Extract assignees, inventors, locations, and technologies from patent filings
- **Scientific Literature**: Identify authors, institutions, biological entities, and instruments from papers
- **Political Document Processing**: Extract politicians, parties, and organizations from parliamentary debates (EN + DE)
- **News Processing**: Identify key entities across news articles for event tracking
- **Cross-Domain Knowledge Graphs**: Connect entities that appear across different document types and languages

### Works with the Knowledge Platform Embedding Model

This model is designed to work alongside [deepakint/knowledge-platform-embeddings](https://huggingface.co/deepakint/knowledge-platform-embeddings) — a SciNCL-based embedding model fine-tuned with contrastive learning on the same document corpus.

**Together they form a pipeline:**

1. **This NER model** extracts entities (the nodes of a knowledge graph)
2. **The embedding model** finds document connections (the edges of a knowledge graph)
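As a rough end-to-end illustration, the sketch below runs both models over a toy document set: extracted entities become nodes, and high-similarity document pairs become edges. It assumes the embedding model loads with `sentence-transformers` (plausible for a SciNCL-based model, but an assumption), and the 0.8 similarity cutoff is arbitrary, not a tuned value.

```python
from itertools import combinations

from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

docs = [
    "Samsung Electronics filed a patent at the USPTO in Washington, D.C.",
    "Researchers at Stanford University published new CRISPR results in Nature.",
]

# Nodes: entities extracted from each document by this NER model
ner = pipeline("ner", model="deepakint/knowledge-platform-ner", aggregation_strategy="max")
nodes = {i: [e["word"] for e in ner(doc)] for i, doc in enumerate(docs)}

# Edges: document pairs whose embeddings are similar
# (assumes sentence-transformers compatibility; 0.8 is an arbitrary cutoff)
embedder = SentenceTransformer("deepakint/knowledge-platform-embeddings")
embeddings = embedder.encode(docs, convert_to_tensor=True)
edges = [
    (i, j)
    for i, j in combinations(range(len(docs)), 2)
    if util.cos_sim(embeddings[i], embeddings[j]).item() > 0.8
]

print("nodes:", nodes)
print("edges:", edges)
```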
## Quick Start

```python
from transformers import pipeline

ner = pipeline(
    "ner",
    model="deepakint/knowledge-platform-ner",
    aggregation_strategy="max"
)

# English patent text
text = "Samsung Electronics Co., Ltd. filed a patent at the USPTO in Washington, D.C."
entities = ner(text)

for entity in entities:
    print(f"  {entity['word']:40s} {entity['entity_group']:10s} {entity['score']:.3f}")
```

```
  Samsung Electronics Co., Ltd.            ORG        1.000
  USPTO                                    ORG        0.998
  Washington, D.C.                         LOC        0.999
```

```python
# German political text
text = "Lisa Paus sprach im Deutschen Bundestag in Berlin über die neue Regulierung."
entities = ner(text)

for entity in entities:
    print(f"  {entity['word']:40s} {entity['entity_group']:10s} {entity['score']:.3f}")
```

```
  Lisa Paus                                PER        1.000
  Deutschen Bundestag                      ORG        1.000
  Berlin                                   LOC        1.000
```

## Grouping Entities by Type

```python
from collections import defaultdict

text = """Apple Inc. CEO Tim Cook announced a new research lab in Palo Alto, California, partnering with Stanford University on CRISPR gene editing research."""

entities = ner(text)

grouped = defaultdict(list)
for ent in entities:
    grouped[ent["entity_group"]].append(ent["word"])

for label, names in sorted(grouped.items()):
    print(f"  {label:8s}: {names}")
```

```
  BIO     : ['CRISPR']
  LOC     : ['Palo Alto', 'California']
  ORG     : ['Apple Inc.', 'Stanford University']
  PER     : ['Tim Cook']
```

## Training Details

### Base Model

[answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) — a 149M-parameter encoder model with:

- 8,192-token context length (vs. 512 for classic BERT)
- Rotary Position Embeddings (RoPE)
- Alternating full + sliding-window attention
- Pre-trained on 2 trillion tokens of English text

### Training Data

~256,000 documents from 13 data sources across multiple domains and languages:

| Domain | Sources | Languages |
|---|---|---|
| Patents | USPTO, EPO | EN, DE |
| Scientific Papers | OpenAlex, arXiv | EN |
| Political Documents | Bundestag, EU Parliament | DE, EN |
| News | Various | EN, DE |

### Hyperparameters

| Parameter | Value |
|---|---|
| Learning rate | 2e-05 |
| Batch size | 16 (×2 gradient accumulation = 32 effective) |
| Epochs | 3 |
| Optimizer | AdamW |
| LR scheduler | Cosine with 10% warmup |
| Seed | 42 |
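These settings map onto `transformers`' `TrainingArguments` roughly as follows. This is a reconstruction from the table above, not the original training script; the output path and evaluation cadence are assumptions.

```python
from transformers import TrainingArguments

# Reconstructed from the hyperparameter table; the output path and the
# per-epoch evaluation/save cadence are assumptions, not the real script.
training_args = TrainingArguments(
    output_dir="knowledge-platform-ner",  # assumed
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,        # 32 effective batch size
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    optim="adamw_torch",
    seed=42,
    eval_strategy="epoch",                # assumed
    save_strategy="epoch",                # assumed
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
```

`load_best_model_at_end` with `metric_for_best_model="eval_loss"` mirrors the checkpoint-selection note under Training Progress below.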
### Training Progress

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 1 | 0.1276 | 0.0766 | 0.8595 | 0.8361 | 0.8476 | 0.9728 |
| 2 | 0.0927 | 0.0623 | 0.8659 | 0.8923 | 0.8789 | 0.9777 |
| 3 | 0.0422 | 0.0694 | 0.8707 | 0.8949 | 0.8827 | 0.9778 |

**Note:** The best checkpoint (epoch ~2, lowest validation loss 0.0606) was selected as the final model, achieving **90.6% F1**.

## Strengths and Limitations

### Strengths

- **Cross-domain**: Works on patents, papers, news, and political documents with a single model
- **Multilingual**: Handles both English and German text
- **Rich entity types**: 15 entity types covering people, organizations, locations, biological entities, diseases, instruments, and more
- **Fast**: ~5 ms per document on CPU — suitable for processing millions of documents
- **Long context**: Inherits ModernBERT's 8,192-token context window

### Limitations

- **Conference/product names**: May fragment uncommon compound names (e.g., "NeurIPS" split into tokens) — use confidence thresholding (>0.5) to filter
- **Languages**: Optimized for English and German; other languages may work but are untested
- **Domain drift**: Performance is best on patent, scientific, political, and news text — may degrade on informal text (social media, chat)

## Recommended Post-Processing

For production use, apply a confidence threshold to filter low-quality predictions (a fuller sketch that also merges fragmented names appears at the end of this card):

```python
# Filter entities with confidence > 0.5
entities = [e for e in ner(text) if e["score"] > 0.5]
```

## Framework Versions

- Transformers: 5.6.0
- PyTorch: 2.5.1+cu121
- Datasets: 4.8.4
- Tokenizers: 0.22.2

## Citation

```bibtex
@misc{knowledge-platform-ner-2026,
  title={Knowledge Platform NER: Cross-Domain Multilingual Named Entity Recognition},
  author={deepakint},
  year={2026},
  url={https://huggingface.co/deepakint/knowledge-platform-ner}
}
```

## Related Models

- **Embedding Model**: [deepakint/knowledge-platform-embeddings](https://huggingface.co/deepakint/knowledge-platform-embeddings) — Cross-domain semantic search and document matching
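## Appendix: Merging Fragmented Predictions

As noted under Limitations, uncommon compound names can be split into adjacent fragments. The helper below is a hypothetical sketch, not part of the released pipeline: it applies the recommended 0.5 threshold and then re-joins adjacent predictions of the same entity type whose character spans touch.

```python
def merge_fragments(text, entities, threshold=0.5):
    """Hypothetical helper (not part of the released pipeline): drop
    low-confidence predictions, then re-join adjacent fragments of the
    same entity type using the pipeline's `start`/`end` char offsets."""
    kept = [e for e in entities if e["score"] > threshold]
    merged = []
    for ent in kept:
        prev = merged[-1] if merged else None
        if (
            prev
            and ent["entity_group"] == prev["entity_group"]
            and ent["start"] - prev["end"] <= 1  # touching or 1-char gap
        ):
            # Extend the previous span to cover this fragment
            prev["end"] = ent["end"]
            prev["word"] = text[prev["start"]:prev["end"]]
            prev["score"] = min(prev["score"], ent["score"])  # conservative
        else:
            merged.append(dict(ent))
    return merged


# Uses `ner` and `text` as defined in the Quick Start example
entities = merge_fragments(text, ner(text))
```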