---
library_name: transformers
license: apache-2.0
base_model: answerdotai/ModernBERT-base
tags:
- ner
- named-entity-recognition
- token-classification
- knowledge-platform
- modernbert
- multilingual
- patents
- scientific-papers
- cross-domain
- english
- german
- generated_from_trainer
language:
- en
- de
metrics:
- precision
- recall
- f1
- accuracy
pipeline_tag: token-classification
model-index:
- name: knowledge-platform-ner
results:
- task:
type: token-classification
name: Named Entity Recognition
metrics:
- type: f1
value: 0.9063
name: F1
- type: precision
value: 0.8951
name: Precision
- type: recall
value: 0.9178
name: Recall
- type: accuracy
value: 0.9811
name: Accuracy
---
# Knowledge Platform NER
A cross-domain, multilingual Named Entity Recognition model built for the **Knowledge Platform** — a system that connects patents, scientific papers, news articles, and political documents across 13 data sources.
Fine-tuned from [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) on 256K+ multilingual documents spanning patents (USPTO, EPO), scientific papers (OpenAlex, arXiv), political documents (Bundestag, EU Parliament), and news.
## Key Results
| Metric | Score |
|---|---|
| **F1** | **90.6%** |
| Precision | 89.5% |
| Recall | 91.8% |
| Accuracy | 98.1% |
## Entity Types
The model recognizes **15 entity types** using BIO tagging, with a `B-` and an `I-` label per type plus `O` for 31 labels in total (a label-derivation sketch follows the table):
| Tag | Entity Type | Example |
|---|---|---|
| `PER` | Person | *James Chen*, *Lisa Paus*, *Yann LeCun* |
| `ORG` | Organization | *Samsung Electronics*, *Bundestag*, *OpenAI* |
| `LOC` | Location | *Seoul*, *Brüssel*, *New York* |
| `ANIM` | Animal | *E. coli*, *SARS-CoV-2* |
| `BIO` | Biological | *CRISPR-Cas9*, *mRNA* |
| `CEL` | Celestial Body | *Mars*, *Jupiter* |
| `DIS` | Disease | *Alzheimer's*, *sickle cell disease* |
| `EVE` | Event | *COP28*, *World Economic Forum* |
| `FOOD` | Food | *glyphosate*, *insulin* |
| `INST` | Instrument | *LiDAR*, *mass spectrometer* |
| `MEDIA` | Media/Work | *Nature*, *The Lancet* |
| `MYTH` | Mythological | *Apollo* (program context) |
| `PLANT` | Plant | *Arabidopsis*, *cannabis sativa* |
| `TIME` | Time | *Q3 2025*, *fiscal year 2024* |
| `VEHI` | Vehicle | *Falcon 9*, *Boeing 787* |
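The 31-label count follows mechanically from the scheme; here is a minimal sketch, assuming the standard BIO convention. The label order below is illustrative, not guaranteed to match the checkpoint; the authoritative mapping lives in the checkpoint's `config.json` (`model.config.id2label`):

```python
# Derive the full label set from the 15 entity types under BIO tagging:
# one B- (begin) and one I- (inside) label per type, plus a single O (outside).
ENTITY_TYPES = [
    "PER", "ORG", "LOC", "ANIM", "BIO", "CEL", "DIS", "EVE",
    "FOOD", "INST", "MEDIA", "MYTH", "PLANT", "TIME", "VEHI",
]

labels = ["O"] + [f"{p}-{t}" for t in ENTITY_TYPES for p in ("B", "I")]
assert len(labels) == 31  # 15 types * 2 prefixes + O
```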
## Use Cases
This model is designed for **knowledge graph construction** from heterogeneous document collections:
- **Patent Analysis**: Extract assignees, inventors, locations, and technologies from patent filings
- **Scientific Literature**: Identify authors, institutions, biological entities, and instruments from papers
- **Political Document Processing**: Extract politicians, parties, organizations from parliamentary debates (EN + DE)
- **News Processing**: Identify key entities across news articles for event tracking
- **Cross-Domain Knowledge Graphs**: Connect entities that appear across different document types and languages
### Works with the Knowledge Platform Embedding Model
This model is designed to work alongside [deepakint/knowledge-platform-embeddings](https://huggingface.co/deepakint/knowledge-platform-embeddings) — a SciNCL-based embedding model fine-tuned with contrastive learning on the same document corpus.
**Together they form a pipeline:**
1. **This NER model** extracts entities (the nodes of a knowledge graph)
2. **The embedding model** finds document connections (the edges of a knowledge graph)
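A minimal sketch of that two-step pipeline, assuming the embedding model loads via `sentence-transformers` (the sample documents, the 0.5 confidence threshold, and the similarity-as-edge-weight convention are illustrative assumptions, not part of either model card):

```python
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

ner = pipeline("ner", model="deepakint/knowledge-platform-ner", aggregation_strategy="max")
embedder = SentenceTransformer("deepakint/knowledge-platform-embeddings")  # assumes ST-compatible weights

docs = [
    "Samsung Electronics Co., Ltd. filed a patent at the USPTO in Washington, D.C.",
    "Stanford University researchers published CRISPR gene editing results in Nature.",
]

# Nodes: confidence-filtered entities extracted per document
nodes = [[e["word"] for e in ner(d) if e["score"] > 0.5] for d in docs]

# Edges: pairwise document similarity from the embedding model
embeddings = embedder.encode(docs, convert_to_tensor=True)
similarity = util.cos_sim(embeddings, embeddings)

print(nodes)
print(f"edge weight doc0/doc1: {float(similarity[0, 1]):.3f}")
```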
## Quick Start
```python
from transformers import pipeline
ner = pipeline(
    "ner",
    model="deepakint/knowledge-platform-ner",
    aggregation_strategy="max",
)

# English patent text
text = "Samsung Electronics Co., Ltd. filed a patent at the USPTO in Washington, D.C."

entities = ner(text)
for entity in entities:
    print(f" {entity['word']:40s} {entity['entity_group']:10s} {entity['score']:.3f}")
```
```
 Samsung Electronics Co., Ltd.            ORG        1.000
 USPTO                                    ORG        0.998
 Washington, D.C.                         LOC        0.999
```
```python
# German political text
text = "Lisa Paus sprach im Deutschen Bundestag in Berlin über die neue Regulierung."
entities = ner(text)
for entity in entities:
    print(f" {entity['word']:40s} {entity['entity_group']:10s} {entity['score']:.3f}")
```
```
 Lisa Paus                                PER        1.000
 Deutschen Bundestag                      ORG        1.000
 Berlin                                   LOC        1.000
```
## Grouping Entities by Type
```python
from collections import defaultdict
text = """Apple Inc. CEO Tim Cook announced a new research lab in Palo Alto,
California, partnering with Stanford University on CRISPR gene editing research."""
entities = ner(text)
grouped = defaultdict(list)
for ent in entities:
    grouped[ent["entity_group"]].append(ent["word"])

for label, names in sorted(grouped.items()):
    print(f" {label:8s}: {names}")
```
```
 BIO     : ['CRISPR']
 LOC     : ['Palo Alto', 'California']
 ORG     : ['Apple Inc.', 'Stanford University']
 PER     : ['Tim Cook']
```
## Training Details
### Base Model
[answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) — a 149M parameter encoder model with:
- 8,192 token context length (vs. 512 for classic BERT)
- Rotary Position Embeddings (RoPE)
- Alternating full + sliding window attention
- Pre-trained on 2 trillion tokens of English text
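Since fine-tuning keeps the base architecture, the long context and the label count are easy to verify from the published config; a quick sketch, with expected values per this card:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("deepakint/knowledge-platform-ner")
print(config.max_position_embeddings)  # 8192-token context inherited from ModernBERT
print(len(config.id2label))            # 31 BIO labels (15 entity types)
```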
### Training Data
~256,000 documents from 13 data sources across multiple domains and languages:
| Domain | Sources | Language |
|---|---|---|
| Patents | USPTO, EPO | EN, DE |
| Scientific Papers | OpenAlex, arXiv | EN |
| Political Documents | Bundestag, EU Parliament | DE, EN |
| News | Various | EN, DE |
### Hyperparameters
| Parameter | Value |
|---|---|
| Learning rate | 2e-05 |
| Batch size | 16 (x2 gradient accumulation = 32 effective) |
| Epochs | 3 |
| Optimizer | AdamW |
| LR scheduler | Cosine with 10% warmup |
| Seed | 42 |
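A hedged reconstruction of this configuration with the `transformers` Trainer API. The per-epoch evaluation/save strategy and the best-checkpoint flags (see the note below the next table) are assumptions about the exact setup, not taken from this card:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="knowledge-platform-ner",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,      # effective batch size 32
    num_train_epochs=3,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                   # 10% warmup
    seed=42,
    eval_strategy="epoch",              # assumption: evaluation cadence not stated
    save_strategy="epoch",
    load_best_model_at_end=True,        # select the checkpoint with the lowest
    metric_for_best_model="eval_loss",  # validation loss as the final model
    greater_is_better=False,
)
```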
### Training Progress
| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 1 | 0.1276 | 0.0766 | 0.8595 | 0.8361 | 0.8476 | 0.9728 |
| 2 | 0.0927 | 0.0623 | 0.8659 | 0.8923 | 0.8789 | 0.9777 |
| 3 | 0.0422 | 0.0694 | 0.8707 | 0.8949 | 0.8827 | 0.9778 |
**Note:** The final model is the best checkpoint by validation loss (around epoch 2, validation loss 0.0606, from an evaluation step that falls between the per-epoch rows above), which achieves the reported **90.6% F1**.
## Strengths and Limitations
### Strengths
- **Cross-domain**: Works on patents, papers, news, and political documents with a single model
- **Multilingual**: Handles both English and German text
- **Rich entity types**: 15 entity types covering people, organizations, locations, biological entities, diseases, instruments, and more
- **Fast**: ~5ms per document on CPU — suitable for processing millions of documents (see the timing sketch after this list)
- **Long context**: Inherits ModernBERT's 8,192 token context window
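The throughput figure is straightforward to sanity-check on your own hardware; a minimal sketch reusing the `ner` pipeline from Quick Start (the batch size is an arbitrary choice, and timings vary with document length and CPU):

```python
import time

docs = ["Samsung Electronics Co., Ltd. filed a patent at the USPTO."] * 100

start = time.perf_counter()
ner(docs, batch_size=32)
elapsed = time.perf_counter() - start

print(f"{1000 * elapsed / len(docs):.1f} ms per document")
```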
### Limitations
- **Conference/product names**: May fragment uncommon compound names (e.g., "NeurIPS" split into tokens) — use confidence thresholding (>0.5) to filter
- **Languages**: Optimized for English and German; other languages may work but are untested
- **Domain drift**: Performance is best on patent, scientific, political, and news text — may degrade on informal text (social media, chat)
## Recommended Post-Processing
For production use, apply a confidence threshold to filter low-quality predictions:
```python
# Filter entities with confidence > 0.5
entities = [e for e in ner(text) if e["score"] > 0.5]
```
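For the fragmentation issue noted under Limitations, one additional heuristic (illustrative only; the one-character cutoff is an assumption to tune on your corpus) is to drop spans that look like stray subword fragments:

```python
# Drop one-character spans, which are usually stray fragments
# of split compound names rather than real entities
entities = [e for e in entities if len(e["word"].strip()) > 1]
```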
## Framework Versions
- Transformers: 5.6.0
- PyTorch: 2.5.1+cu121
- Datasets: 4.8.4
- Tokenizers: 0.22.2
## Citation
```bibtex
@misc{knowledge-platform-ner-2026,
  title={Knowledge Platform NER: Cross-Domain Multilingual Named Entity Recognition},
  author={deepakint},
  year={2026},
  url={https://huggingface.co/deepakint/knowledge-platform-ner}
}
```
## Related Models
- **Embedding Model**: [deepakint/knowledge-platform-embeddings](https://huggingface.co/deepakint/knowledge-platform-embeddings) — Cross-domain semantic search and document matching