Korean Neural Sparse Encoder (V33-ecom-v6)
SPLADE sparse retrieval model on a ModernBERT backbone for Korean, fine-tuned for e-commerce product search.
Model Details
- Architecture: SPLADE-max (MLM → log(1+ReLU) → max pool)
- Backbone: skt/A.X-Encoder-base (ModernBERT, 149M params)
- Vocab: 50K tokens (48.4% Korean)
- Sparsity: nz_q ≈ 40, nz_d ≈ 68 (ultra-sparse)
- Training: V33 base (25 epochs, 4.84M triplets) + V6 ecom fine-tune (10 epochs)
E-Commerce Fine-Tuning (V6)
- 3K e-commerce triplets + 12K general triplets (20:80 ratio)
- Single negative, no MarginMSE/KD
- LR 2e-6, batch 64, 10 epochs
- E-commerce data: Korean Wikipedia product articles with TF-IDF hard negatives
Benchmark Results
| Benchmark | BM25 | V33 Base | V33-ecom-v6 | BGE-M3 |
|---|---|---|---|---|
| Ko-StrategyQA (R@1) | 53.7% | 62.2% | 63.0% | 73.5% |
| MIRACL-ko (R@1) | 44.1% | 62.0% | 62.0% | 70.9% |
| Mr.TyDi-ko (R@1) | 55.6% | 73.4% | 75.3% | 84.1% |
| ecom-ko (R@1) | - | 67.8% | 78.6% | - |
Key improvements over V33 base:
- ecom-ko: +10.8pp (67.8% → 78.6%)
- Mr.TyDi-ko: +1.9pp (73.4% → 75.3%)
- Ko-StrategyQA: +0.8pp (62.2% → 63.0%)
- No regression on MIRACL-ko
Usage with OpenSearch
// 1. Register model via ML Commons
POST /_plugins/_ml/models/_register
{
"name": "korean-neural-sparse-encoder",
"version": "1.0.0",
"model_format": "TORCH_SCRIPT",
"model_config": {
"model_type": "sparse_encoding",
"embedding_dimension": 1,
"framework_type": "huggingface_transformers"
},
"url": "https://huggingface.co/sewoong/korean-neural-sparse-encoder"
}
// 2. Create index with neural sparse field
PUT /my-ecom-index
{
"settings": {
"index": {
"knn": true
}
},
"mappings": {
"properties": {
"text": { "type": "text" },
"sparse_embedding": {
"type": "rank_features"
}
}
}
}
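As an alternative to server-side encoding via an ingest pipeline, documents can be expanded client-side and stored directly in the `rank_features` field. A minimal sketch, assuming the `opensearch-py` client and an `encode_sparse(text) -> {token: weight}` helper like the one shown in the Transformers section; the function and variable names here are illustrative, not part of this model card:

```python
def to_bulk_actions(index, docs, encode_sparse):
    """Build opensearch-py bulk actions that store client-side sparse
    expansions in the rank_features field defined in the mapping above."""
    for i, text in enumerate(docs):
        yield {
            "_op_type": "index",
            "_index": index,
            "_id": str(i),
            "_source": {
                "text": text,
                "sparse_embedding": encode_sparse(text),  # {token: weight}
            },
        }

# Usage sketch (client setup is an assumption, adapt to your cluster):
# from opensearchpy import OpenSearch, helpers
# client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
# helpers.bulk(client, to_bulk_actions("my-ecom-index", docs, encode_sparse))
```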
// 3. Neural sparse search query
GET /my-ecom-index/_search
{
"query": {
"neural_sparse": {
"sparse_embedding": {
"query_text": "์ผ์ฑ ๊ฐค๋ญ์ ๋ฌด์ ์ด์ดํฐ",
"model_id": "<model_id>"
}
}
}
}
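The same search can be issued from Python by building the request body programmatically. A hedged sketch; the helper name is illustrative and `<model_id>` is whatever ML Commons returned at registration:

```python
def neural_sparse_query(field, query_text, model_id, k=10):
    """Build the request body for an OpenSearch neural_sparse search."""
    return {
        "size": k,
        "query": {
            "neural_sparse": {
                field: {
                    "query_text": query_text,
                    "model_id": model_id,
                }
            }
        },
    }

# Usage sketch with opensearch-py (client setup is an assumption):
# body = neural_sparse_query("sparse_embedding", "삼성 갤럭시 무선 이어폰", "<model_id>")
# client.search(index="my-ecom-index", body=body)
```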
Usage with Transformers
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("sewoong/korean-neural-sparse-encoder")
model = AutoModelForMaskedLM.from_pretrained("sewoong/korean-neural-sparse-encoder")
model.eval()
def encode_sparse(text, max_length=256):
"""Encode text to sparse vector using SPLADE-max."""
inputs = tokenizer(
text,
return_tensors="pt",
max_length=max_length,
truncation=True,
padding=True,
)
with torch.no_grad():
logits = model(**inputs).logits
# SPLADE-max: log(1 + ReLU(logits)) then max-pool over sequence
sparse = torch.log1p(torch.relu(logits)).max(dim=1).values
# Zero out special tokens
special_ids = [tokenizer.cls_token_id, tokenizer.sep_token_id, tokenizer.pad_token_id]
for sid in special_ids:
if sid is not None:
sparse[0, sid] = 0
# Extract non-zero token weights
idx = sparse[0].nonzero().squeeze(-1)
weights = sparse[0, idx]
tokens = tokenizer.convert_ids_to_tokens(idx.tolist())
return dict(zip(tokens, weights.tolist()))
# Example: encode a query
result = encode_sparse("삼성 갤럭시 무선 이어폰")  # "Samsung Galaxy wireless earbuds"
for token, weight in sorted(result.items(), key=lambda x: -x[1])[:10]:
print(f"{token}: {weight:.3f}")
# Example: encode a document
doc_result = encode_sparse(
    "삼성 갤럭시 버즈3 프로 무선 이어폰. 24bit Hi-Fi 사운드와 인텔리전트 "
    "액티브 노이즈 캔슬링으로 몰입감 있는 청취 경험을 제공합니다.",
    # "Samsung Galaxy Buds3 Pro wireless earbuds. Delivers an immersive listening
    # experience with 24-bit Hi-Fi sound and intelligent active noise cancelling."
    max_length=256,
)
print(f"\nDocument non-zero tokens: {len(doc_result)}")
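Relevance under SPLADE is the dot product of the query and document sparse vectors; with the `{token: weight}` dicts returned by `encode_sparse` above, this reduces to a sum over shared tokens. A minimal sketch (the helper name is illustrative):

```python
def sparse_dot(query_vec, doc_vec):
    """Dot product of two {token: weight} sparse vectors, i.e. the
    SPLADE relevance score. Only tokens present in both contribute."""
    # Iterate over the smaller dict for efficiency.
    if len(query_vec) > len(doc_vec):
        query_vec, doc_vec = doc_vec, query_vec
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)

# score = sparse_dot(result, doc_result)
```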
Training Details
V33 Base Training
The base model was trained on 4.84M Korean triplets spanning 46 diverse datasets, starting from the skt/A.X-Encoder-base checkpoint.
| Parameter | Value |
|---|---|
| Base model | skt/A.X-Encoder-base (ModernBERT) |
| Parameters | 149M |
| Training data | 4.84M triplets (46 shards) |
| Loss | InfoNCE + FLOPS regularization |
| FLOPS lambda_q | 0.01 |
| FLOPS lambda_d | 0.003 |
| Batch size | 2048 (effective) |
| Learning rate | 5e-5 |
| Epochs | 25 |
| Hardware | 8x NVIDIA B200 (183GB each) |
| Precision | bf16 |
| Final nz_q | 35 |
| Final nz_d | 58 |
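The FLOPS regularizer in the loss penalizes the squared mean activation of each vocabulary term across the batch, which pushes rarely useful dimensions to exact zero and produces the low nz_q/nz_d counts above. A minimal NumPy sketch under that standard formulation (function and variable names are illustrative):

```python
import numpy as np

def flops_loss(batch_weights, lam):
    """FLOPS regularizer: lam * sum_j (mean_i w_ij)^2 over the vocab axis.
    batch_weights: (batch, vocab) array of non-negative SPLADE activations."""
    mean_per_term = batch_weights.mean(axis=0)   # average activation per vocab term
    return lam * float((mean_per_term ** 2).sum())

# Applied separately to queries and documents with the lambdas from the table:
# loss = info_nce + flops_loss(q_weights, 0.01) + flops_loss(d_weights, 0.003)
```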
V6 E-Commerce Fine-Tuning
Fine-tuned from V33 checkpoint with domain-specific e-commerce data while preserving general retrieval quality.
| Parameter | Value |
|---|---|
| Training data | 3K ecom + 12K general (20:80 ratio) |
| E-commerce source | Korean Wikipedia product articles |
| Negative mining | TF-IDF char_wb ngrams (2,3) |
| Loss | InfoNCE + FLOPS (same as V33, no KD) |
| Learning rate | 2e-6 |
| Batch size | 64 per GPU |
| Grad accumulation | 4 |
| Epochs | 10 |
| Total steps | 77 |
| Final nz_q | 40 |
| Final nz_d | 68 |
Key lessons from e-commerce fine-tuning:
- MarginMSE/KD is harmful for sparse fine-tuning (destroys learned representations)
- Must match original training recipe exactly (single neg, same FLOPS lambda)
- 20:80 domain/general ratio prevents catastrophic forgetting
- Category balance: 20 e-commerce categories (electronics, fashion, beauty, food, appliances, furniture, kitchen, sports, automotive, pets, baby, health supplements, stationery, books, music, household, crafts, instruments, camping, gardening)
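The TF-IDF hard-negative mining described above (character n-grams of size 2-3 with word-boundary awareness) surfaces candidates that are lexically close to the query but not relevant. A sketch using scikit-learn's `TfidfVectorizer`; the function name and ranking logic are illustrative, not the toolkit's exact implementation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mine_hard_negatives(query, candidates, top_k=1):
    """Return the top_k candidates most lexically similar to the query,
    scored by char_wb (2,3)-gram TF-IDF cosine similarity."""
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
    matrix = vec.fit_transform([query] + candidates)
    sims = cosine_similarity(matrix[0:1], matrix[1:]).ravel()
    ranked = sims.argsort()[::-1][:top_k]
    return [candidates[i] for i in ranked]
```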
Fine-Tuning Toolkit
A complete toolkit for fine-tuning this model with your domain-specific data is included in the finetune/ directory.
Quick Start
# Clone the fine-tuning toolkit
git clone https://huggingface.co/sewoong/korean-neural-sparse-encoder
cd korean-neural-sparse-encoder/finetune
# Install dependencies
pip install -r requirements.txt
# Prepare your data (CSV/TSV -> JSONL)
python scripts/prepare_data.py --input your_data.csv --output data/your_domain/
# Fine-tune
python scripts/finetune.py --config configs/finetune.yaml
# Test encoding
python scripts/encode.py --model sewoong/korean-neural-sparse-encoder --text "samsung galaxy wireless earbuds"
Toolkit Contents
| File | Description |
|---|---|
| finetune/scripts/finetune.py | Self-contained fine-tuning script (InfoNCE + FLOPS) |
| finetune/scripts/encode.py | Sparse vector encoding and similarity |
| finetune/scripts/prepare_data.py | CSV/TSV to JSONL data converter |
| finetune/configs/finetune.yaml | Proven V33-ecom-v6 training recipe |
| finetune/notebooks/finetune_tutorial.ipynb | Step-by-step Colab tutorial |
| finetune/data/sample/ | 50 train + 10 val Korean e-commerce sample triplets |
GitHub repository: mateon01/korean-neural-sparse-encoder
Citation
@misc{korean-neural-sparse-encoder,
title={Korean Neural Sparse Encoder},
author={Sewoong Kim},
year={2026},
url={https://huggingface.co/sewoong/korean-neural-sparse-encoder}
}