Korean Neural Sparse Encoder (V33-ecom-v6)

A SPLADE sparse retrieval model for Korean built on a ModernBERT backbone, fine-tuned for e-commerce product search.

Model Details

  • Architecture: SPLADE-max (MLM → log(1+ReLU) → max pool)
  • Backbone: skt/A.X-Encoder-base (ModernBERT, 149M params)
  • Vocab: 50K tokens (48.4% Korean)
  • Sparsity: nz_q ≈ 40, nz_d ≈ 68 average non-zero terms per query/document (ultra-sparse)
  • Training: V33 base (25 epochs, 4.84M triplets) + V6 ecom fine-tune (10 epochs)

E-Commerce Fine-Tuning (V6)

  • 3K e-commerce triplets + 12K general triplets (20:80 ratio)
  • Single negative, no MarginMSE/KD
  • LR 2e-6, batch 64, 10 epochs
  • E-commerce data: Korean Wikipedia product articles with TF-IDF hard negatives

Benchmark Results

Benchmark             BM25    V33 Base   V33-ecom-v6   BGE-M3
Ko-StrategyQA (R@1)   53.7%   62.2%      63.0%         73.5%
MIRACL-ko (R@1)       44.1%   62.0%      62.0%         70.9%
Mr.TyDi-ko (R@1)      55.6%   73.4%      75.3%         84.1%
ecom-ko (R@1)         -       67.8%      78.6%         -

Key improvements over V33 base:

  • ecom-ko: +10.8pp (67.8% → 78.6%)
  • Mr.TyDi-ko: +1.9pp (73.4% → 75.3%)
  • Ko-StrategyQA: +0.8pp (62.2% → 63.0%)
  • No regression on MIRACL-ko

Usage with OpenSearch

// 1. Register model via ML Commons
POST /_plugins/_ml/models/_register
{
  "name": "korean-neural-sparse-encoder",
  "version": "1.0.0",
  "model_format": "TORCH_SCRIPT",
  "model_config": {
    "model_type": "sparse_encoding",
    "embedding_dimension": 1,
    "framework_type": "huggingface_transformers"
  },
  "url": "https://huggingface.co/sewoong/korean-neural-sparse-encoder"
}
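
// 1b. Deploy the registered model before use (standard ML Commons call on
//     recent OpenSearch versions; substitute the model_id returned by step 1)
POST /_plugins/_ml/models/<model_id>/_deploy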

// 2. Create index with neural sparse field
PUT /my-ecom-index
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "text": { "type": "text" },
      "sparse_embedding": {
        "type": "rank_features"
      }
    }
  }
}
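
// 2b. Optional: ingest pipeline that encodes the "text" field into
//     "sparse_embedding" at index time via the standard sparse_encoding
//     processor (pipeline name is illustrative; attach it with the index
//     setting "default_pipeline" or per request with ?pipeline=)
PUT /_ingest/pipeline/korean-sparse-pipeline
{
  "processors": [
    {
      "sparse_encoding": {
        "model_id": "<model_id>",
        "field_map": {
          "text": "sparse_embedding"
        }
      }
    }
  ]
}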

// 3. Neural sparse search query
GET /my-ecom-index/_search
{
  "query": {
    "neural_sparse": {
      "sparse_embedding": {
        "query_text": "์‚ผ์„ฑ ๊ฐค๋Ÿญ์‹œ ๋ฌด์„  ์ด์–ดํฐ",
        "model_id": "<model_id>"
      }
    }
  }
}

Usage with Transformers

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("sewoong/korean-neural-sparse-encoder")
model = AutoModelForMaskedLM.from_pretrained("sewoong/korean-neural-sparse-encoder")
model.eval()


def encode_sparse(text, max_length=256):
    """Encode text to sparse vector using SPLADE-max."""
    inputs = tokenizer(
        text,
        return_tensors="pt",
        max_length=max_length,
        truncation=True,
        padding=True,
    )
    with torch.no_grad():
        logits = model(**inputs).logits

    # SPLADE-max: log(1 + ReLU(logits)), then max-pool over the sequence.
    # Mask padding positions first so they cannot win the max when batching.
    mask = inputs["attention_mask"].unsqueeze(-1)
    sparse = (torch.log1p(torch.relu(logits)) * mask).max(dim=1).values

    # Zero out special tokens
    special_ids = [tokenizer.cls_token_id, tokenizer.sep_token_id, tokenizer.pad_token_id]
    for sid in special_ids:
        if sid is not None:
            sparse[0, sid] = 0

    # Extract non-zero token weights
    idx = sparse[0].nonzero().squeeze(-1)
    weights = sparse[0, idx]
    tokens = tokenizer.convert_ids_to_tokens(idx.tolist())
    return dict(zip(tokens, weights.tolist()))


# Example: encode a query ("Samsung Galaxy wireless earbuds")
result = encode_sparse("삼성 갤럭시 무선 이어폰")
for token, weight in sorted(result.items(), key=lambda x: -x[1])[:10]:
    print(f"{token}: {weight:.3f}")

# Example: encode a document ("Samsung Galaxy Buds3 Pro wireless earbuds.
# 24-bit Hi-Fi sound and intelligent active noise cancelling deliver an
# immersive listening experience.")
doc_result = encode_sparse(
    "삼성 갤럭시 버즈3 프로 무선 이어폰. 24bit Hi-Fi 사운드와 "
    "인텔리전트 액티브 노이즈 캔슬링으로 몰입감 있는 청취 경험을 제공합니다.",
    max_length=256,
)
print(f"\nDocument non-zero tokens: {len(doc_result)}")
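
Query-document relevance under SPLADE is the dot product of the two sparse vectors, which only touches tokens shared by both sides. A minimal sketch using the dictionaries returned by encode_sparse above:

def sparse_dot(a: dict, b: dict) -> float:
    """Dot product of two {token: weight} sparse vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller dict
    return sum(w * b[t] for t, w in a.items() if t in b)

print(f"Query-document relevance score: {sparse_dot(result, doc_result):.3f}")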

Training Details

V33 Base Training

The base retrieval model was trained from the skt/A.X-Encoder-base checkpoint on 4.84M Korean-language triplets spanning 46 diverse datasets.

Parameter        Value
Base model       skt/A.X-Encoder-base (ModernBERT)
Parameters       149M
Training data    4.84M triplets (46 shards)
Loss             InfoNCE + FLOPS regularization
FLOPS lambda_q   0.01
FLOPS lambda_d   0.003
Batch size       2048 (effective)
Learning rate    5e-5
Epochs           25
Hardware         8x NVIDIA B200 (183GB each)
Precision        bf16
Final nz_q       35
Final nz_d       58
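
For reference, here is a minimal sketch of the InfoNCE + FLOPS objective listed above, with the table's lambdas, assuming batched SPLADE vectors of shape (batch, vocab); the actual training code ships in finetune/scripts/finetune.py and may differ in details:

import torch
import torch.nn.functional as F

def flops_loss(reps):
    # FLOPS regularizer: squared mean activation per vocabulary dimension,
    # pushing rarely-useful dimensions toward exact zero.
    return (reps.mean(dim=0) ** 2).sum()

def infonce_flops(q, d_pos, d_neg, lambda_q=0.01, lambda_d=0.003):
    docs = torch.cat([d_pos, d_neg], dim=0)            # (2B, V)
    scores = q @ docs.T                                # (B, 2B): in-batch + hard negatives
    labels = torch.arange(q.size(0), device=q.device)  # positive for query i is docs[i]
    loss = F.cross_entropy(scores, labels)
    return loss + lambda_q * flops_loss(q) + lambda_d * flops_loss(docs)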

V6 E-Commerce Fine-Tuning

Fine-tuned from V33 checkpoint with domain-specific e-commerce data while preserving general retrieval quality.

Parameter          Value
Training data      3K e-commerce + 12K general triplets (20:80 ratio)
E-commerce source  Korean Wikipedia product articles
Negative mining    TF-IDF, char_wb n-grams (2, 3)
Loss               InfoNCE + FLOPS (same as V33, no KD)
Learning rate      2e-6
Batch size         64 per GPU
Grad accumulation  4
Epochs             10
Total steps        77
Final nz_q         40
Final nz_d         68
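
The hard negatives were mined with character n-gram TF-IDF, which surfaces lexically similar but wrong products. A minimal sketch of this style of mining (the corpus, queries, and variable names here are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["갤럭시 버즈3 프로 무선 이어폰", "LG 톤프리 무선 이어버드", "다이슨 무선 청소기"]
queries = ["삼성 무선 이어폰"]  # "Samsung wireless earbuds"
positives = [0]                 # gold passage index per query

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
doc_vecs = vectorizer.fit_transform(corpus)
sims = cosine_similarity(vectorizer.transform(queries), doc_vecs)

for qi, gold in enumerate(positives):
    ranked = sims[qi].argsort()[::-1]
    # Hard negative: the most lexically similar passage that is not the positive
    hard_neg = next(i for i in ranked if i != gold)
    print(f"query={queries[qi]!r} -> hard negative: {corpus[hard_neg]}")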

Key lessons from e-commerce fine-tuning:

  • MarginMSE/KD is harmful for sparse fine-tuning (destroys learned representations)
  • Must match original training recipe exactly (single neg, same FLOPS lambda)
  • 20:80 domain/general ratio prevents catastrophic forgetting
  • Category balance: 20 e-commerce categories (electronics, fashion, beauty, food, appliances, furniture, kitchen, sports, automotive, pets, baby, health supplements, stationery, books, music, household, crafts, instruments, camping, gardening)

Fine-Tuning Toolkit

A complete toolkit for fine-tuning this model with your domain-specific data is included in the finetune/ directory.

Quick Start

# Clone the fine-tuning toolkit
git clone https://huggingface.co/sewoong/korean-neural-sparse-encoder
cd korean-neural-sparse-encoder/finetune

# Install dependencies
pip install -r requirements.txt

# Prepare your data (CSV/TSV -> JSONL)
python scripts/prepare_data.py --input your_data.csv --output data/your_domain/

# Fine-tune
python scripts/finetune.py --config configs/finetune.yaml

# Test encoding
python scripts/encode.py --model sewoong/korean-neural-sparse-encoder --text "samsung galaxy wireless earbuds"
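
prepare_data.py converts tabular data into JSONL triplets. The exact schema is defined by the script; a plausible line, assuming query/positive/negative fields ("Samsung wireless earbuds" / "Galaxy Buds3 Pro wireless earbuds" / "Dyson cordless vacuum"):

{"query": "삼성 무선 이어폰", "positive": "갤럭시 버즈3 프로 무선 이어폰", "negative": "다이슨 무선 청소기"}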

Toolkit Contents

File                                        Description
finetune/scripts/finetune.py                Self-contained fine-tuning script (InfoNCE + FLOPS)
finetune/scripts/encode.py                  Sparse vector encoding and similarity
finetune/scripts/prepare_data.py            CSV/TSV to JSONL data converter
finetune/configs/finetune.yaml              Proven V33-ecom-v6 training recipe
finetune/notebooks/finetune_tutorial.ipynb  Step-by-step Colab tutorial
finetune/data/sample/                       50 train + 10 val Korean e-commerce sample triplets

GitHub repository: mateon01/korean-neural-sparse-encoder

Citation

@misc{korean-neural-sparse-encoder,
  title={Korean Neural Sparse Encoder},
  author={Sewoong Kim},
  year={2026},
  url={https://huggingface.co/sewoong/korean-neural-sparse-encoder}
}