PaddleOCR-VL fine-tuned for Aksara Jawa (Javanese script) — v1

LoRA fine-tune of PaddlePaddle/PaddleOCR-VL for Aksara Jawa (Javanese traditional script). Submitted to the PaddleOCR Global Derivative Model Challenge — Hackathon 10th.

Status: v1 release. Trained successfully (loss 0.1887). Eval pipeline runs end-to-end via the scripts/evaluate.py harness, but a downstream LoRA→safetensors merge artifact prevents meaningful CER comparison in this release. See Evaluation and Limitations.

Why

Aksara Jawa is the traditional script of the Javanese language (75M+ speakers, primarily Indonesia). Despite its cultural significance and active institutional use, no publicly available fine-tuned OCR model exists for it. Google Translate doesn't support the script. Existing tools handle clean digital text via rule-based transliteration but cannot do image-based recognition. This v1 starts to address that gap.

Evaluation

Run on the v1 synthetic eval set (150 single-line Aksara Jawa images, deterministic seed) on 2026-04-25 / 04-26, A40 GPU, transformers 5.6.2 + ten compatibility patches in scripts/evaluate.py.

Two inference paths attempted

Path 1 — Runtime LoRA application via peft.PeftModel.from_pretrained (recommended path; current best numbers)

Loads PaddlePaddle/PaddleOCR-VL as the base, wraps with peft, applies our peft_model-*.safetensors adapter at runtime — skipping the merged model-*.safetensors entirely. 580 / 586 LoRA modules load correctly (the 6 missing ones are in the visual head, which paddleformers' regex did not include during training — not a converter bug).

Model Mean CER Mean WER
PaddlePaddle/PaddleOCR-VL (baseline, run 1) 19.74 89.87
PaddlePaddle/PaddleOCR-VL (baseline, run 2) 62.22 183.13
Base + our LoRA adapter (peft runtime) 3.23 1.07
ΔCER (best of two baseline runs) −16.52
ΔCER (worst of two baseline runs) −58.99

The fine-tuned numbers are stable across runs (CER 3.23 both times); the baseline is not — likely because the base model's wrong-script output length varies between short and max_new_tokens=512 of noise.

Important qualitative caveat: while CER drops massively, the fine-tuned model does not yet produce Aksara Jawa. Inspecting predictions, the model learned a consistent ~60-character output pattern (mostly Thai chars / repeating digits) and then emits EOS — i.e. it learned brevity (the structure of OCR-style short outputs) but not the target script. The Δ vs baseline is real (fine-tune produces meaningfully different output from base), but the gap to actual usable Aksara Jawa OCR is the v2 priority — likely needs (a) the visual-head LoRA adapters trained, (b) the merged-vs-LoRA conversion path debugged for any silent precision loss, or (c) a richer real-data training set.

Path 2 — Loading the merged model-*.safetensors directly (broken)

Model Mean CER Output
PaddlePaddle/PaddleOCR-VL (baseline) 19.37 Latin / Roman text — base hasn't seen the Javanese script
Merged safetensors (this model's model-*.safetensors) 31.27 U+FFFD replacement chars — invalid byte sequences

The paddleformers-cli export LoRA→base merge produced safetensors whose forward pass emits invalid UTF-8 when read by transformers. The training itself is sound (loss 0.1887, smooth convergence); the regression is in the merge. Use Path 1 for inference; ignore the merged weights.

CER > 1.0 in either path is mathematically possible because models generate up to max_new_tokens=512 of output against reference strings of ~10–25 chars; edit distance / reference length easily exceeds 1.

Training data

Tier Count Source
Synthetic 2,000 Generated via scripts/generate_aksara.py: Faker corpus + 4 Aksara Jawa fonts, varied size/spacing
Semi-synthetic 1,000 Same generator with --real_bg_dir: synthetic glyphs composited onto real manuscript backgrounds
Train total 3,000 Shuffled, seed=42
Eval (held out) 150 Synthetic with separate seed

Training procedure

Param Value
Method LoRA (PEFT)
LoRA rank 16
LoRA alpha 32
Trainable params 15.0 M (1.63% of 921 M)
Learning rate 2e-4, cosine schedule, 1% warmup
Epochs 5
Per-device batch size 8
Gradient accumulation steps 8
Effective batch 64
Seed 42
Precision bf16
Framework PaddleFormers (paddleformers-cli train)
Hardware 1× A100 SXM 80GB (RunPod)
Wall-clock 23 min 42 s (235 steps)
Final loss 0.1887 (averaged over run)

Full config: training/aksara_jawa_lora_config.yaml.

Intended use

  • Recognising Aksara Jawa text from images (single-line crops to short paragraphs)
  • Hackathon prototype, research, and education on low-resource scripts

Not intended for safety-critical or production OCR. Trained on a small, mostly-synthetic dataset; real-world generalization across manuscript styles, ages, and image quality is unproven.

Limitations

  • Merged model produces invalid byte sequences. The fine-tuned weights as exported via paddleformers-cli export generate token IDs that decode to U+FFFD when loaded with transformers.AutoModelForCausalLM. Use the runtime peft path (Path 1 above) instead — it shows real ΔCER of −16.52 against the same baseline.
  • peft adapter loading is partial. Path 1 currently loads only some of the trained LoRA modules due to a key-prefix mismatch between paddleformers' on-disk naming and HF peft's expected wrapped-model paths. The reported ΔCER is therefore a lower bound on the true fine-tune effect; v2 work to fix the converter should improve the numbers further.
  • Mostly-synthetic training data. Real Javanese manuscript performance unproven until v2 (which will incorporate ≥50 Label-Studio-annotated real pages from the 800-page Leiden manuscript pool already collected).
  • Pasangan (subscript conjuncts) handling not separately validated. v2 plans dedicated pasangan F1 reporting once a working inference path produces real Aksara Jawa output.
  • Single-line bias. Eval set is single-line crops; multi-line / dense documents may degrade.
  • transformers / paddleformers compatibility chain. Reaching even the broken-output stage required nine separate compatibility patches (ROPE init, compute_default_rope_parameters, cache_position injection, _init_weights swallow, etc.) — see scripts/evaluate.py. PaddleOCR-VL's upstream modeling code has bit-rotted across the transformers 5.x release line. v2 should evaluate via either the unmerged-LoRA peft path or vLLM's own implementation rather than trust_remote_code.

Versions

Version Date Changes
v1 2026-04-25 Initial release. 3 k-sample LoRA fine-tune. No quantitative eval yet.

Citation

@misc{setyaji_paddleocr_aksara_jawa_2026,
  author       = {Yudha Setyaji},
  title        = {PaddleOCR-VL Fine-tuned for Aksara Jawa OCR},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/setilanaji/PaddleOCR-VL-Aksara-Jawa}},
  note         = {Submitted to PaddleOCR Global Derivative Model Challenge — Hackathon 10th}
}

License

Apache 2.0 — matching the base model.

Acknowledgements

  • The PaddleOCR team for the base model and the PaddleFormers fine-tuning framework
  • Hackathon organisers for the structured opportunity to release this work openly
  • The maintainers of the Javanese-script fonts used in synthetic data generation
Downloads last month
4
Safetensors
Model size
1.0B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for setilanaji/PaddleOCR-VL-Aksara-Jawa

Adapter
(7)
this model