Dolphin on Apple Neural Engine (Core ML)

This repository packages an Apple Neural Engine–friendly Core ML export of ByteDance’s Dolphin document image parsing model (Donut-Swin encoder + mBART decoder), plus minimal Python code to run the analyze → parse pipeline (JSON outputs only).

It is intended for local, on‑device inference on macOS / iOS with Core ML (the conversion scripts set an iOS 18 deployment target).

Not affiliated with ByteDance. This repo is a format conversion of the upstream weights; it does not introduce new training data.

What’s included

Repository layout:

dolphin_encoder.mlpackage/      # Vision encoder (fp16, static 896×896)
dolphin_decoder.mlpackage/      # Decoder (fp16, stateful KV cache, logits-only)
hf_model_main/                  # Config/tokenizer + model.safetensors weights (included)
ane_end_to_end.py               # Two-stage pipeline (CoreML-only, JSON outputs)
ane_dolphin_encoder_mil.py      # Build encoder mlpackage from safetensors
ane_dolphin_mil.py              # Build decoder mlpackage from safetensors
docs/*.md                       # CoreML I/O and usage notes
requirements.txt                # Runtime + conversion deps (coremltools, torch, transformers, pymupdf…)
demo/2306.02572v1.pdf           # Sample PDF (kept small; delete if you don’t need it)

I/O shapes (important)

Encoder (dolphin_encoder.mlpackage):

  • Input: pixel_values (float16[1, 3, 896, 896], NCHW)
  • Output: encoder_hidden (float16[1, 784, 1024], i.e. 28×28 tokens)
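
A shape-only sketch of one encoder call with coremltools; the real preprocessing (resize/pad to 896×896 plus normalization) lives in ane_end_to_end.py, and the zero tensor below is just a placeholder:

import numpy as np
import coremltools as ct

encoder = ct.models.MLModel("dolphin_encoder.mlpackage")

pixels = np.zeros((1, 3, 896, 896), dtype=np.float16)   # placeholder NCHW input
enc_hidden = encoder.predict({"pixel_values": pixels})["encoder_hidden"]
print(enc_hidden.shape)                                  # (1, 784, 1024)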

Decoder (dolphin_decoder.mlpackage):

  • Inputs per step:
    • dec_ids (int32[1, 1])
    • pos (int32[1])
    • enc_hidden (float16[1, ENC_SEQ, 1024], padded from 784 tokens)
    • enc_mask (float16[1, 1, 1, ENC_SEQ])
  • Outputs per step:
    • logits (float16[1, 1, vocab])
  • KV cache is stored as Core ML state (created via decoder.make_state()).

ENC_SEQ and DEC_SEQ are compile-time constants (set when you run the conversion scripts).
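
Putting the decoder I/O together, a single greedy step looks roughly like the sketch below. It assumes coremltools 8+ for the stateful predict API; enc_hidden and enc_mask are prepared as in ane_end_to_end.py (not shown here).

import numpy as np
import coremltools as ct

decoder = ct.models.MLModel("dolphin_decoder.mlpackage")
state = decoder.make_state()            # fresh KV cache for one generation

def decode_step(token_id, pos, enc_hidden, enc_mask):
    # enc_hidden: float16[1, ENC_SEQ, 1024]; enc_mask: float16[1, 1, 1, ENC_SEQ]
    out = decoder.predict(
        {
            "dec_ids": np.array([[token_id]], dtype=np.int32),
            "pos": np.array([pos], dtype=np.int32),
            "enc_hidden": enc_hidden,
            "enc_mask": enc_mask,
        },
        state=state,
    )
    return int(np.argmax(out["logits"][0, -1]))   # greedy argmax over the vocab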

Quick start (Python, macOS)

1) Install

python -m venv .venv
source .venv/bin/activate

pip install -r requirements.txt

2) Run the CoreML pipeline (multi-page PDF or image)

DEC_COMPUTE_UNITS=CPU_AND_NE \
SAVE_OUTPUTS=1 SAVE_INTERMEDIATE_JSON=1 \
SAMPLE_IMG=./demo/2306.02572v1.pdf \
python3 ane_end_to_end.py

Outputs:

  • outputs/recognition_json/<page>.json (per-page stage2 results)
  • outputs/intermediate_json/<page>.json (stage1 + stage2)
  • outputs/pdf_renders/<page>.png (only if input was PDF; for inspection)
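
A small sketch for collecting the per-page results after a run; it only assumes one JSON file per page and makes no assumption about the schema inside:

import json
from pathlib import Path

for path in sorted(Path("outputs/recognition_json").glob("*.json")):
    page = json.loads(path.read_text())
    print(path.name, "->", type(page).__name__)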

Reproducing the conversion (build the .mlpackage files)

The conversion scripts build Core ML “mlprogram” models directly in MIL and embed the upstream weights.

1) Install conversion deps

pip install -r requirements.txt

2) Download the upstream Dolphin weights

The scripts expect a safetensors weight file at:

hf_model_main/model.safetensors

Two common options:

# Option A (simple, downloads the whole upstream repo including weights)
git lfs install
git clone https://huggingface.co/ByteDance/Dolphin hf_model_main

# Option B (download only the safetensors via the HF CLI)
huggingface-cli download \
  --local-dir hf_model_main --local-dir-use-symlinks False \
  ByteDance/Dolphin \
  --include "model.safetensors"
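
If you prefer a Python API, the same file can be fetched with huggingface_hub (a sketch equivalent to Option B; requires pip install huggingface_hub):

from huggingface_hub import hf_hub_download

# Download only model.safetensors into hf_model_main/
hf_hub_download(
    repo_id="ByteDance/Dolphin",
    filename="model.safetensors",
    local_dir="hf_model_main",
)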

3) Build Core ML packages

# Encoder (static 896×896)
python ane_dolphin_encoder_mil.py

# Decoder (KV-cache state; defaults: ENC_SEQ=2048, DEC_SEQ=4096)
python ane_dolphin_mil.py

Outputs:

  • dolphin_encoder.mlpackage
  • dolphin_decoder.mlpackage

4) (Optional) Tune sequence lengths

You can reduce memory/latency by compiling with smaller constants:

ENC_SEQ=1024 DEC_SEQ=2048 python ane_dolphin_mil.py

Notes:

  • DEC_SEQ controls KV-cache size. With DEC_SEQ=4096, the KV state alone is ~160 MB (10 layers × (K+V) × 16×4096×64 FP16).
  • ENC_SEQ must be ≥ 784 (the encoder produces 28×28=784 tokens).
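
For the default DEC_SEQ=4096, the ~160 MB figure above works out as follows (10 layers, K and V, 16 heads × 64 head dim, 2 bytes per FP16 value):

layers, heads, head_dim = 10, 16, 64
dec_seq = 4096
kv_bytes = layers * 2 * heads * dec_seq * head_dim * 2   # K and V, FP16 = 2 bytes
print(f"{kv_bytes / 2**20:.0f} MiB")                     # -> 160 MiB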

Limitations and known issues

  • Static image size: encoder is compiled for 896×896. The processor resizes/pads to that size.
  • Greedy decoding: the included runner uses greedy argmax; it does not replicate HF generation features (beam search, sampling, etc.).
  • ANE is best-effort: Core ML may fall back to GPU/CPU for unsupported ops. Always profile on your target device.
  • Not a drop-in Transformers model: this repo exports Core ML packages; it is not intended for HF Inference Providers.
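
Related to the ANE note above: compute units can be pinned when loading the packages with coremltools (a sketch; individual ops may still fall back at runtime):

import coremltools as ct

decoder = ct.models.MLModel(
    "dolphin_decoder.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_NE,   # ask for CPU + Neural Engine
)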

License

  • Conversion code in this repository: MIT (see LICENSE).
  • Model weights: this is a conversion of ByteDance/Dolphin and inherits the upstream model license.

Citation (upstream)

@inproceedings{dolphin2025,
  title={Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting},
  author={Feng, Hao and Wei, Shu and Fei, Xiang and Shi, Wei and Han, Yingdong and Liao, Lei and Lu, Jinghui and Wu, Binghong and Liu, Qi and Lin, Chunhui and Tang, Jingqun and Liu, Hao and Huang, Can},
  year={2025},
  booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL)}
}

Acknowledgements

  • Upstream model + paper: ByteDance/Dolphin
  • Core ML tooling: coremltools
  • Processor/tokenizer ecosystem: transformers