Qwen3.6-27B MTPLX Optimized

Run this with MTPLX

MTPLX is an MLX-native runtime for native Multi-Token-Prediction speculative decoding on Apple Silicon. Up to 2.24× faster decode at real coding temperatures (temp=0.6 / top_p=0.95 / top_k=20) using the model's own built-in MTP heads — no external drafter, no greedy hack.

pip install mtplx
mtplx start

Project: github.com/youssofal/MTPLX

This artifact pairs the Qwen3.6-27B trunk — MLX-quantized with MTPLX's gdn8-speed4 policy (8-bit Gated Delta Network linears, 4-bit MLP, BF16 norms) — with a calibrated INT4 Multi-Token-Prediction sidecar grafted onto the trunk. The MTP head is what enables native speculative decoding: the model drafts its own tokens, with no external draft model required.
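
The 'MLX-affine mixed-precision' wording maps onto MLX's grouped affine quantization primitives. A minimal sketch of what the gdn8-speed4 split means in those terms, with an illustrative (made-up) layer-routing function; this is not MTPLX's actual conversion code:

import mlx.core as mx

def bits_for(layer_name: str) -> int | None:
    # Illustrative policy only: 8-bit for Gated Delta Network linears,
    # 4-bit for MLP linears, None = keep BF16 (norms).
    if "norm" in layer_name:
        return None
    if "linear_attn" in layer_name or "gdn" in layer_name:
        return 8
    return 4

# MLX's grouped affine quantization of a single weight matrix.
w = mx.random.normal((4096, 4096)).astype(mx.bfloat16)
bits = bits_for("mlp.down_proj")                      # -> 4
w_q, scales, biases = mx.quantize(w, group_size=64, bits=bits)
w_hat = mx.dequantize(w_q, scales, biases, group_size=64, bits=bits)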

MTPLX accepts those draft tokens with mathematically exact probability-ratio acceptance and residual correction, so the speculative path stays distribution-preserving at realistic coding settings (temperature=0.6, top_p=0.95, top_k=20) — not just greedy.
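
For reference, the probability-ratio rule that keeps the output distribution exact is the standard speculative-sampling acceptance test. A toy numpy sketch (not MTPLX's implementation; temperature, top_p and top_k filtering are assumed to have already been applied to both distributions):

import numpy as np

def accept_or_correct(p, q, drafted, rng):
    # p: trunk (verifier) probs, q: MTP head (draft) probs,
    # drafted: token drawn from q. The emitted token is always distributed as p.
    ratio = p[drafted] / max(q[drafted], 1e-12)
    if rng.random() < min(1.0, ratio):
        return drafted, True                 # draft token accepted
    residual = np.maximum(p - q, 0.0)        # residual correction distribution
    residual /= residual.sum()
    return rng.choice(len(p), p=residual), False

rng = np.random.default_rng(0)
p = np.array([0.70, 0.20, 0.10])             # verifier distribution
q = np.array([0.50, 0.40, 0.10])             # draft distribution
token, accepted = accept_or_correct(p, q, drafted=1, rng=rng)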

You can also:

  • Inspect the architecture and MTP tensors with any safetensors reader (this and the mlx-lm path below are sketched after this list).
  • Use the trunk weights with mlx-lm for ordinary autoregressive decoding (the MTP head is sidecar-only and ignored by mlx-lm).
  • Read the calibration / quantization metadata in mtplx_runtime.json and config.json to understand the build.
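
A minimal sketch of the first two items. The file and repo names are the ones on this card, but treating the repo id as directly loadable by mlx_lm.load is an assumption; substitute a local download path if needed:

from safetensors import safe_open
from mlx_lm import load, generate

# 1) Inspect the MTP sidecar with a plain safetensors reader.
with safe_open("mtp.safetensors", framework="numpy") as f:
    for name in list(f.keys())[:10]:
        print(name, f.get_slice(name).get_shape())

# 2) Ordinary autoregressive decoding of the trunk with mlx-lm
#    (the MTP sidecar is simply ignored on this path).
model, tokenizer = load("Youssofal/Qwen3.6-27B-MTPLX-Optimized")
print(generate(model, tokenizer, prompt="Write a quicksort in Python.", max_tokens=128))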

What's in this checkpoint

Component                                    | Format
---------------------------------------------|-------
Trunk text + vision weights                  | MLX-affine mixed-precision: 8-bit Gated Delta Network linears, 4-bit MLP linears, BF16 norms
MTP head sidecar (mtp.safetensors)           | Calibrated CyanKiwi prequantized INT4 with BF16 MTP norms
Vision encoder (model-vision-*.safetensors)  | BF16, intact for multimodal use
Runtime contract (mtplx_runtime.json)        | Pins architecture, recommended profile, and exactness baseline
Tokenizer + chat template                    | Qwen3.6 vocabulary (248k tokens)
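
To check that these components are actually present in a download, a short sketch using huggingface_hub (the repo id is this card's; exact shard names may differ):

from huggingface_hub import list_repo_files

files = list_repo_files("Youssofal/Qwen3.6-27B-MTPLX-Optimized")
for expected in ("mtp.safetensors", "mtplx_runtime.json", "config.json"):
    print(expected, "present:", expected in files)
print("vision shards:", [f for f in files if f.startswith("model-vision")])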

The MTP head is grafted from a separately calibrated INT4 sidecar (Qwen3.6-27B-MTPLX-CyanKiwi-Packed-BF16-INT4-v3) onto the MTPLX-specific GDN8-Speed4 trunk. This combination outperforms BF16 MTP on D2/D3/D4 acceptance under MTPLX's committed-history cache contract.

MTP draft acceptance

These numbers describe the MTP head's draft quality — a property of the model itself, independent of any runtime's wall-clock throughput. Per-position acceptance under exact probability-ratio sampling at temperature=0.6, top_p=0.95, top_k=20:

Depth | This checkpoint | vLLM MTP-5 oracle (3090, same temp)
------|-----------------|------------------------------------
1     | 97.62%          | 92.7%
2     | 95.24%          | 77.0%
3     | 88.10%          | 63.0%
4     | 75.61%          | 50.9%
5     | –               | 43.0%

This checkpoint shows higher acceptance at every depth than vLLM's MTP-5 implementation on the same Qwen3.6 family, measured on long_code 192-token prompts.
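
One way to read the table: if a rejection at depth k discards all deeper drafts and per-depth acceptances are treated as independent, the expected number of committed draft tokens per verification step is the sum of the cumulative products. A back-of-the-envelope sketch with this checkpoint's numbers (an illustration of the relationship, not a measured throughput figure):

# Expected committed draft tokens per verification step, assuming a rejection
# at depth k discards everything deeper and depths are independent.
accept = [0.9762, 0.9524, 0.8810, 0.7561]   # depths 1-4 from the table above

expected, running = 0.0, 1.0
for a in accept:
    running *= a
    expected += running
print(round(expected, 2))   # ~3.34, plus the verifier's own token each step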

Provenance

  • Base model: Qwen/Qwen3.6-27B (Apache 2.0).
  • Quantization policy: mtplx-gdn8-speed4 — MLX-affine mixed-precision with uniform 8-bit GDN linears, 4-bit MLP, 4-bit lm_head, and BF16 for the norms and the MTP head's fc projection.
  • MTP sidecar: cyankiwi-calibrated-int4-prequantized, calibrated separately with MLX-affine quantization and grafted onto the GDN8-Speed4 trunk.
  • Runtime contract: mtplx_runtime.json pins the architecture (qwen3-next-mtp), recommended profile, and exactness baseline.

Limitations

  • The MTPLX runtime is not yet released. Without it, you can still use the trunk weights with mlx-lm for ordinary AR decoding — but the MTP draft path that this checkpoint was built for requires MTPLX.
  • Apple Silicon focus. MTPLX targets MLX as its primary backend; CUDA / x86 are not supported.
  • Verified architecture is Qwen3-Next. MTPLX recognizes other MTP architectures (DeepSeek V3 MTP, GLM4 MoE MTP, MiMo, MiniMax M2 MTP, etc.) but only Qwen3-Next-class artifacts have a verified runtime contract today.

License

This checkpoint is released under the Apache License 2.0, matching the Qwen3.6-27B base model.

Citation

@misc{mtplx2026,
  author       = {Youssof Al},
  title        = {MTPLX: Native MTP speculative decoding on Apple Silicon},
  year         = {2026},
  howpublished = {\url{https://github.com/youssofal/mtplx}}
}
