# Qwen3.6-27B MTPLX Optimized

**Run this with MTPLX.**
MTPLX is an MLX-native runtime for native Multi-Token-Prediction (MTP) speculative decoding on Apple Silicon. Up to 2.24× faster decode at real coding temperatures (`temp=0.6` / `top_p=0.95` / `top_k=20`) using the model's own built-in MTP heads: no external drafter, no greedy hack.
```sh
pip install mtplx
mtplx start
```
Project: github.com/youssofal/MTPLX
Other MTPLX checkpoints:
- Qwen3.6-27B-MTPLX-Optimized-Speed — 4-bit flagship speed (63 TPS on M5 Max)
- Qwen3.5-4B-MTPLX-Optimized-Speed — small 4-bit speed-test
- Qwen3.5-4B-Optimized-MTPLX — small 8-bit
This artifact pairs the Qwen3.6-27B trunk — MLX-quantized with MTPLX's gdn8-speed4 policy (8-bit Gated Delta Network linears, 4-bit MLP, BF16 norms) — with a calibrated INT4 Multi-Token-Prediction sidecar grafted onto the trunk. The MTP head is what enables native speculative decoding: the model drafts its own tokens, with no external draft model required.
MTPLX accepts those draft tokens with mathematically exact probability-ratio acceptance and residual correction, so the speculative path stays distribution-preserving at realistic coding settings (temperature=0.6, top_p=0.95, top_k=20) — not just greedy.
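The "mathematically exact probability-ratio acceptance and residual correction" mentioned above is the standard rejection rule from speculative sampling; MTPLX's internals are not public, so the following is a generic sketch of that rule, not MTPLX code. A drafted token `x` is accepted with probability `min(1, p(x)/q(x))`, where `p` is the target model's distribution and `q` the draft head's; on rejection, a replacement is drawn from the normalized residual `max(p − q, 0)`, which makes the committed token exactly `p`-distributed.

```python
import numpy as np

def accept_or_resample(p, q, draft_token, rng):
    """One step of probability-ratio (rejection) sampling.

    p: target-model distribution over the vocab (sums to 1)
    q: draft distribution the token was sampled from
    Returns (token, accepted). The committed token is distributed
    exactly according to p, regardless of q.
    """
    ratio = p[draft_token] / max(q[draft_token], 1e-12)
    if rng.random() < min(1.0, ratio):
        return draft_token, True  # draft token accepted as-is
    # Rejected: resample from the residual distribution max(p - q, 0),
    # renormalized. This correction preserves the target distribution.
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p), p=residual), False
```

Applied per depth, this is why acceptance rates compound: a depth-4 draft token is only committed if depths 1 through 3 were accepted first.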
You can also:

- Inspect the architecture and MTP tensors with any `safetensors` reader.
- Use the trunk weights with `mlx-lm` for ordinary autoregressive decoding (the MTP head is sidecar-only and ignored by `mlx-lm`).
- Read the calibration / quantization metadata in `mtplx_runtime.json` and `config.json` to understand the build.
## What's in this checkpoint
| Component | Format |
|---|---|
| Trunk text + vision weights | MLX-affine mixed-precision: 8-bit Gated Delta Network linears, 4-bit MLP linears, BF16 norms |
| MTP head sidecar (`mtp.safetensors`) | Calibrated CyanKiwi prequantized INT4 with BF16 MTP norms |
| Vision encoder (`model-vision-*.safetensors`) | BF16, intact for multimodal use |
| Runtime contract (`mtplx_runtime.json`) | Pins architecture, recommended profile, and exactness baseline |
| Tokenizer + chat template | Qwen3.6 vocabulary (248k tokens) |
The MTP head is grafted from a separately calibrated INT4 sidecar (Qwen3.6-27B-MTPLX-CyanKiwi-Packed-BF16-INT4-v3) onto the MTPLX-specific GDN8-Speed4 trunk. This combination outperforms BF16 MTP on D2/D3/D4 acceptance under MTPLX's committed-history cache contract.
## MTP draft acceptance
These numbers describe the MTP head's draft quality — a property of the model itself, independent of any runtime's wall-clock throughput. Per-position acceptance under exact probability-ratio sampling at `temperature=0.6`, `top_p=0.95`, `top_k=20`:
| Depth | This checkpoint | vLLM MTP-5 oracle (3090, same temp) |
|---|---|---|
| 1 | 97.62% | 92.7% |
| 2 | 95.24% | 77.0% |
| 3 | 88.10% | 63.0% |
| 4 | 75.61% | 50.9% |
| 5 | — | 43.0% |
Higher acceptance at every depth than vLLM's MTP-5 implementation on the same Qwen3.6 family, measured on long_code 192-token prompts.
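A quick sanity check on what these per-depth rates buy. Treating each rate as the acceptance probability conditional on all shallower positions having been accepted (an assumption; the card does not state whether the rates are conditional or marginal), the expected number of tokens committed per target-model verification step is `1 + Σ_d Π_{k≤d} α_k`:

```python
def expected_tokens_per_step(accept_rates):
    """1 (verified token) + expected length of the accepted draft
    prefix, treating each rate as conditional on shallower accepts."""
    total, prefix = 1.0, 1.0
    for a in accept_rates:
        prefix *= a
        total += prefix
    return total

# Acceptance column from the table above (depths 1-4).
speedup_bound = expected_tokens_per_step([0.9762, 0.9524, 0.8810, 0.7561])
```

Under that assumption the table's rates imply roughly 4.3 tokens committed per step, an upper bound on decode speedup before draft/verify overheads.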
## Provenance
- Base model: `Qwen/Qwen3.6-27B` (Apache 2.0).
- Quantization policy: `mtplx-gdn8-speed4` — MLX-affine mixed-precision with uniform 8-bit GDN linears, 4-bit MLP, 4-bit `lm_head`, BF16 norms and the MTP head's `fc` projection.
- MTP sidecar: `cyankiwi-calibrated-int4-prequantized`, calibrated separately with MLX-affine quantization and grafted onto the GDN8-Speed4 trunk.
- Runtime contract: `mtplx_runtime.json` pins the architecture (`qwen3-next-mtp`), recommended profile, and exactness baseline.
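For readers unfamiliar with the affine scheme named in the policy: a per-group affine quantizer stores low-bit integer codes plus a per-group scale and offset so that `w ≈ codes * scale + zero`. The sketch below is a generic illustration of that idea in NumPy, not MTPLX's or MLX's actual implementation; group size and bit width are arbitrary example values.

```python
import numpy as np

def affine_quantize(w, bits=4, group_size=32):
    """Per-group affine quantization: codes = round((w - min) / scale).

    Returns uint8 codes plus per-group (scale, zero) such that
    w ~= codes * scale + zero. Illustrative only.
    """
    levels = (1 << bits) - 1
    g = w.reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / levels
    codes = np.clip(np.round((g - lo) / scale), 0, levels).astype(np.uint8)
    return codes, scale, lo

def affine_dequantize(codes, scale, zero, shape):
    """Reconstruct the approximate weights from codes + group params."""
    return (codes * scale + zero).reshape(shape)
```

The round-trip error is bounded by half a quantization step per group, which is why separate calibration of the INT4 sidecar (choosing good groupings and ranges) matters for acceptance quality.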
## Limitations
- The MTPLX runtime is not yet released. Without it, you can still use the trunk weights with `mlx-lm` for ordinary AR decoding, but the MTP draft path that this checkpoint was built for requires MTPLX.
- Apple Silicon focus. MTPLX targets MLX as its primary backend; CUDA / x86 are not supported.
- Verified architecture is Qwen3-Next. MTPLX recognizes other MTP architectures (DeepSeek V3 MTP, GLM4 MoE MTP, MiMo, MiniMax M2 MTP, etc.) but only Qwen3-Next-class artifacts have a verified runtime contract today.
## License
This checkpoint is released under the Apache License 2.0, matching the Qwen3.6-27B base model.
## Citation
```bibtex
@misc{mtplx2026,
  author       = {Youssof Al},
  title        = {MTPLX: Native MTP speculative decoding on Apple Silicon},
  year         = {2026},
  howpublished = {\url{https://github.com/youssofal/mtplx}}
}
```
## Links
- Runtime: github.com/youssofal/MTPLX · `pip install mtplx`
- Base model: Qwen/Qwen3.6-27B