Qwen3.6-27B MTPLX Optimized

Run this with MTPLX

MTPLX is an MLX-native runtime for native Multi-Token-Prediction speculative decoding on Apple Silicon. Up to 2.24× faster decode at real coding temperatures (temp=0.6 / top_p=0.95 / top_k=20) using the model's own built-in MTP heads — no external drafter, no greedy hack.

pip install mtplx
mtplx start

Project: github.com/youssofal/MTPLX

This artifact pairs the Qwen3.6-27B trunk — MLX-quantized with MTPLX's gdn8-speed4 policy (8-bit Gated Delta Network linears, 4-bit MLP, BF16 norms) — with a calibrated INT4 Multi-Token-Prediction sidecar grafted onto the trunk. The MTP head is what enables native speculative decoding: the model drafts its own tokens, with no external draft model required.
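
The 'MLX-affine mixed-precision' wording maps onto MLX's grouped affine quantization primitives. A minimal sketch of what the gdn8-speed4 split means in those terms, with an illustrative (made-up) layer-routing function; this is not MTPLX's actual conversion code:

import mlx.core as mx

def bits_for(layer_name: str) -> int | None:
    # Illustrative policy only: 8-bit for Gated Delta Network linears,
    # 4-bit for MLP linears, None = keep BF16 (norms).
    if "norm" in layer_name:
        return None
    if "linear_attn" in layer_name or "gdn" in layer_name:
        return 8
    return 4

# MLX's grouped affine quantization of a single weight matrix.
w = mx.random.normal((4096, 4096)).astype(mx.bfloat16)
bits = bits_for("mlp.down_proj")                      # -> 4
w_q, scales, biases = mx.quantize(w, group_size=64, bits=bits)
w_hat = mx.dequantize(w_q, scales, biases, group_size=64, bits=bits)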

MTPLX accepts those draft tokens with mathematically exact probability-ratio acceptance and residual correction, so the speculative path stays distribution-preserving at realistic coding settings (temperature=0.6, top_p=0.95, top_k=20) — not just greedy.
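
For reference, the probability-ratio rule that keeps the output distribution exact is the standard speculative-sampling acceptance test. A toy numpy sketch (not MTPLX's implementation; temperature, top_p and top_k filtering are assumed to have already been applied to both distributions):

import numpy as np

def accept_or_correct(p, q, drafted, rng):
    # p: trunk (verifier) probs, q: MTP head (draft) probs,
    # drafted: token drawn from q. The emitted token is always distributed as p.
    ratio = p[drafted] / max(q[drafted], 1e-12)
    if rng.random() < min(1.0, ratio):
        return drafted, True                 # draft token accepted
    residual = np.maximum(p - q, 0.0)        # residual correction distribution
    residual /= residual.sum()
    return rng.choice(len(p), p=residual), False

rng = np.random.default_rng(0)
p = np.array([0.70, 0.20, 0.10])             # verifier distribution
q = np.array([0.50, 0.40, 0.10])             # draft distribution
token, accepted = accept_or_correct(p, q, drafted=1, rng=rng)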

You can also:

  • Inspect the architecture and MTP tensors with any safetensors reader (this and the mlx-lm path below are sketched after this list).
  • Use the trunk weights with mlx-lm for ordinary autoregressive decoding (the MTP head is sidecar-only and ignored by mlx-lm).
  • Read the calibration / quantization metadata in mtplx_runtime.json and config.json to understand the build.
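
A minimal sketch of the first two items. The file and repo names are the ones on this card, but treating the repo id as directly loadable by mlx_lm.load is an assumption; substitute a local download path if needed:

from safetensors import safe_open
from mlx_lm import load, generate

# 1) Inspect the MTP sidecar with a plain safetensors reader.
with safe_open("mtp.safetensors", framework="numpy") as f:
    for name in list(f.keys())[:10]:
        print(name, f.get_slice(name).get_shape())

# 2) Ordinary autoregressive decoding of the trunk with mlx-lm
#    (the MTP sidecar is simply ignored on this path).
model, tokenizer = load("Youssofal/Qwen3.6-27B-MTPLX-Optimized")
print(generate(model, tokenizer, prompt="Write a quicksort in Python.", max_tokens=128))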

What's in this checkpoint

Component                                    | Format
---------------------------------------------|-------
Trunk text + vision weights                  | MLX-affine mixed-precision: 8-bit Gated Delta Network linears, 4-bit MLP linears, BF16 norms
MTP head sidecar (mtp.safetensors)           | Calibrated CyanKiwi prequantized INT4 with BF16 MTP norms
Vision encoder (model-vision-*.safetensors)  | BF16, intact for multimodal use
Runtime contract (mtplx_runtime.json)        | Pins architecture, recommended profile, and exactness baseline
Tokenizer + chat template                    | Qwen3.6 vocabulary (248k tokens)
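
To check that these components are actually present in a download, a short sketch using huggingface_hub (the repo id is this card's; exact shard names may differ):

from huggingface_hub import list_repo_files

files = list_repo_files("Youssofal/Qwen3.6-27B-MTPLX-Optimized")
for expected in ("mtp.safetensors", "mtplx_runtime.json", "config.json"):
    print(expected, "present:", expected in files)
print("vision shards:", [f for f in files if f.startswith("model-vision")])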

The MTP head is grafted from a separately calibrated INT4 sidecar (Qwen3.6-27B-MTPLX-CyanKiwi-Packed-BF16-INT4-v3) onto the MTPLX-specific GDN8-Speed4 trunk. This combination outperforms BF16 MTP on D2/D3/D4 acceptance under MTPLX's committed-history cache contract.

MTP draft acceptance

These numbers describe the MTP head's draft quality — a property of the model itself, independent of any runtime's wall-clock throughput. Per-position acceptance under exact probability-ratio sampling at temperature=0.6, top_p=0.95, top_k=20:

Depth | This checkpoint | vLLM MTP-5 oracle (3090, same temp)
------|-----------------|------------------------------------
1     | 97.62%          | 92.7%
2     | 95.24%          | 77.0%
3     | 88.10%          | 63.0%
4     | 75.61%          | 50.9%
5     | –               | 43.0%

This checkpoint shows higher acceptance at every depth than vLLM's MTP-5 implementation on the same Qwen3.6 family, measured on long_code 192-token prompts.
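
One way to read the table: if a rejection at depth k discards all deeper drafts and per-depth acceptances are treated as independent, the expected number of committed draft tokens per verification step is the sum of the cumulative products. A back-of-the-envelope sketch with this checkpoint's numbers (an illustration of the relationship, not a measured throughput figure):

# Expected committed draft tokens per verification step, assuming a rejection
# at depth k discards everything deeper and depths are independent.
accept = [0.9762, 0.9524, 0.8810, 0.7561]   # depths 1-4 from the table above

expected, running = 0.0, 1.0
for a in accept:
    running *= a
    expected += running
print(round(expected, 2))   # ~3.34, plus the verifier's own token each step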

Provenance

  • Base model: Qwen/Qwen3.6-27B (Apache 2.0).
  • Quantization policy: mtplx-gdn8-speed4 — MLX-affine mixed-precision with uniform 8-bit GDN linears, 4-bit MLP, 4-bit lm_head, and BF16 for the norms and the MTP head's fc projection.
  • MTP sidecar: cyankiwi-calibrated-int4-prequantized, calibrated separately with MLX-affine quantization and grafted onto the GDN8-Speed4 trunk.
  • Runtime contract: mtplx_runtime.json pins the architecture (qwen3-next-mtp), recommended profile, and exactness baseline.

Limitations

  • The MTPLX runtime is not yet released. Without it, you can still use the trunk weights with mlx-lm for ordinary AR decoding — but the MTP draft path that this checkpoint was built for requires MTPLX.
  • Apple Silicon focus. MTPLX targets MLX as its primary backend; CUDA / x86 are not supported.
  • Verified architecture is Qwen3-Next. MTPLX recognizes other MTP architectures (DeepSeek V3 MTP, GLM4 MoE MTP, MiMo, MiniMax M2 MTP, etc.) but only Qwen3-Next-class artifacts have a verified runtime contract today.

License

This checkpoint is released under the Apache License 2.0, matching the Qwen3.6-27B base model.

Citation

@misc{mtplx2026,
  author       = {Youssof Al},
  title        = {MTPLX: Native MTP speculative decoding on Apple Silicon},
  year         = {2026},
  howpublished = {\url{https://github.com/youssofal/mtplx}}
}
