Instructions to use machinadeusex/Qwen3.5-4B-OptiQ-4bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use machinadeusex/Qwen3.5-4B-OptiQ-4bit with MLX:
    # Make sure mlx-lm is installed:
    # pip install --upgrade mlx-lm

    # Generate text with mlx-lm
    from mlx_lm import load, generate

    model, tokenizer = load("machinadeusex/Qwen3.5-4B-OptiQ-4bit")

    prompt = "Write a story about Einstein"
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

    text = generate(model, tokenizer, prompt=prompt, verbose=True)

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use machinadeusex/Qwen3.5-4B-OptiQ-4bit with Pi:
Start the MLX server
    # Install MLX LM:
    uv tool install mlx-lm

    # Start a local OpenAI-compatible server:
    mlx_lm.server --model "machinadeusex/Qwen3.5-4B-OptiQ-4bit"

Configure the model in Pi
    # Install Pi:
    npm install -g @mariozechner/pi-coding-agent

Add to ~/.pi/agent/models.json:

    {
      "providers": {
        "mlx-lm": {
          "baseUrl": "http://localhost:8080/v1",
          "api": "openai-completions",
          "apiKey": "none",
          "models": [
            { "id": "machinadeusex/Qwen3.5-4B-OptiQ-4bit" }
          ]
        }
      }
    }

Run Pi
    # Start Pi in your project directory:
    pi

- Hermes Agent
How to use machinadeusex/Qwen3.5-4B-OptiQ-4bit with Hermes Agent:
Start the MLX server
    # Install MLX LM:
    uv tool install mlx-lm

    # Start a local OpenAI-compatible server:
    mlx_lm.server --model "machinadeusex/Qwen3.5-4B-OptiQ-4bit"

Configure Hermes
    # Install Hermes:
    curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
    hermes setup

    # Point Hermes at the local server:
    hermes config set model.provider custom
    hermes config set model.base_url http://127.0.0.1:8080/v1
    hermes config set model.default machinadeusex/Qwen3.5-4B-OptiQ-4bit

Run Hermes
    hermes

- MLX LM
How to use machinadeusex/Qwen3.5-4B-OptiQ-4bit with MLX LM:
Generate or start a chat session
    # Install MLX LM
    uv tool install mlx-lm

    # Interactive chat REPL
    mlx_lm.chat --model "machinadeusex/Qwen3.5-4B-OptiQ-4bit"

Run an OpenAI-compatible server
    # Install MLX LM
    uv tool install mlx-lm

    # Start the server (listens on port 8080 by default)
    mlx_lm.server --model "machinadeusex/Qwen3.5-4B-OptiQ-4bit"

    # Call the OpenAI-compatible server with curl
    curl -X POST "http://localhost:8080/v1/chat/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "machinadeusex/Qwen3.5-4B-OptiQ-4bit",
        "messages": [
          {"role": "user", "content": "Hello"}
        ]
      }'

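The same request can be sent from Python using only the standard library. The following is a minimal sketch, assuming the server is running locally on port 8080 (the `mlx_lm.server` default, and the port used in the Pi and Hermes configurations above). The helper names `build_chat_request` and `chat` are illustrative and not part of mlx-lm.

```python
import json
import urllib.request

# Assumption: mlx_lm.server is running locally on its default port.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "machinadeusex/Qwen3.5-4B-OptiQ-4bit"


def build_chat_request(prompt, max_tokens=200):
    """Build a POST request for the OpenAI-style /chat/completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the prompt to the server and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print the model's reply. Building the request is kept separate from sending it so the payload can be inspected without a live server.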
Qwen3.5-4B-OptiQ-4bit
Mixed-precision quantized with OptiQ — sensitivity-driven quantization for Apple Silicon
This is a mixed-precision quantized version of Qwen/Qwen3.5-4B in MLX format. Unlike uniform quantization (all layers at the same bit-width), OptiQ measures each layer's sensitivity and assigns optimal per-layer bit-widths, preserving model quality where it matters most.
How OptiQ Works
OptiQ is an optimizing compiler that converts PyTorch models into hardware-optimized MLX versions using data-driven mixed-precision quantization:
- Sensitivity Analysis — For each layer, OptiQ simulates quantization at each candidate bit-width and measures the KL divergence between the original and quantized output distributions. Layers that distort the distribution more are "sensitive."
- Greedy Knapsack Optimization — Starting with all layers at the minimum bit-width, OptiQ greedily upgrades the most sensitive layers to higher precision until the target bits-per-weight budget is exhausted.
- Per-Layer Bit Allocation — The result is a custom quantization config where each layer gets the bit-width that maximizes quality within the size budget. Protected layers (embeddings, final layers) are always assigned high precision.
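The three steps above can be sketched in a few lines of Python. This is an illustrative toy, not OptiQ's actual implementation: the function names, the layer names, the parameter counts, and the sensitivity values are all hypothetical, and the loop simply skips any upgrade that does not fit the remaining budget.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete probability distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def allocate_bits(layers, target_bpw, low=4, high=8, protected=()):
    """Greedy allocation: start every layer at `low`, upgrade by sensitivity.

    layers: dict of name -> (num_params, sensitivity), where sensitivity is
            the KL divergence measured with that layer quantized at `low`.
    Returns the per-layer bit-widths and the achieved bits per weight.
    """
    total_params = sum(n for n, _ in layers.values())
    budget = target_bpw * total_params  # total bits available
    bits = {name: low for name in layers}
    used = low * total_params

    # Protected layers are always raised to high precision first.
    for name in protected:
        bits[name] = high
        used += (high - low) * layers[name][0]

    # Upgrade the remaining layers in order of decreasing sensitivity,
    # skipping any upgrade that would exceed the budget.
    ranked = sorted(
        (n for n in layers if n not in protected),
        key=lambda n: layers[n][1],
        reverse=True,
    )
    for name in ranked:
        cost = (high - low) * layers[name][0]
        if used + cost <= budget:
            bits[name] = high
            used += cost
    return bits, used / total_params


# Hypothetical layers: (parameter count, measured sensitivity).
layers = {
    "embed": (100, 0.00),
    "attn.0": (200, 0.90),
    "mlp.0": (400, 0.10),
    "attn.1": (200, 0.50),
    "mlp.1": (400, 0.05),
}
bits, bpw = allocate_bits(layers, target_bpw=5.6, protected=("embed",))
print(bits, round(bpw, 2))
```

In this toy run the protected embedding and the two most sensitive attention layers are raised to 8-bit, while the larger, less sensitive MLP layers stay at 4-bit because upgrading them would exceed the budget.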
Quantization Details
| Property | Value |
|---|---|
| Target BPW | 4.5 |
| Achieved BPW | 4.50 |
| Candidate bits | 4, 8 |
| Layers at 4-bit | 162 |
| Layers at 8-bit | 87 |
| Total quantized layers | 249 |
| Group size | 64 |
| Model size | 2811 MB |
| Uniform 4-bit size | 2258 MB |
| Calibration data | WikiText-2 (2 samples, 128 tokens) |
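As a quick sanity check, a few derived figures can be computed directly from the table. This sketch uses only the numbers listed above; the variable names are illustrative.

```python
# Values copied from the table above.
layers_4bit, layers_8bit = 162, 87
optiq_mb, uniform_mb = 2811, 2258

total_layers = layers_4bit + layers_8bit   # should match "Total quantized layers"
share_8bit = layers_8bit / total_layers    # fraction of layers kept at 8-bit

extra_mb = optiq_mb - uniform_mb           # size added relative to uniform 4-bit
size_ratio = optiq_mb / uniform_mb

# Unweighted mean bit-width per layer. This is not the same quantity as BPW,
# which depends on how many weights each layer holds.
mean_bits_by_layer = (4 * layers_4bit + 8 * layers_8bit) / total_layers

print(total_layers, round(share_8bit, 3), extra_mb, round(size_ratio, 3))
print(round(mean_bits_by_layer, 2))
```

About 35% of the quantized layers are kept at 8-bit, and the mixed-precision model is roughly 24% larger than the uniform 4-bit one.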
Benchmark Results
GSM8K (200 samples, 3-shot chain-of-thought):
| Model | GSM8K Accuracy |
|---|---|
| OptiQ mixed (4.5 BPW) | 81.5% |
| Uniform 4-bit | 79.5% |
OptiQ improves over uniform 4-bit by 2.0 percentage points on this 200-sample subset (163 vs. 159 problems solved), at a model size of 2811 MB versus 2258 MB.
Usage
This model works with standard mlx-lm — no special code needed:
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/Qwen3.5-4B-OptiQ-4bit")

    prompt = "Explain quantum computing in simple terms:"
    response = generate(model, tokenizer, prompt=prompt, max_tokens=200)
    print(response)

Requirements: mlx-lm >= 0.30.7 (for Qwen3.5 architecture support)
    pip install "mlx-lm>=0.30.7"

Architecture
Qwen3.5 uses a hybrid attention architecture with alternating linear_attn and self_attn layers. OptiQ's sensitivity analysis identifies which layers are most sensitive to quantization error and assigns them higher precision, while less sensitive layers get 4-bit quantization to minimize model size.
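To see how the bit-widths are distributed across the two attention types, one option is to tally the per-module entries in the model's quantization config. The sketch below assumes the `quantization` section of `config.json` maps module paths to `{"group_size", "bits"}` dictionaries alongside global defaults; that layout, the function name `summarize_bits`, and the module paths in the example are assumptions for illustration and should be checked against the actual file.

```python
from collections import Counter


def summarize_bits(quantization):
    """Count quantized modules per (layer kind, bit-width)."""
    counts = Counter()
    for path, spec in quantization.items():
        if not isinstance(spec, dict):
            continue  # skip global defaults such as "bits" and "group_size"
        if "linear_attn" in path:
            kind = "linear_attn"
        elif "self_attn" in path:
            kind = "self_attn"
        else:
            kind = "other"
        counts[(kind, spec["bits"])] += 1
    return counts


# Hypothetical excerpt; real module paths may differ.
example = {
    "group_size": 64,
    "bits": 4,
    "model.layers.0.linear_attn.in_proj": {"group_size": 64, "bits": 4},
    "model.layers.3.self_attn.q_proj": {"group_size": 64, "bits": 8},
    "model.layers.3.mlp.down_proj": {"group_size": 64, "bits": 8},
}
summary = summarize_bits(example)
for (kind, bits), n in sorted(summary.items()):
    print(f"{kind:12s} {bits}-bit: {n}")
```

For the real model, the dictionary would come from `json.load` on the downloaded `config.json`, passing its `quantization` entry to `summarize_bits`.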
Article
For more details on the methodology and results, see: Not All Layers Are Equal: Mixed-Precision Quantization for Weights and KV Cache on Apple Silicon
Credits
- Quantization method: OptiQ — optimizing compiler for mixed-precision quantization on Apple Silicon
- Base model: Qwen/Qwen3.5-4B by Qwen Team
- Runtime: MLX by Apple