Libraries

How to use machinadeusex/Qwen3.5-4B-OptiQ-4bit with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("machinadeusex/Qwen3.5-4B-OptiQ-4bit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

How to use machinadeusex/Qwen3.5-4B-OptiQ-4bit with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "machinadeusex/Qwen3.5-4B-OptiQ-4bit"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "machinadeusex/Qwen3.5-4B-OptiQ-4bit"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use machinadeusex/Qwen3.5-4B-OptiQ-4bit with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "machinadeusex/Qwen3.5-4B-OptiQ-4bit"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default machinadeusex/Qwen3.5-4B-OptiQ-4bit

Run Hermes

hermes

MLX LM

How to use machinadeusex/Qwen3.5-4B-OptiQ-4bit with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "machinadeusex/Qwen3.5-4B-OptiQ-4bit"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "machinadeusex/Qwen3.5-4B-OptiQ-4bit"
# From a second terminal, call the OpenAI-compatible server with curl
# (mlx_lm.server listens on port 8080 unless --port is given)
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "machinadeusex/Qwen3.5-4B-OptiQ-4bit",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
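
The same request can be sent from Python. The following is a minimal sketch using only the standard library. It assumes the server is reachable at http://localhost:8080/v1 (the base URL used in the Pi and Hermes configurations above) and that responses follow the usual OpenAI chat-completions shape. The helper names build_chat_request and chat are made up for this example.

```python
import json
import urllib.request

MODEL_ID = "machinadeusex/Qwen3.5-4B-OptiQ-4bit"


def build_chat_request(prompt, model=MODEL_ID,
                       base_url="http://localhost:8080/v1", max_tokens=200):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-style response body: choices[0].message.content.
    """
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, chat("Hello") should return the model's reply as a string. If the server was started with a different --port, pass a matching base_url.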

Qwen3.5-4B-OptiQ-4bit

Mixed-precision quantized with OptiQ — sensitivity-driven quantization for Apple Silicon

This is a mixed-precision quantized version of Qwen/Qwen3.5-4B in MLX format. Unlike uniform quantization (all layers at the same bit-width), OptiQ measures each layer's sensitivity and assigns optimal per-layer bit-widths, preserving model quality where it matters most.

How OptiQ Works

OptiQ is an optimizing compiler that converts PyTorch models into hardware-optimized MLX versions using data-driven mixed-precision quantization:

1. Sensitivity Analysis: For each layer, OptiQ simulates quantization at each candidate bit-width and measures the KL divergence between the original and quantized output distributions. Layers that distort the distribution more are "sensitive."
2. Greedy Knapsack Optimization: Starting with all layers at the minimum bit-width, OptiQ greedily upgrades the most sensitive layers to higher precision until the target bits-per-weight budget is exhausted.
3. Per-Layer Bit Allocation: The result is a custom quantization config where each layer gets the bit-width that maximizes quality within the size budget. Protected layers (embeddings, final layers) are always assigned high precision.
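
The steps above can be illustrated with a toy sketch. This is not OptiQ's implementation: the function names, the layer sizes, and the sensitivity scores are invented for illustration, and the tie-breaking and stopping rules are assumptions (here a layer that does not fit the budget is skipped and smaller ones are still tried). It only shows the general idea of scoring layers by KL divergence and then greedily spending a bits-per-weight budget on the most sensitive ones.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def greedy_allocate(layers, target_bpw, low=4, high=8):
    """Assign `low` or `high` bits to each layer under an average-bits budget.

    layers: dict mapping layer name -> (num_weights, sensitivity), where
    sensitivity is, e.g., the KL divergence measured at `low` bits.
    Returns (bits per layer, achieved average bits per weight).
    """
    bits = {name: low for name in layers}          # start everything at low bits
    total_weights = sum(n for n, _ in layers.values())
    used_bits = low * total_weights
    budget = target_bpw * total_weights
    # Visit layers from most to least sensitive and upgrade while budget allows.
    for name in sorted(layers, key=lambda k: layers[k][1], reverse=True):
        num_weights, _ = layers[name]
        extra = (high - low) * num_weights
        if used_bits + extra <= budget:
            bits[name] = high
            used_bits += extra
    return bits, used_bits / total_weights


# Hypothetical output distributions for one layer: full precision vs. quantized.
reference = [0.5, 0.5]
quantized = [0.9, 0.1]
score = kl_divergence(reference, quantized)  # larger means more sensitive

# Hypothetical layers: name -> (number of weights, sensitivity score).
toy_layers = {"a": (100, 0.9), "b": (100, 0.1), "c": (200, 0.5), "d": (400, 0.05)}
allocation, achieved_bpw = greedy_allocate(toy_layers, target_bpw=4.5)
print(allocation, achieved_bpw)
```

In this toy run only layer "a" is upgraded to 8-bit: it is the most sensitive, and upgrading it uses exactly the spare budget that a 4.5 average allows over 800 weights.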

Quantization Details

Property	Value
Target BPW	4.5
Achieved BPW	4.50
Candidate bits	4, 8
Layers at 4-bit	162
Layers at 8-bit	87
Total quantized layers	249
Group size	64
Model size	2811 MB
Uniform 4-bit size	2258 MB
Calibration data	WikiText-2 (2 samples, 128 tokens)

Benchmark Results

GSM8K (200 samples, 3-shot chain-of-thought):

Model	GSM8K Accuracy
OptiQ mixed (4.5 BPW)	81.5%
Uniform 4-bit	79.5%

On this 200-sample subset, OptiQ improves over uniform 4-bit by 2.0 percentage points (81.5% vs. 79.5%), at the cost of a larger model (2811 MB vs. 2258 MB).
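
To put the reported numbers side by side, the short calculation below derives the size overhead, the share of layers kept at 8-bit, and the accuracy gap expressed in solved problems. All inputs are taken from the tables above. The standard-error line is a rough binomial estimate that assumes independent samples. It is an added illustration, not part of the reported evaluation.

```python
import math

# Figures copied from the Quantization Details and Benchmark Results tables.
n_samples = 200
acc_mixed, acc_uniform = 0.815, 0.795
size_mixed_mb, size_uniform_mb = 2811, 2258
layers_8bit, layers_total = 87, 249

# Accuracy expressed as a number of correctly solved problems.
correct_mixed = round(acc_mixed * n_samples)
correct_uniform = round(acc_uniform * n_samples)

# Extra size of the mixed-precision model relative to uniform 4-bit.
size_overhead_pct = 100 * (size_mixed_mb - size_uniform_mb) / size_uniform_mb

# Share of quantized layers that were upgraded to 8-bit.
frac_8bit_pct = 100 * layers_8bit / layers_total

# Rough binomial standard error of one accuracy estimate (assumes independence).
std_err = math.sqrt(acc_mixed * (1 - acc_mixed) / n_samples)

print(f"solved: {correct_mixed} vs {correct_uniform} of {n_samples}")
print(f"size overhead: {size_overhead_pct:.1f}%")
print(f"layers at 8-bit: {frac_8bit_pct:.1f}%")
print(f"approx. standard error: {100 * std_err:.1f} percentage points")
```

Under these assumptions the gap corresponds to 4 more problems solved out of 200, for a model that is roughly a quarter larger. The standard error of about 2.7 points suggests a larger evaluation set would be needed to pin the difference down precisely.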

Usage

This model works with standard mlx-lm — no special code needed:

from mlx_lm import load, generate

model, tokenizer = load("machinadeusex/Qwen3.5-4B-OptiQ-4bit")

prompt = "Explain quantum computing in simple terms:"
response = generate(model, tokenizer, prompt=prompt, max_tokens=200)
print(response)

Requirements: mlx-lm >= 0.30.7 (for Qwen3.5 architecture support)

pip install "mlx-lm>=0.30.7"

Architecture

Qwen3.5 uses a hybrid attention architecture with alternating linear_attn and self_attn layers. OptiQ's sensitivity analysis identifies which layers are most sensitive to quantization error and assigns them higher precision, while less sensitive layers get 4-bit quantization to minimize model size.
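
One way to see how the bit-widths are spread across the two attention types is to tally them from the quantization config. The sketch below assumes the config is a dictionary with global "bits" and "group_size" entries plus per-layer override entries keyed by module path. That layout, the tally_bits helper, and the module paths in the example are all assumptions for illustration, so check the actual config.json of the repository before relying on it.

```python
from collections import Counter


def layer_kind(path):
    """Classify a module path by the attention type named in it."""
    if "linear_attn" in path:
        return "linear_attn"
    if "self_attn" in path:
        return "self_attn"
    return "other"


def tally_bits(quantization):
    """Count per-layer entries by (layer kind, bit-width).

    quantization: dict with global scalars and per-layer dict overrides
    (assumed layout). Non-dict values are treated as global settings.
    """
    default_bits = quantization.get("bits")
    counts = Counter()
    for path, params in quantization.items():
        if not isinstance(params, dict):
            continue  # skip global scalars such as "bits" and "group_size"
        counts[(layer_kind(path), params.get("bits", default_bits))] += 1
    return counts


# Hypothetical config fragment; the module paths are made up for this example.
example = {
    "group_size": 64,
    "bits": 4,
    "model.layers.0.linear_attn.in_proj": {"group_size": 64, "bits": 4},
    "model.layers.3.self_attn.q_proj": {"group_size": 64, "bits": 8},
    "model.layers.3.mlp.down_proj": {"group_size": 64},
}
counts = tally_bits(example)
for (kind, bits), n in sorted(counts.items()):
    print(f"{kind:12s} {bits}-bit: {n}")
```

For a downloaded copy of the model, the dictionary stored under the "quantization" key of config.json could be passed to tally_bits, provided it follows the assumed layout.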

Article

For more details on the methodology and results, see: Not All Layers Are Equal: Mixed-Precision Quantization for Weights and KV Cache on Apple Silicon

Credits

Quantization method: OptiQ — optimizing compiler for mixed-precision quantization on Apple Silicon
Base model: Qwen/Qwen3.5-4B by Qwen Team
Runtime: MLX by Apple
