gemma-4-31b-it-MLX-4bit
PLE-safe MLX 4bit weights for Google Gemma 4 31B (31B dense) on Apple Silicon.
- 📦 Source & convert scripts: GitHub — FakeRocket543/mlx-gemma4
- 📊 Size: 20.4 GB
⚠️ Existing MLX-quantized Gemma 4 models (mlx-community, unsloth) produce garbage output because they quantize the PLE (Per-Layer Embeddings) layers. This repo provides working quantized weights. See "Why PLE-Safe?" below.
Other Precisions
All Gemma 4 MLX Models
| Model | Params | Precision | Size | Audio |
|---|---|---|---|---|
| gemma-4-e2b-it-MLX-4bit | 2.3B | 4bit | 7.1 GB | ✅ |
| gemma-4-e2b-it-MLX-8bit | 2.3B | 8bit | 8.5 GB | ✅ |
| gemma-4-e2b-it-MLX-bf16 | 2.3B | bf16 | 9.6 GB | ✅ |
| gemma-4-e4b-it-MLX-4bit | 4.5B | 4bit | 10.3 GB | ✅ |
| gemma-4-e4b-it-MLX-8bit | 4.5B | 8bit | 12.3 GB | ✅ |
| gemma-4-e4b-it-MLX-bf16 | 4.5B | bf16 | 16.0 GB | ✅ |
| gemma-4-26b-a4b-it-MLX-4bit | 26B MoE | 4bit | 16.4 GB | — |
| gemma-4-26b-a4b-it-MLX-8bit | 26B MoE | 8bit | 28.6 GB | — |
| gemma-4-26b-a4b-it-MLX-bf16 | 26B MoE | bf16 | 51.6 GB | — |
| gemma-4-31b-it-MLX-4bit | 31B dense | 4bit | 20.4 GB | — |
| gemma-4-31b-it-MLX-8bit | 31B dense | 8bit | 35.1 GB | — |
| gemma-4-31b-it-MLX-bf16 | 31B dense | bf16 | 62.5 GB | — |
Quantization Details
- Bits: 4
- Group size: 64
- Mode: affine
- Strategy: PLE-safe. Only large `nn.Linear` and `SwitchLinear` (MoE) layers are quantized. All PLE/ScaledLinear/vision/audio layers stay in bf16.
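To make the settings above concrete, here is a minimal pure-Python sketch of affine quantization on one group of 64 weights at 4 bits. It illustrates the general scheme (`w ≈ scale * q + bias`), not MLX's exact kernel. The function names and the assumption that each group stores one 16-bit scale and one 16-bit bias are ours.

```python
import random

BITS = 4          # "Bits" from the list above
GROUP_SIZE = 64   # "Group size" from the list above


def quantize_group(weights, bits=BITS):
    """Affine-quantize one group: w ~= scale * q + bias, q in [0, 2**bits - 1]."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for a constant group
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, bias):
    """Map integer codes back to approximate float weights."""
    return [scale * q + bias for q in codes]


def effective_bits_per_weight(bits=BITS, group_size=GROUP_SIZE, meta_bits=16):
    """Code bits plus one scale and one bias (assumed 16-bit each) per group."""
    return bits + 2 * meta_bits / group_size


random.seed(0)
group = [random.gauss(0.0, 0.02) for _ in range(GROUP_SIZE)]  # made-up weights
codes, scale, bias = quantize_group(group)
restored = dequantize_group(codes, scale, bias)
max_error = max(abs(a - b) for a, b in zip(group, restored))

print(f"max rounding error: {max_error:.6f} (bound: {scale / 2:.6f})")
print(f"effective bits per weight: {effective_bits_per_weight()}")
```

Under these assumptions the cost is 4.5 bits per weight, so a nominal 31B parameters would take about 17.4 GB if all of them were quantized. The 20.4 GB listed above is larger, which is what you would expect when some layers stay in bf16.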
Why PLE-Safe?
Gemma 4 uses a novel PLE (Per-Layer Embeddings) architecture with ScaledLinear layers that multiply outputs by a learned scalar. Standard quantization introduces rounding error in these layers, and the scalar amplifies it — producing ionoxffionoxff... garbage.
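The amplification effect can be shown with a toy model. The sketch below is not the actual ScaledLinear implementation (which is not reproduced here); it only assumes a layer of the form `y = scalar * (w · x)` with made-up numbers, and shows that the absolute output error caused by rounded weights grows in proportion to the scalar.

```python
def scaled_linear(weights, inputs, scalar):
    """Toy layer: a dot product followed by a scalar multiply."""
    return scalar * sum(w * x for w, x in zip(weights, inputs))


weights = [0.013, -0.027, 0.041, -0.008]   # made-up "true" weights
rounded = [0.01, -0.03, 0.04, -0.01]       # the same weights after coarse rounding
inputs = [1.0, 2.0, -1.0, 0.5]

for scalar in (1.0, 8.0, 64.0):
    exact = scaled_linear(weights, inputs, scalar)
    approx = scaled_linear(rounded, inputs, scalar)
    print(f"scalar={scalar:5.1f}  output error={abs(exact - approx):.4f}")
```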
Our fix: Only quantize the large decoder nn.Linear and SwitchLinear (MoE expert) layers. Everything else stays bf16:
| Quantized (4bit) | Kept in bf16 |
|---|---|
| Attention projections (q/k/v/o_proj) | ScaledEmbedding (embed_tokens) |
| MLP layers (gate/up/down_proj) | ScaledLinear (PLE pathway) |
| MoE expert layers (SwitchLinear) | Per-layer embeddings (per_layer_*) |
| | Vision encoder |
| | All norms and scalars |
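A selection rule like the one in the table could be written as a predicate over module paths and modules. The sketch below is hypothetical and is not the predicate shipped in the repo: the path fragments `vision` and `audio`, the example module paths, and the stand-in classes are assumptions, and it omits any "large layer" size check. It uses stand-in classes so it runs without MLX installed.

```python
# Substrings of module paths to keep in bf16. "per_layer" and "embed_tokens"
# come from the table above; "vision" and "audio" are assumed path fragments.
SKIP_PATH_PARTS = ("per_layer", "embed_tokens", "vision", "audio")
SKIP_CLASS_NAMES = ("ScaledLinear", "ScaledEmbedding")


def ple_safe_predicate(path, module):
    """Return True only for modules that should be quantized."""
    if any(part in path for part in SKIP_PATH_PARTS):
        return False
    if type(module).__name__ in SKIP_CLASS_NAMES:
        return False
    # Quantizable layers are expected to expose a to_quantized() method.
    return hasattr(module, "to_quantized")


# Stand-in classes (not the real MLX ones) so the sketch is self-contained.
class Linear:
    def to_quantized(self, group_size=64, bits=4):
        return self


class ScaledLinear(Linear):
    pass


class RMSNorm:
    pass


# Hypothetical module paths, for illustration only.
examples = {
    "language_model.layers.0.self_attn.q_proj": Linear(),
    "language_model.layers.0.per_layer_projection": ScaledLinear(),
    "vision_tower.blocks.0.mlp.fc1": Linear(),
    "language_model.layers.0.input_layernorm": RMSNorm(),
}
for path, module in examples.items():
    print(f"{path}: quantize={ple_safe_predicate(path, module)}")

# With MLX installed, a predicate of this shape can be passed as:
#   mlx.nn.quantize(model, group_size=64, bits=4, class_predicate=ple_safe_predicate)
```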
Usage
Prerequisite: Apply the ScaledLinear fix to mlx-vlm (required until the PR is merged upstream):

```bash
pip install mlx-vlm

# Apply fix: overwrite the installed gemma4 language module with the patched one
git clone https://github.com/FakeRocket543/mlx-gemma4.git
cp mlx-gemma4/mlx_vlm_patches/models/gemma4/language.py \
  "$(python -c "import mlx_vlm; print(mlx_vlm.__path__[0])")/models/gemma4/"
```
Important: You must manually apply the chat template. mlx_vlm.generate() does not do this automatically for Gemma 4.
Vision
```python
from mlx_vlm import load, generate

# Load the quantized weights and the matching processor
model, processor = load("FakeRockert543/gemma-4-31b-it-MLX-4bit")
tokenizer = processor.tokenizer

# One user turn containing an image and a text instruction
messages = [{"role": "user", "content": [
    {"type": "image", "url": "photo.jpg"},
    {"type": "text", "text": "Describe this image in detail."},
]}]

# Apply the chat template manually (generate() does not do it for Gemma 4)
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

out = generate(model, processor, prompt, ["photo.jpg"],
               max_tokens=200, repetition_penalty=1.2, temperature=0.7)
print(out.text)
```
Text
```python
# Reuses model, processor, and tokenizer from the Vision example above.
messages = [{"role": "user", "content": "What is the capital of France?"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# temperature=0.0 selects greedy decoding
out = generate(model, processor, prompt, max_tokens=100, temperature=0.0)
print(out.text)
```
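The two examples above use different message shapes: a plain string for text-only prompts and a list of typed parts when images are attached. A small helper can build either one. This is a convenience sketch only; `build_messages` is a hypothetical name and not part of mlx-vlm.

```python
def build_messages(text, image_paths=()):
    """Build a single-turn user message in the two shapes used above."""
    if not image_paths:
        # Text-only prompt: content is a plain string
        return [{"role": "user", "content": text}]
    # Vision prompt: image parts first, then the text instruction
    parts = [{"type": "image", "url": path} for path in image_paths]
    parts.append({"type": "text", "text": text})
    return [{"role": "user", "content": parts}]


print(build_messages("What is the capital of France?"))
print(build_messages("Describe this image in detail.", ["photo.jpg"]))
```

The result can be passed to `tokenizer.apply_chat_template(...)` exactly as in the examples above.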
Bugs Fixed in mlx-vlm
| # | Bug | Impact | Fix |
|---|---|---|---|
| 1 | `ScaledLinear` inherits `nn.Module` not `nn.Linear` | `nn.quantize()` can't find these layers | Change to `ScaledLinear(nn.Linear)` |
| 2 | Standard quantization quantizes PLE layers | Garbage output on 4-bit/8-bit | PLE-safe `class_predicate` skipping PLE/vision/audio |
| 3 | `processor.save_pretrained()` strips `feature_extractor` | Audio silently dropped | Copy `processor_config.json` from source |
| 4 | `SwitchLinear` (MoE) not quantized | 26B-A4B: 49 GB instead of 16 GB | Check `hasattr(module, 'to_quantized')` |
Fixed source files are included in the GitHub repo.
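As an illustration of fix 3, the sketch below copies `processor_config.json` from a source checkpoint directory over the copy written during conversion. The helper name, the directory layout, and the placeholder JSON contents are assumptions for demonstration; the repo's convert script may handle this differently.

```python
import shutil
import tempfile
from pathlib import Path


def restore_processor_config(source_dir, output_dir, name="processor_config.json"):
    """Copy the original processor config over the one written during conversion."""
    source = Path(source_dir) / name
    if not source.is_file():
        raise FileNotFoundError(f"{source} not found in the source checkpoint")
    return Path(shutil.copy2(source, Path(output_dir) / name))


# Demonstration with temporary directories standing in for real checkpoints.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    # Placeholder contents: the real file holds the full processor settings.
    (Path(src) / "processor_config.json").write_text('{"feature_extractor": {}}')
    (Path(dst) / "processor_config.json").write_text("{}")  # stripped copy
    copied = restore_processor_config(src, dst)
    print(copied.read_text())
```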
Convert From Source
```bash
git clone https://github.com/FakeRocket543/mlx-gemma4.git
cd mlx-gemma4
python convert_gemma4.py 31B 4
```
Validation
All 12 variants validated on 10 images + 3 chat prompts. Full results: GitHub.
License
Model weights: Google Gemma License. Scripts: MIT.