Instructions to use froggeric/Mistral-Medium-3.5-128B-MLX-4bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use froggeric/Mistral-Medium-3.5-128B-MLX-4bit with MLX:
# Make sure mlx-vlm is installed # pip install --upgrade mlx-vlm from mlx_vlm import load, generate from mlx_vlm.prompt_utils import apply_chat_template from mlx_vlm.utils import load_config # Load the model model, processor = load("froggeric/Mistral-Medium-3.5-128B-MLX-4bit") config = load_config("froggeric/Mistral-Medium-3.5-128B-MLX-4bit") # Prepare input image = ["http://images.cocodataset.org/val2017/000000039769.jpg"] prompt = "Describe this image." # Apply chat template formatted_prompt = apply_chat_template( processor, config, prompt, num_images=1 ) # Generate output output = generate(model, processor, formatted_prompt, image) print(output) - Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use froggeric/Mistral-Medium-3.5-128B-MLX-4bit with Pi:
Start the MLX server
# Install MLX LM: uv tool install mlx-lm # Start a local OpenAI-compatible server: mlx_lm.server --model "froggeric/Mistral-Medium-3.5-128B-MLX-4bit"
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "mlx-lm": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "froggeric/Mistral-Medium-3.5-128B-MLX-4bit" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use froggeric/Mistral-Medium-3.5-128B-MLX-4bit with Hermes Agent:
Start the MLX server
# Install MLX LM: uv tool install mlx-lm # Start a local OpenAI-compatible server: mlx_lm.server --model "froggeric/Mistral-Medium-3.5-128B-MLX-4bit"
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default froggeric/Mistral-Medium-3.5-128B-MLX-4bit
Run Hermes
hermes
Config fix applied (2026-05-02). The original Mistral source had an incorrect
mscale_all_dim=1.0in the YaRN RoPE config that disabled long-context scaling, degrading performance beyond 4096 tokens. Fixed tomscale_all_dim=0.0per mistralai's commit. No weights were affected — this is a config-only fix.
Why this model?
Mistral Medium 3.5 is Mistral AI's first flagship merged model — a dense 128B parameter model replacing Mistral Medium 3.1, Magistral, and Devstral 2. It handles instruction-following, reasoning, and coding in a single set of weights with a 256K context window.
What makes this conversion different:
Vision included. The Pixtral vision encoder is kept at full precision (not quantized), so image understanding is identical to the original FP8 model.
Thinking mode. Set
reasoning_effort="high"for complex tasks — the model emits[THINK]...[/THINK]blocks with its reasoning chain before answering.Tool calling. Native function calling with
[TOOL_CALLS]tokens. Works with the included chat template.
Quick start
Text
from mlx_lm import load, generate
model, tokenizer = load("froggeric/Mistral-Medium-3.5-128B-MLX-4bit")
response = generate(model, tokenizer, prompt="Hello", max_tokens=256, temp=0.7)
print(response)
Vision
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
model, processor = load("froggeric/Mistral-Medium-3.5-128B-MLX-4bit")
image = ["path/to/image.jpg"]
prompt = "Describe this image."
formatted = apply_chat_template(processor, model.config, prompt, num_images=len(image))
result = generate(model, processor, formatted, image, max_tokens=256, temp=0.7)
print(result.text)
CLI
# Text
mlx_lm.generate \
--model froggeric/Mistral-Medium-3.5-128B-MLX-4bit \
--prompt "Hello"
# Vision
mlx_vlm.generate \
--model froggeric/Mistral-Medium-3.5-128B-MLX-4bit \
--image image.jpg --prompt "Describe this image"
Requirements: mlx-lm >= 0.31.2, mlx-vlm >= 0.4.4
System prompt
The model ships with a default system prompt (SYSTEM_PROMPT.txt) that includes date awareness and tool calling instructions. You can override it — the chat template injects the default only when no system message is provided.
Thinking mode
Set reasoning_effort in your chat template call:
reasoning_effort="none"— fast instant replyreasoning_effort="high"— deep reasoning with[THINK]...[/THINK]blocks
If your runtime doesn't support passing template variables (e.g. LM Studio community models), add reasoning_effort=high to your system prompt. This activates thinking mode the same way.
Sampling
From Mistral AI and Unsloth recommendations.
Reasoning mode (reasoning_effort="high")
For complex prompts, coding, research, math, and agentic usage:
| Parameter | Value |
|---|---|
| temperature | 0.7 |
| top_p | 0.95 |
| top_k | 20 |
Non-reasoning mode (reasoning_effort="none")
For fast replies, chat, extraction, and simple instructions:
| Parameter | Value |
|---|---|
| temperature | 0.0–0.7 |
| top_p | 0.8 |
Lower temperature for deterministic/factual output, higher for creative tasks. Maximum context length: 262,144.
This conversion
| Source | mistralai/Mistral-Medium-3.5-128B (FP8 static, 133.6 GB) |
| Quantization | 4-bit affine, group_size=64 (~70 GB across 15 shards) |
| Vision encoder | Unquantized (Pixtral, 48 layers, full BF16) |
| Projector + lm_head | Unquantized |
| Minimum RAM | ~72 GB (70 GB weights + overhead) |
Architecture details
| Spec | Value |
|---|---|
| Architecture | Dense — 127.7B params, all active per token |
| Text model | Ministral 3, 88 layers |
| Attention | 96 Q heads, 8 KV heads (GQA), head_dim 128 |
| FFN | intermediate_size 28672 |
| Context | 262K native (YaRN, factor 64) |
| RoPE | theta 1M, YaRN scaling |
| Vocab | 131K tokens (Tekken tokenizer) |
| Vision encoder | Pixtral — 48 layers, 1664 hidden, 16 heads |
| Image processing | 1540px max, patch_size 14, spatial_merge_size 2 |
| model_type | mistral3 |
Conversion notes
mlx-vlm's mistral3 sanitize function had a bug where it did not strip the model. prefix from model.vision_tower.* and model.multi_modal_projector.* keys in the source safetensors. This caused 438 parameters to be rejected during weight loading. The fix: replace model.vision_tower → vision_tower.vision_model (strip prefix + add nesting) and add a model.multi_modal_projector → multi_modal_projector case. Applied locally to mlx_vlm/models/mistral3/mistral3.py before conversion.
Credits
| Role | Author |
|---|---|
| Original model | Mistral AI |
| MLX 4-bit conversion | froggeric |
License
Modified MIT License — free for commercial and non-commercial use, with a revenue cap of $20M/month. See the full license.
- Downloads last month
- 2,094
4-bit
Model tree for froggeric/Mistral-Medium-3.5-128B-MLX-4bit
Base model
mistralai/Mistral-Medium-3.5-128B