Phi-4-mini-instruct is quantized by the PyTorch team using PARQ (paper), a quantization-aware training algorithm in torchao. The model has 2-bit linear weights, 4-bit embeddings, and 8-bit dynamic activations, and is suitable for mobile deployment with ExecuTorch.
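For intuition, here is a minimal sketch of symmetric per-row fake quantization, the kind of low-bit weight representation described above. It is illustrative only and is not the PARQ algorithm itself; the function name and rounding scheme are assumptions.

import torch

def fake_quantize_per_row(weight: torch.Tensor, bits: int) -> torch.Tensor:
    # One scale per output row; e.g. qmax = 1 for 2-bit, 7 for 4-bit.
    qmax = 2 ** (bits - 1) - 1
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.round(weight / scale).clamp(-qmax - 1, qmax)  # integer codes
    return q * scale  # dequantized values used in QAT's forward pass

w = torch.randn(4, 8)
print(fake_quantize_per_row(w, bits=2))  # entries snap to a few per-row levels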

We provide the quantized pte for direct use in ExecuTorch. (The provided pte file is exported with a max_context_length of 1024. If you wish to change this, re-export the quantized model following the instructions in Exporting to ExecuTorch.)

Running in a Mobile App

The pte file can be run with ExecuTorch on a mobile phone; see the iOS instructions for how to do this. On an iPhone 15 Pro, the model runs at 27 tokens/second and uses 1453 MB of memory.

Quantization Recipe

Install uv by following the instructions at https://docs.astral.sh/uv/getting-started/installation.

uv venv ~/.uv-hf --python 3.13
source ~/.uv-hf/bin/activate
uv pip install transformers==4.56.2 'trl[vllm]==0.23.1' tensorboard
uv pip install --pre --index-url https://download.pytorch.org/whl/nightly/cu126 torchao
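
As a quick sanity check (not part of the original recipe), confirm that the pinned transformers and the torchao nightly installed into the new environment:

source ~/.uv-hf/bin/activate
python -c "import torch, torchao, transformers; print(torch.__version__, torchao.__version__, transformers.__version__)"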

QAT Finetuning with PARQ

We apply QAT with an optimizer-only package called PARQ. The script below finetunes Phi-4-mini-instruct with QAT, using 2-bit weight quantization and 4-bit embedding quantization, both at per-row granularity. Do the following before running it:

  1. curl -O https://huggingface.co/datasets/pytorch/parq-sft/resolve/main/qat_sft.py
  2. Set dataset_name to your desired dataset from the Hugging Face datasets hub, and set max_steps to the number of training steps you want.
source ~/.uv-hf/bin/activate

SEED=$RANDOM
SAVE_DIR=checkpoints/phi-4-mini-2wei-4emb-${SEED}

dataset_name=<TODO>
max_steps=<TODO>
ngpu=8
device_batch_size=4
grad_accum_steps=2
lr=3e-5
TOKENIZERS_PARALLELISM=$(( ngpu == 1 ))  \
  PYTORCH_ALLOC_CONF=expandable_segments:True \
  torchrun \
  --nproc-per-node $ngpu \
  --rdzv-id $SEED \
  --rdzv-backend c10d \
  --rdzv-endpoint localhost:$(shuf -i 29000-29500 -n 1) \
  -m qat_sft \
  --model_name_or_path microsoft/Phi-4-mini-instruct \
  --bf16 true \
  --num_train_epochs 1 \
  --per_device_train_batch_size $device_batch_size \
  --gradient_accumulation_steps $grad_accum_steps \
  --dataset_name $dataset_name \
  --dataloader_num_workers 4 \
  --max_length 4096 \
  --max_steps $max_steps \
  --report_to tensorboard \
  --learning_rate $lr \
  --lr_scheduler_type linear \
  --warmup_ratio 0.0 \
  --seed $SEED \
  --output_dir $SAVE_DIR \
  --weight_bits 2 \
  --linear_pat 'proj\.weight$' \
  --embed_bits 4 \
  --embed_pat '(lm_head|embed_tokens)'

To export the finetuned model, rerun the above script on a single GPU with --resume_from_checkpoint ${SAVE_DIR}/checkpoint-{SAVE_STEP}. The exported model will be saved to ${SAVE_DIR}/quant_converted.

Generation from Quantized Model

from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

set_seed(0)
# Path to the finetuned checkpoint, i.e. the SAVE_DIR from the QAT script above
model_path = "checkpoints/phi-4-mini-2wei-4emb-<SEED>"
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Manual testing
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [{"role": "user", "content": prompt}]
templated_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(templated_prompt, return_tensors="pt").to(model.device)
inputs.pop("token_type_ids", None)

start_idx = len(inputs.input_ids[0])
response_ids = model.generate(**inputs, max_new_tokens=256)[0]
response_ids = response_ids[start_idx:].tolist()
output_text = tokenizer.decode(response_ids, skip_special_tokens=True)
print(output_text)

Model Quality

We rely on lm-evaluation-harness to evaluate the quality of the quantized model.

Evaluation command for the table below:

lm_eval \
  --model hf \
  --model_args pretrained=$SAVE_DIR,dtype=auto \
  --tasks arc_easy,arc_challenge,boolq,hellaswag,mathqa,openbookqa,piqa,social_iqa,winogrande \
  --output_path ${SAVE_DIR}/eval_results.json \
  --batch_size auto \
  --trust_remote_code

Note: exact numbers may vary slightly based on your machine's chosen batch size.

Task            Phi-4-mini-instruct   4-bit PTQ   2-bit QAT
arc_easy        80.30                 74.28       68.98
arc_challenge   58.45                 52.65       43.17
boolq           83.46                 69.11       71.50
hellaswag       72.76                 68.97       62.10
mathqa          41.27                 38.12       32.76
openbookqa      41.80                 39.80       38.40
piqa            78.29                 76.22       73.83
social_iqa      49.64                 45.55       46.93
winogrande      71.51                 68.67       64.48

Exporting to ExecuTorch

⚠️ Note: These instructions only work on Arm-based machines. Running them on x86_64 will fail.

We can run the 2-bit quantized model on a mobile phone using ExecuTorch, the PyTorch solution for mobile deployment.

To set up ExecuTorch with TorchAO lowbit kernels, run the following commands:

git clone https://github.com/pytorch/executorch.git
pushd executorch
git submodule update --init --recursive
python install_executorch.py
USE_CPP=1 TORCHAO_BUILD_KLEIDIAI=1 pip install third-party/ao
popd
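
As an optional sanity check (this command is an assumption, not part of the ExecuTorch setup instructions), confirm that both packages import from the environment you just built:

python -c "import executorch, torchao; print('executorch + torchao import OK')"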

(The above command works on an Arm-based Mac. To build on Arm-based Linux instead, define the following environment variables before pip installing third-party/ao: BUILD_TORCHAO_EXPERIMENTAL=1 TORCHAO_BUILD_CPU_AARCH64=1 TORCHAO_BUILD_KLEIDIAI=1 TORCHAO_ENABLE_ARM_NEON_DOT=1 TORCHAO_PARALLEL_BACKEND=OPENMP.)
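
For example, the Arm-based Linux install step might look like the following (a sketch that keeps USE_CPP=1 from the Mac command above; run it from inside the executorch checkout):

USE_CPP=1 \
  BUILD_TORCHAO_EXPERIMENTAL=1 \
  TORCHAO_BUILD_CPU_AARCH64=1 \
  TORCHAO_BUILD_KLEIDIAI=1 \
  TORCHAO_ENABLE_ARM_NEON_DOT=1 \
  TORCHAO_PARALLEL_BACKEND=OPENMP \
  pip install third-party/ao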

Now we export the model to ExecuTorch using the TorchAO lowbit kernel backend. (Do not run these commands from a directory containing the ExecuTorch repo you cloned during setup; otherwise Python will pick up the local paths in the repo instead of the installed packages.)

# 1. Download QAT'd weights from HF
HF_DIR=pytorch/Phi-4-mini-instruct-parq-2w-4e-shared
WEIGHT_DIR=$(hf download ${HF_DIR})

# 2. Rename the weight keys to ones that ExecuTorch expects
python -m executorch.examples.models.phi_4_mini.convert_weights $WEIGHT_DIR pytorch_model_converted.bin

# 3. Download model config from the ExecuTorch repo
curl -L -o phi_4_mini_config.json https://raw.githubusercontent.com/pytorch/executorch/main/examples/models/phi_4_mini/config/config.json

# 4. Export the model to ExecuTorch pte file
python -m executorch.examples.models.llama.export_llama \
  --model "phi_4_mini" \
  --checkpoint pytorch_model_converted.bin \
  --params phi_4_mini_config.json \
  --output_name phi4_model_2bit.pte \
  -kv \
  --use_sdpa_with_kv_cache \
  --use-torchao-kernels \
  --max_context_length 1024 \
  --max_seq_length 256 \
  --dtype fp32 \
  --metadata '{"get_bos_id":199999, "get_eos_ids":[200020,199999]}'

# 5. (optional) Upload pte file to HuggingFace
# hf upload ${HF_DIR} phi4_model_2bit.pte

Once you have the *.pte file, you can run it inside our iOS demo app in a few easy steps.
