Phi-4-mini-instruct is quantized by the PyTorch team using PARQ (paper), a quantization-aware training algorithm in torchao. The model has 2-bit weights in its linear layers, 4-bit embeddings, and 8-bit dynamic activations, making it suitable for mobile deployment with ExecuTorch.
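For orientation, this scheme roughly corresponds to the following torchao quantization configs. This is a minimal sketch only: the released checkpoint was produced with PARQ QAT as described below, and the config names and arguments here are best-effort assumptions about torchao's API, not the exact recipe used.

```python
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_
from torchao.quantization.granularity import PerAxis
from torchao.quantization.quant_api import (
    Int8DynamicActivationIntxWeightConfig,
    IntxWeightOnlyConfig,
)

model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-mini-instruct")

# 4-bit weight-only quantization for the embedding table (per-row)
quantize_(
    model,
    IntxWeightOnlyConfig(weight_dtype=torch.int4, granularity=PerAxis(0)),
    filter_fn=lambda m, fqn: isinstance(m, torch.nn.Embedding),
)

# 2-bit weights with 8-bit dynamically quantized activations
# for the linear layers (per-row)
quantize_(
    model,
    Int8DynamicActivationIntxWeightConfig(
        weight_dtype=torch.int2, weight_granularity=PerAxis(0)
    ),
)
```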
We provide the quantized `.pte` file for direct use in ExecuTorch. (The provided `.pte` file is exported with a `max_context_length` of 1024. If you wish to change this, re-export the quantized model following the instructions in Exporting to ExecuTorch below.)
## Running in a Mobile App
The `.pte` file can be run with ExecuTorch on a mobile phone; see the instructions for doing this on iOS. On an iPhone 15 Pro, the model runs at 27 tokens/second and uses 1453 MB of memory.
## Quantization Recipe
Install uv by following https://docs.astral.sh/uv/getting-started/installation, then set up the environment:

```bash
uv venv ~/.uv-hf --python 3.13
source ~/.uv-hf/bin/activate
uv pip install transformers==4.56.2 'trl[vllm]==0.23.1' tensorboard
uv pip install --pre --index-url https://download.pytorch.org/whl/nightly/cu126 torchao
```
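To confirm the nightly torchao wheel installed correctly, a quick optional check:

```python
import torch
import torchao

# Both should print without error; torchao nightly versions
# typically look like 0.x.y.devYYYYMMDD
print(torch.__version__)
print(torchao.__version__)
```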
### QAT Finetuning with PARQ
We apply QAT with an optimizer-only package called PARQ (a minimal sketch of the optimizer API follows the list below). The script below finetunes Phi-4-mini-instruct with 2-bit weight quantization and 4-bit embedding quantization using QAT, both at per-row granularity. Do the following before running it:
- Download the training script: `curl -O https://huggingface.co/datasets/pytorch/parq-sft/resolve/main/qat_sft.py`
- Set `dataset_name` to your desired dataset from the HuggingFace datasets hub, in addition to `max_steps`.
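For context, PARQ performs QAT entirely inside the optimizer: a `QuantOptimizer` wraps a standard PyTorch optimizer with a quantizer and a proximal map, and the model code is left unmodified. A minimal sketch based on the torchao PARQ README, using a toy model and placeholder annealing values:

```python
import torch
import torch.nn as nn
from torchao.prototype.parq.optim import ProxPARQ, QuantOptimizer
from torchao.prototype.parq.quant import UnifQuantizer

# Toy model standing in for Phi-4-mini
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))

# Split parameters into a 2-bit quantized group (weights) and a
# full-precision group (biases), mirroring the script's regex split
params_quant = [p for n, p in model.named_parameters() if n.endswith("weight")]
params_no_quant = [p for n, p in model.named_parameters() if n.endswith("bias")]
param_groups = [
    {"params": params_quant, "quant_bits": 2},
    {"params": params_no_quant},
]

base_optimizer = torch.optim.AdamW(param_groups, lr=3e-5)

# The proximal map anneals weights onto the quantization grid between
# anneal_start and anneal_end optimizer steps (placeholder values here)
optimizer = QuantOptimizer(
    base_optimizer,
    UnifQuantizer(),
    ProxPARQ(anneal_start=0, anneal_end=1000, steepness=10),
)

# Used like any other optimizer in the training loop
loss = model(torch.randn(8, 16)).square().mean()
loss.backward()
optimizer.step()
```

With that background, the full finetuning script: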
```bash
source ~/.uv-hf/bin/activate

SEED=$RANDOM
SAVE_DIR=checkpoints/phi-4-mini-2wei-4emb-${SEED}
dataset_name=<TODO>
max_steps=<TODO>
ngpu=8
device_batch_size=4
grad_accum_steps=2
lr=3e-5

TOKENIZERS_PARALLELISM=$(( ngpu == 1 )) \
PYTORCH_ALLOC_CONF=expandable_segments:True \
torchrun \
    --nproc-per-node $ngpu \
    --rdzv-id $SEED \
    --rdzv-backend c10d \
    --rdzv-endpoint localhost:$(shuf -i 29000-29500 -n 1) \
    -m qat_sft \
    --model_name_or_path microsoft/Phi-4-mini-instruct \
    --bf16 true \
    --num_train_epochs 1 \
    --per_device_train_batch_size $device_batch_size \
    --gradient_accumulation_steps $grad_accum_steps \
    --dataset_name $dataset_name \
    --dataloader_num_workers 4 \
    --max_length 4096 \
    --max_steps $max_steps \
    --report_to tensorboard \
    --learning_rate $lr \
    --lr_scheduler_type linear \
    --warmup_ratio 0.0 \
    --seed $SEED \
    --output_dir $SAVE_DIR \
    --weight_bits 2 \
    --linear_pat 'proj\.weight$' \
    --embed_bits 4 \
    --embed_pat '(lm_head|embed_tokens)'
```
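The `--linear_pat` and `--embed_pat` flags are regexes over parameter names that decide which tensors get the 2-bit and 4-bit treatment. The parameter names below are representative of Phi-4-mini's module layout and are shown for illustration only:

```python
import re

# Representative Phi-4-mini parameter names (illustrative, not exhaustive)
names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.qkv_proj.weight",
    "model.layers.0.self_attn.o_proj.weight",
    "model.layers.0.mlp.gate_up_proj.weight",
    "model.layers.0.mlp.down_proj.weight",
    "model.layers.0.input_layernorm.weight",
    "lm_head.weight",
]

linear_pat = re.compile(r"proj\.weight$")
embed_pat = re.compile(r"(lm_head|embed_tokens)")

# 2-bit linears: the *_proj weights; layernorms stay in full precision
print([n for n in names if linear_pat.search(n)])
# 4-bit embeddings: the input embedding and the tied lm_head
print([n for n in names if embed_pat.search(n)])
```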
To export the finetuned model, rerun the above script on a single GPU with `--resume_from_checkpoint ${SAVE_DIR}/checkpoint-{SAVE_STEP}`. The exported model will be saved to `${SAVE_DIR}/quant_converted`.
## Generation from Quantized Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

set_seed(0)

# SAVE_DIR is the output directory from the finetuning step above
SAVE_DIR = "checkpoints/phi-4-mini-2wei-4emb-<SEED>"
model_path = f"{SAVE_DIR}"

model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Manual testing
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [{"role": "user", "content": prompt}]
templated_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(templated_prompt, return_tensors="pt").to(model.device)
inputs.pop("token_type_ids", None)

# Decode only the newly generated tokens, not the prompt
start_idx = len(inputs.input_ids[0])
response_ids = model.generate(**inputs, max_new_tokens=256)[0]
response_ids = response_ids[start_idx:].tolist()
output_text = tokenizer.decode(response_ids, skip_special_tokens=True)
print(output_text)
```
## Model Quality
We rely on lm-evaluation-harness to evaluate the quality of the quantized model.
Evaluation command for the table below:

```bash
lm_eval \
    --model hf \
    --model_args pretrained=$SAVE_DIR,dtype=auto \
    --tasks arc_easy,arc_challenge,boolq,hellaswag,mathqa,openbookqa,piqa,social_iqa,winogrande \
    --output_path ${SAVE_DIR}/eval_results.json \
    --batch_size auto \
    --trust_remote_code
```
Note: exact numbers may vary slightly based on your machine's chosen batch size.
| Task | Phi-4-mini-instruct | 4-bit PTQ | 2-bit QAT |
|---|---|---|---|
| arc_easy | 80.30 | 74.28 | 68.98 |
| arc_challenge | 58.45 | 52.65 | 43.17 |
| boolq | 83.46 | 69.11 | 71.50 |
| hellaswag | 72.76 | 68.97 | 62.10 |
| mathqa | 41.27 | 38.12 | 32.76 |
| openbookqa | 41.80 | 39.80 | 38.40 |
| piqa | 78.29 | 76.22 | 73.83 |
| social_iqa | 49.64 | 45.55 | 46.93 |
| winogrande | 71.51 | 68.67 | 64.48 |
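If you prefer to drive the harness from Python, lm-evaluation-harness also exposes a programmatic entry point. A hedged sketch using `lm_eval.simple_evaluate` (the checkpoint path is a placeholder; check your installed version's API):

```python
import lm_eval

# Mirrors the CLI invocation above; point model_args at your SAVE_DIR
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=checkpoints/phi-4-mini-2wei-4emb-<SEED>,dtype=auto",
    tasks=["arc_easy", "arc_challenge", "boolq"],
    batch_size="auto",
)
for task, metrics in results["results"].items():
    print(task, metrics)
```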
## Exporting to ExecuTorch
⚠️ Note: These instructions only work on Arm-based machines. Running them on x86_64 will fail.
We can run the 2-bit quantized model on a mobile phone using ExecuTorch, the PyTorch solution for mobile deployment.
To set up ExecuTorch with TorchAO lowbit kernels, run the following commands:

```bash
git clone https://github.com/pytorch/executorch.git
pushd executorch
git submodule update --init --recursive
python install_executorch.py
USE_CPP=1 TORCHAO_BUILD_KLEIDIAI=1 pip install third-party/ao
popd
```
(The above commands work on an Arm-based Mac. On Arm-based Linux, define the following environment variables before pip installing `third-party/ao`: `BUILD_TORCHAO_EXPERIMENTAL=1 TORCHAO_BUILD_CPU_AARCH64=1 TORCHAO_BUILD_KLEIDIAI=1 TORCHAO_ENABLE_ARM_NEON_DOT=1 TORCHAO_PARALLEL_BACKEND=OPENMP`.)
Now we export the model to ExecuTorch using the TorchAO lowbit kernel backend. (Do not run these commands from a directory containing the ExecuTorch repo you cloned during setup; otherwise Python will pick up the local repo paths instead of the installed packages.)
```bash
# 1. Download QAT'd weights from HF
HF_DIR=pytorch/Phi-4-mini-instruct-parq-2w-4e-shared
WEIGHT_DIR=$(hf download ${HF_DIR})

# 2. Rename the weight keys to ones that ExecuTorch expects
python -m executorch.examples.models.phi_4_mini.convert_weights $WEIGHT_DIR pytorch_model_converted.bin

# 3. Download model config from the ExecuTorch repo
curl -L -o phi_4_mini_config.json https://raw.githubusercontent.com/pytorch/executorch/main/examples/models/phi_4_mini/config/config.json

# 4. Export the model to an ExecuTorch pte file
python -m executorch.examples.models.llama.export_llama \
    --model "phi_4_mini" \
    --checkpoint pytorch_model_converted.bin \
    --params phi_4_mini_config.json \
    --output_name phi4_model_2bit.pte \
    -kv \
    --use_sdpa_with_kv_cache \
    --use-torchao-kernels \
    --max_context_length 1024 \
    --max_seq_length 256 \
    --dtype fp32 \
    --metadata '{"get_bos_id":199999, "get_eos_ids":[200020,199999]}'

# 5. (optional) Upload pte file to HuggingFace
# hf upload ${HF_DIR} phi4_model_2bit.pte
```
Once you have the `.pte` file, you can run it in our iOS demo app in a few easy steps.
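Before moving the file to a device, you can sanity-check that the `.pte` loads with ExecuTorch's runtime Python API. A minimal sketch (this only loads the program and its forward method; it does not run tokenized generation):

```python
from executorch.runtime import Runtime

# Load the exported program and list its methods
runtime = Runtime.get()
program = runtime.load_program("phi4_model_2bit.pte")
print("Methods:", program.method_names)  # expect ['forward']

# Loading the method verifies the program and plans its memory
method = program.load_method("forward")
print("Loaded forward method OK")
```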