Align-TI: Distilling Multimodal Large Language Models via Token Interactions

This repository contains a model distilled using Align-TI, a novel knowledge distillation (KD) framework introduced in the paper Beyond Next-Token Alignment: Distilling Multimodal Large Language Models via Token Interactions.

Overview

Align-TI is designed to compress Multimodal Large Language Models (MLLMs) by focusing on dynamic token interactions rather than just static next-token alignment. It introduces two primary components:

  • Instruction-aware Vision Alignment (IVA): Enables the student model to imitate the teacher's ability to extract instruction-relevant visual information by aligning on salient visual regions.
  • Transition Probability Alignment (TPA): Captures the teacher's dynamic generative logic by aligning the sequential token-to-token transition probabilities.
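
The card only names these two components without giving their formulations. As a rough, hypothetical illustration of the idea (not the paper's actual losses), the sketch below compares instruction-conditioned saliency over visual tokens (an IVA-style term) and joint distributions over consecutive token pairs (a TPA-style "transition" term) between teacher and student; all function names, shapes, and the exact loss forms are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kl(p, q, eps=1e-9):
    # KL(p || q) summed over all entries of two probability arrays.
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def iva_loss(instr_t, vis_t, instr_s, vis_s):
    """IVA-style sketch (hypothetical): match the saliency distribution
    the instruction places over visual tokens in teacher vs. student."""
    a_t = softmax(instr_t @ vis_t.T / np.sqrt(vis_t.shape[-1]))
    a_s = softmax(instr_s @ vis_s.T / np.sqrt(vis_s.shape[-1]))
    return kl(a_t, a_s)

def tpa_loss(teacher_logits, student_logits):
    """TPA-style sketch (hypothetical): align a toy 'transition'
    quantity, the joint distribution over consecutive token pairs."""
    p_t = softmax(teacher_logits)  # (seq_len, vocab)
    p_s = softmax(student_logits)
    loss = 0.0
    for i in range(len(p_t) - 1):
        joint_t = np.outer(p_t[i], p_t[i + 1]).ravel()
        joint_s = np.outer(p_s[i], p_s[i + 1]).ravel()
        loss += kl(joint_t, joint_s)
    return loss / (len(p_t) - 1)

# Toy shapes: one instruction vector, 4 visual tokens (dim 8);
# a 5-token sequence over a 6-word vocabulary.
instr = rng.normal(size=(1, 8))
vis = rng.normal(size=(4, 8))
t_logits = rng.normal(size=(5, 6))

print(iva_loss(instr, vis, instr, vis))  # identical inputs -> 0.0
print(tpa_loss(t_logits, t_logits))      # identical logits -> 0.0
```

Both terms vanish when the student reproduces the teacher's distributions exactly, which is the sanity check any alignment loss of this kind should pass.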

The framework achieves state-of-the-art performance for parameter-efficient MLLMs, with the 2B version even outperforming significantly larger models like LLaVA-1.5-7B.

Citation

If you find this work useful, please cite the paper:

@article{chen2026alignti,
  title={Beyond Next-Token Alignment: Distilling Multimodal Large Language Models via Token Interactions},
  author={Lin Chen and Xiaoke Zhao and Kun Ding and Weiwei Feng and Changtao Miao and Zili Wang and Wenxuan Guo and Ying Wang and Kaiyuan Zheng and Bo Zhang and Zhe Li and Shiming Xiang},
  journal={arXiv preprint arXiv:2602.09483},
  year={2026}
}
Model Details

  • Format: Safetensors
  • Model size: 1B params
  • Tensor type: BF16

Model tree for lchen1019/LlavaQwen2-Align-TI-1B

  • Base model: Qwen/Qwen2.5-0.5B (this model is finetuned from it)
