Align-TI: Distilling Multimodal Large Language Models via Token Interactions

This repository contains a model distilled using Align-TI, a novel knowledge distillation (KD) framework introduced in the paper Beyond Next-Token Alignment: Distilling Multimodal Large Language Models via Token Interactions.

Overview

Align-TI is designed to compress Multimodal Large Language Models (MLLMs) by focusing on dynamic token interactions rather than just static next-token alignment. It introduces two primary components:

  • Instruction-aware Vision Alignment (IVA): Enables the student model to imitate the teacher's ability to extract instruction-relevant visual information by aligning on salient visual regions.
  • Transition Probability Alignment (TPA): Captures the teacher's dynamic generative logic by aligning the sequential token-to-token transition probabilities.
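
The card only names these two components without giving their formulations. As a rough, hypothetical illustration of the idea (not the paper's actual losses), the sketch below compares instruction-conditioned saliency over visual tokens (an IVA-style term) and joint distributions over consecutive token pairs (a TPA-style "transition" term) between teacher and student; all function names, shapes, and the exact loss forms are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kl(p, q, eps=1e-9):
    # KL(p || q) summed over all entries of two probability arrays.
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def iva_loss(instr_t, vis_t, instr_s, vis_s):
    """IVA-style sketch (hypothetical): match the saliency distribution
    the instruction places over visual tokens in teacher vs. student."""
    a_t = softmax(instr_t @ vis_t.T / np.sqrt(vis_t.shape[-1]))
    a_s = softmax(instr_s @ vis_s.T / np.sqrt(vis_s.shape[-1]))
    return kl(a_t, a_s)

def tpa_loss(teacher_logits, student_logits):
    """TPA-style sketch (hypothetical): align a toy 'transition'
    quantity, the joint distribution over consecutive token pairs."""
    p_t = softmax(teacher_logits)  # (seq_len, vocab)
    p_s = softmax(student_logits)
    loss = 0.0
    for i in range(len(p_t) - 1):
        joint_t = np.outer(p_t[i], p_t[i + 1]).ravel()
        joint_s = np.outer(p_s[i], p_s[i + 1]).ravel()
        loss += kl(joint_t, joint_s)
    return loss / (len(p_t) - 1)

# Toy shapes: one instruction vector, 4 visual tokens (dim 8);
# a 5-token sequence over a 6-word vocabulary.
instr = rng.normal(size=(1, 8))
vis = rng.normal(size=(4, 8))
t_logits = rng.normal(size=(5, 6))

print(iva_loss(instr, vis, instr, vis))  # identical inputs -> 0.0
print(tpa_loss(t_logits, t_logits))      # identical logits -> 0.0
```

Both terms vanish when the student reproduces the teacher's distributions exactly, which is the sanity check any alignment loss of this kind should pass.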

The framework achieves state-of-the-art performance for parameter-efficient MLLMs, with the 2B version even outperforming significantly larger models like LLaVA-1.5-7B.

Citation

If you find this work useful, please cite the paper:

@article{chen2026alignti,
  title={Beyond Next-Token Alignment: Distilling Multimodal Large Language Models via Token Interactions},
  author={Lin Chen and Xiaoke Zhao and Kun Ding and Weiwei Feng and Changtao Miao and Zili Wang and Wenxuan Guo and Ying Wang and Kaiyuan Zheng and Bo Zhang and Zhe Li and Shiming Xiang},
  journal={arXiv preprint arXiv:2602.09483},
  year={2026}
}
Model Details

  • Format: Safetensors
  • Model size: 1B params
  • Tensor type: BF16

Model tree for lchen1019/LlavaQwen2-Align-TI-1B

  • Base model: Qwen/Qwen2.5-0.5B (this model is finetuned from it)
