AudioMCQ-Weak-to-Strong
[2026.04] Update on MMSU Metric
Based on community feedback, we identified a flaw in our evaluation script that artificially inflated the MMSU scores of our released models by ignoring sequence order. We sincerely apologize for this oversight and any inconvenience it may have caused the research community. Crucially, our AudioMCQ training data, the paper's conclusions regarding audio contribution, and the MMAR/MMAU metrics remain completely unaffected. When comparing against our work, we recommend reporting the MMAR/MMAU results or re-evaluating our published checkpoints with your own exact-match algorithm.
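For reference, a minimal exact-match scorer of the kind we recommend, assuming answers are wrapped in <answer> </answer> tags as described in the Input Format section below (this is our own sketch, not the original evaluation script):

import re

def extract_answer(output: str) -> str:
    # Pull the content of the final <answer> ... </answer> span, if any.
    matches = re.findall(r"<answer>\s*(.*?)\s*</answer>", output, re.DOTALL)
    return matches[-1].strip() if matches else output.strip()

def exact_match(output: str, gold: str) -> bool:
    # Compare the full normalized strings, so sequence order matters.
    return extract_answer(output).lower() == gold.strip().lower()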
Overview
This repository contains the Weak-to-Strong model checkpoint from our paper "Measuring Audio's Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models". This model demonstrates state-of-the-art performance on audio question-answering benchmarks through our novel audio-contribution-aware post-training approach.
Training Paradigm
The Weak-to-Strong training paradigm follows a two-stage approach:
Stage 1: SFT on weak audio-contribution data
Stage 2: GRPO (RL) on strong audio-contribution data
This paradigm begins with supervised fine-tuning on samples with weak audio contribution (where visual or textual cues provide substantial information), then applies reinforcement learning on challenging strong audio-contribution samples to enhance audio-specific understanding capabilities.
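As a schematic illustration of the data split driving the two stages, consider the sketch below. Here answerable_without_audio is a hypothetical oracle (e.g., a text-only pass over the question), not the paper's exact measurement; see the paper for how audio contribution is actually computed.

def partition_by_audio_contribution(samples, answerable_without_audio):
    """Split MCQ samples by how much the audio contributes to correctness."""
    weak, strong = [], []
    for sample in samples:
        if answerable_without_audio(sample):
            weak.append(sample)    # textual cues suffice -> Stage 1 (SFT)
        else:
            strong.append(sample)  # audio is required    -> Stage 2 (GRPO)
    return weak, strong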
Model Details
- Base Model: Qwen2.5-Omni
- Training Data: AudioMCQ Dataset (571k samples)
- Training Stages:
  - Stage 1 (SFT): Weak audio-contribution subset
  - Stage 2 (GRPO): Strong audio-contribution subset
- System Prompt: "You are an audio understanding model that answers multiple choice questions based on audio content."
Usage
Our model loading and usage methods are identical to those of Qwen2.5-Omni. Please refer to the official documentation.
Input Format (Updated on 2026-03-08)
The evaluation input prompt structure is:
[Question] Please choose the answer from the following options: ['Option1', 'Option2', 'Option3', 'Option4']. Output the final answer in <answer> </answer>.
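For instance, a small helper (hypothetical, not part of our release) that renders a question and its option list into this exact format:

def build_prompt(question: str, options: list[str]) -> str:
    return (
        f"{question} Please choose the answer from the following options: "
        f"{options}. Output the final answer in <answer> </answer>."
    )

# build_prompt("What instrument is playing?", ["Piano", "Violin", "Drums", "Flute"])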
Example Usage
# Load model following Qwen2.5-Omni documentation
# Apply system prompt: "You are an audio understanding model that answers multiple choice questions based on audio content."
# Format your question with the input structure above
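Expanding the outline above into a runnable sketch, using the standard Qwen2.5-Omni transformers API (Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor, and the qwen-omni-utils helper). The checkpoint id is assumed to be this repository's, and the question, options, and audio path are illustrative placeholders:

from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

MODEL_ID = "inclusionAI/AudioMCQ-Weak-To-Strong"  # this repository's checkpoint

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)

SYSTEM_PROMPT = (
    "You are an audio understanding model that answers multiple choice "
    "questions based on audio content."
)
question = "Which instrument carries the melody?"  # illustrative
options = ["Piano", "Violin", "Drums", "Flute"]    # illustrative
prompt = (
    f"{question} Please choose the answer from the following options: "
    f"{options}. Output the final answer in <answer> </answer>."
)

conversation = [
    {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
    {"role": "user", "content": [
        {"type": "audio", "audio": "sample.wav"},  # path or URL to the clip
        {"type": "text", "text": prompt},
    ]},
]

# Standard Qwen2.5-Omni preprocessing: render the chat template, then
# collect the multimodal inputs referenced in the conversation.
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
inputs = processor(text=text, audio=audios, images=images, videos=videos,
                   return_tensors="pt", padding=True)
inputs = inputs.to(model.device).to(model.dtype)

# return_audio=False skips speech synthesis; only the text answer is needed.
text_ids = model.generate(**inputs, use_audio_in_video=False, return_audio=False)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])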
Performance
The Weak-to-Strong model achieves competitive performance across multiple benchmarks:
- MMAU-test-mini: Strong accuracy on general audio understanding
- MMAR: Robust performance on audio reasoning tasks spanning speech, sound, and music
- MMSU: Solid results on speech understanding
- Strong Audio-Contribution Splits: Enhanced performance on challenging samples requiring deep audio understanding
For detailed performance metrics and comparisons, please refer to our paper.
Related Resources
- AudioMCQ Dataset: https://huggingface.co/datasets/inclusionAI/AudioMCQ
- Mixed-to-Strong Checkpoint: https://huggingface.co/inclusionAI/AudioMCQ-Mixed-To-Strong
- Paper: arXiv:2509.21060
- DCASE 2025 Challenge: http://dcase.community/challenge2025/
Citation
If you find this model useful in your research, please cite:
@inproceedings{he2025audiomcq,
  title     = {Measuring Audio's Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models},
  author    = {He, Haolin and others},
  booktitle = {Proceedings of the International Conference on Learning Representations (ICLR)},
  year      = {2026}
}
Contact
- Haolin He: harlandzzc@link.cuhk.edu.hk
Acknowledgements
We thank the organizers of DCASE 2025 and the research community for their valuable feedback and support.