AudioMCQ-Weak-to-Strong
[2026.04] Update on MMSU Metric
Based on community feedback, we identified a flaw in our evaluation script that artificially inflated the MMSU scores of our released models by ignoring sequence order. We sincerely apologize for this oversight and any inconvenience it may have caused the research community. Crucially, our AudioMCQ training data, the paper's conclusions regarding audio contribution, and the MMAR/MMAU metrics remain completely unaffected. When comparing against our work, we recommend reporting the MMAR/MMAU results or re-evaluating our published checkpoints with your own exact-match algorithm.
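For reference, a minimal exact-match scorer of the kind we recommend, assuming answers are wrapped in <answer> </answer> tags as described in the Input Format section below (this is our own sketch, not the original evaluation script):

import re

def extract_answer(output: str) -> str:
    # Pull the content of the final <answer> ... </answer> span, if any.
    matches = re.findall(r"<answer>\s*(.*?)\s*</answer>", output, re.DOTALL)
    return matches[-1].strip() if matches else output.strip()

def exact_match(output: str, gold: str) -> bool:
    # Compare the full normalized strings, so sequence order matters.
    return extract_answer(output).lower() == gold.strip().lower()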
Overview
This repository contains the Weak-to-Strong model checkpoint from our paper "Measuring Audio's Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models". This model demonstrates state-of-the-art performance on audio question-answering benchmarks through our novel audio-contribution-aware post-training approach.
Training Paradigm
The Weak-to-Strong training paradigm follows a two-stage approach:
Stage 1: SFT on weak audio-contribution data
Stage 2: GRPO (RL) on strong audio-contribution data
This paradigm begins with supervised fine-tuning on samples with weak audio contribution (where visual or textual cues provide substantial information), then applies reinforcement learning on challenging strong audio-contribution samples to enhance audio-specific understanding capabilities.
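As a schematic illustration of the data split driving the two stages, consider the sketch below. Here answerable_without_audio is a hypothetical oracle (e.g., a text-only pass over the question), not the paper's exact measurement; see the paper for how audio contribution is actually computed.

def partition_by_audio_contribution(samples, answerable_without_audio):
    """Split MCQ samples by how much the audio contributes to correctness."""
    weak, strong = [], []
    for sample in samples:
        if answerable_without_audio(sample):
            weak.append(sample)    # textual cues suffice -> Stage 1 (SFT)
        else:
            strong.append(sample)  # audio is required    -> Stage 2 (GRPO)
    return weak, strong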
Model Details
- Base Model: Qwen2.5-Omni
- Training Data: AudioMCQ Dataset (571k samples)
- Training Stages:
  - Stage 1 (SFT): Weak audio-contribution subset
  - Stage 2 (GRPO): Strong audio-contribution subset
- System Prompt: "You are an audio understanding model that answers multiple choice questions based on audio content."
Usage
Our model loading and usage methods are identical to those of Qwen2.5-Omni. Please refer to the official documentation.
Input Format (Updated on 2026-03-08)
The evaluation input prompt structure is:
[Question] Please choose the answer from the following options: ['Option1', 'Option2', 'Option3', 'Option4']. Output the final answer in <answer> </answer>.
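For instance, a small helper (hypothetical, not part of our release) that renders a question and its option list into this exact format:

def build_prompt(question: str, options: list[str]) -> str:
    return (
        f"{question} Please choose the answer from the following options: "
        f"{options}. Output the final answer in <answer> </answer>."
    )

# build_prompt("What instrument is playing?", ["Piano", "Violin", "Drums", "Flute"])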
Example Usage
# Load model following Qwen2.5-Omni documentation
# Apply system prompt: "You are an audio understanding model that answers multiple choice questions based on audio content."
# Format your question with the input structure above
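Expanding the outline above into a runnable sketch, using the standard Qwen2.5-Omni transformers API (Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor, and the qwen-omni-utils helper). The checkpoint id is assumed to be this repository's, and the question, options, and audio path are illustrative placeholders:

from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

MODEL_ID = "inclusionAI/AudioMCQ-Weak-To-Strong"  # this repository's checkpoint

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)

SYSTEM_PROMPT = (
    "You are an audio understanding model that answers multiple choice "
    "questions based on audio content."
)
question = "Which instrument carries the melody?"  # illustrative
options = ["Piano", "Violin", "Drums", "Flute"]    # illustrative
prompt = (
    f"{question} Please choose the answer from the following options: "
    f"{options}. Output the final answer in <answer> </answer>."
)

conversation = [
    {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
    {"role": "user", "content": [
        {"type": "audio", "audio": "sample.wav"},  # path or URL to the clip
        {"type": "text", "text": prompt},
    ]},
]

# Standard Qwen2.5-Omni preprocessing: render the chat template, then
# collect the multimodal inputs referenced in the conversation.
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
inputs = processor(text=text, audio=audios, images=images, videos=videos,
                   return_tensors="pt", padding=True)
inputs = inputs.to(model.device).to(model.dtype)

# return_audio=False skips speech synthesis; only the text answer is needed.
text_ids = model.generate(**inputs, use_audio_in_video=False, return_audio=False)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])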
Performance
The Weak-to-Strong model achieves competitive performance across multiple benchmarks:
- MMAU-test-mini: Strong accuracy on general audio understanding
- MMAR: Robust performance on audio reasoning tasks spanning speech, sound, and music
- MMSU: Solid results on speech understanding
- Strong Audio-Contribution Splits: Enhanced performance on challenging samples requiring deep audio understanding
For detailed performance metrics and comparisons, please refer to our paper.
Related Resources
- AudioMCQ Dataset: https://huggingface.co/datasets/inclusionAI/AudioMCQ
- Mixed-to-Strong Checkpoint: https://huggingface.co/inclusionAI/AudioMCQ-Mixed-To-Strong
- Paper: arXiv:2509.21060
- DCASE 2025 Challenge: http://dcase.community/challenge2025/
Citation
If you find this model useful in your research, please cite:
@inproceedings{he2025audiomcq,
  title     = {Measuring Audio's Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models},
  author    = {He, Haolin and others},
  booktitle = {Proceedings of the International Conference on Learning Representations (ICLR)},
  year      = {2026}
}
Contact
- Haolin He: harlandzzc@link.cuhk.edu.hk
Acknowledgements
We thank the organizers of DCASE 2025 and the research community for their valuable feedback and support.