VIDEOP2R: Video Understanding from Perception to Reasoning Paper • 2511.11113 • Published Nov 14, 2025 • 112
STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence Paper • 2510.24693 • Published Oct 28, 2025 • 18
Efficient Multi-modal Large Language Models via Progressive Consistency Distillation Paper • 2510.00515 • Published Oct 1, 2025 • 39
StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs Paper • 2509.22220 • Published Sep 26, 2025 • 65
Beyond Transcription: Mechanistic Interpretability in ASR Paper • 2508.15882 • Published Aug 21, 2025 • 86
Representing Speech Through Autoregressive Prediction of Cochlear Tokens Paper • 2508.11598 • Published Aug 15, 2025 • 17
BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing Paper • 2506.17450 • Published Jun 20, 2025 • 64
LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM Paper • 2503.04724 • Published Mar 6, 2025 • 72
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference Paper • 2502.18411 • Published Feb 25, 2025 • 74
Slamming: Training a Speech Language Model on One GPU in a Day Paper • 2502.15814 • Published Feb 19, 2025 • 69
Soundwave: Less is More for Speech-Text Alignment in LLMs Paper • 2502.12900 • Published Feb 18, 2025 • 86
AuraFusion360: Augmented Unseen Region Alignment for Reference-based 360° Unbounded Scene Inpainting Paper • 2502.05176 • Published Feb 7, 2025 • 39
Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis Paper • 2502.04128 • Published Feb 6, 2025 • 27
view article Article Recipe: Preparing Multilingual Speech Datasets for TTS Training Nov 4, 2024 • 20
MinMo: A Multimodal Large Language Model for Seamless Voice Interaction Paper • 2501.06282 • Published Jan 10, 2025 • 52
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching Paper • 2410.06885 • Published Oct 9, 2024 • 46