Speaker: Laura Herrera Alarcón.
Abstract: The emergence of Large Audio Language Models (LALMs) has expanded the ability of LLMs to understand and reason over audio. In response, new benchmarks have been introduced to measure these capabilities. Yet, most rely on multiple-choice formats, offering only a limited view of model performance in real-world, open-ended scenarios.
This presentation introduces MMAU-Pro, a benchmark designed to address these shortcomings. We highlight its key features, including realistic environments, extended audio durations, multi-hop reasoning tasks, and diverse coverage of domains and skills. Moving beyond multiple-choice, we also examine the use of LLM-as-a-Judge for evaluating open-ended responses, yielding a more reliable and comprehensive approach to assessing LALMs.
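Since the talk touches on LLM-as-a-Judge evaluation, below is a minimal sketch of how such scoring is commonly set up. The judge model, prompt wording, and 1-5 rubric here are illustrative assumptions, not the actual MMAU-Pro evaluation protocol.

```python
# A minimal sketch of LLM-as-a-Judge scoring for open-ended audio-QA responses.
# The judge model, prompt, and 1-5 rubric are illustrative assumptions,
# not the MMAU-Pro protocol.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an answer to an audio-understanding question.
Question: {question}
Reference answer: {reference}
Model answer: {candidate}
Rate the model answer from 1 (wrong) to 5 (fully correct and complete).
Reply with only the integer score."""


def judge_response(question: str, reference: str, candidate: str,
                   model: str = "gpt-4o-mini") -> int:
    """Ask a judge LLM to score a free-form answer against a reference."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, reference=reference, candidate=candidate),
        }],
        temperature=0,  # deterministic grading
    )
    return int(completion.choices[0].message.content.strip())


# Example usage with a hypothetical benchmark item:
# score = judge_response("What instrument enters after the drum solo?",
#                        "An electric guitar",
#                        "A guitar comes in after the drums")
```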