Speaker: Sergio Álvarez Balanya
Abstract: End-to-End Neural Diarization (EEND) systems produce frame-level probabilistic speaker activity estimates, yet the reliability of these confidence scores remains largely unexplored. Unlike hard-decision fusion approaches such as DOVER-Lap, working with continuous probability outputs enables more sophisticated calibration and fusion techniques that can leverage model uncertainty and confidence information. This paper presents the first comprehensive framework for calibrating and fusing EEND models at the probability level. We investigate two output formulations—multilabel and powerset representations—and their impact on calibration and fusion effectiveness. Through extensive experiments on CallHome, we demonstrate that proper calibration provides substantial improvements even for individual models (up to 19% relative DER reduction), in some cases mitigating the absence of domain adaptation. We reveal that joint calibration in powerset space consistently outperforms independent per-speaker calibration, and that the Fuse-then-Calibrate ordering generally outperforms calibrating individual models before fusion while requiring calibration of only a single combined model. Our best configuration outperforms DOVER-Lap in terms of DER while providing reliable confidence estimates essential for downstream applications. This work establishes best practices for probability-level fusion of EEND systems and demonstrates the advantages of leveraging soft outputs over hard decisions.
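The joint powerset-space calibration described above can be illustrated with a minimal sketch. Everything here is an illustrative assumption rather than the paper's implementation: the helper names `powerset_classes` and `temperature_scale`, the 3-speaker toy setup, and the choice of a single shared temperature (one of the simplest joint calibration maps) are all hypothetical.

```python
import numpy as np
from itertools import combinations

def powerset_classes(num_speakers, max_active=2):
    # Enumerate speaker subsets up to max_active simultaneous speakers,
    # including the empty set (silence). This is the "powerset" output
    # formulation: one class per subset instead of one binary label
    # per speaker (the "multilabel" formulation).
    classes = [()]
    for k in range(1, max_active + 1):
        classes += list(combinations(range(num_speakers), k))
    return classes

def temperature_scale(logits, T):
    # Joint calibration: a single temperature T shared across all
    # powerset classes, applied before the softmax. Because the softmax
    # couples the classes, this calibrates speaker activities jointly
    # rather than independently per speaker.
    z = logits / T
    z -= z.max(axis=-1, keepdims=True)   # numerical stability
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

# Toy example: frame-level powerset logits for 3 speakers, 10 frames.
rng = np.random.default_rng(0)
classes = powerset_classes(3)            # 7 classes: silence, 3 singles, 3 pairs
logits = rng.standard_normal((10, len(classes)))
probs = temperature_scale(logits, T=1.5)
assert np.allclose(probs.sum(axis=-1), 1.0)
```

In practice the temperature (or a richer calibration map) would be fit on a held-out development set, e.g. by minimizing the negative log-likelihood of reference labels; the sketch only shows why calibrating in powerset space acts on all speakers jointly.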