How speaker diarization evolved recently: from clustering to end-to-end approaches

Speaker: Alicia Lozano Díez.

Abstract: Speaker diarization systems aim to segment a multi-speaker audio recording according to speaker changes, providing the time stamps of each speaker's activity, including segments where nobody speaks and segments where more than one speaker is talking (overlapped speech). Generalizing these systems so that they perform well across very different acoustic domains remains a challenge and an active research topic. Recently, end-to-end neural diarization (EEND) systems have emerged to address the task by optimizing the model as a whole, in contrast to modular systems based on clustering of speaker embeddings, where each module is optimized independently. In this talk, I will review the main advances in speaker diarization, focusing on EEND approaches, including our most recent work in this line, our participation in speaker diarization challenges such as DIHARD and CHiME, and the ongoing work and current challenges we are facing.