Speaker: Juan Ignacio Álvarez Trejos.
Abstract:
The task of speaker diarization has lately been successfully tackled with end-to-end neural diarization (EEND) models instead of modular cascaded ones. Among them, the recently proposed EEND with Perceiver-based attractors (DiaPer) offers a lightweight architecture and promising results on datasets with varying numbers of speakers. This work focuses on adapting DiaPer to short segments (5 min) from the challenging Albayzin 2022 database. We explore fine-tuning with different segment lengths and several data augmentation techniques. The improvements achieved are detailed and analyzed according to the types of audio in the corpus. Finally, we compare the results with two state-of-the-art models, VBx and pyannote.audio. By applying fine-tuning and data augmentation, we successively decrease the DER of the raw DiaPer by 30.55% and 3.75%, respectively.
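Note: the abstract does not specify which augmentation techniques were used or how segments were prepared. As a purely illustrative sketch, the Python snippet below shows two common choices for this kind of fine-tuning setup, additive noise at a target SNR and splitting recordings into fixed-length (e.g. 5 min) segments; the function names are hypothetical and not taken from the work described above.

    # Illustrative sketch only; not the authors' pipeline.
    import numpy as np

    def add_noise(wav: np.ndarray, snr_db: float) -> np.ndarray:
        """Mix white noise into a waveform at a target signal-to-noise ratio (dB)."""
        signal_power = np.mean(wav ** 2)
        noise_power = signal_power / (10 ** (snr_db / 10))
        noise = np.random.randn(len(wav)) * np.sqrt(noise_power)
        return wav + noise

    def chunk_audio(wav: np.ndarray, sr: int, seg_len_s: float = 300.0):
        """Split a recording into fixed-length segments (e.g. 5 min) for fine-tuning."""
        seg = int(seg_len_s * sr)
        return [wav[i:i + seg] for i in range(0, len(wav), seg)]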