Speaker: Laura Herrera Alarcón.
Abstract:
This study introduces an algorithm that matches the speaker labels predicted on short audio segments and merges them into a single final prediction. It extracts an x-vector for each speaker in each segment and applies constrained Agglomerative Clustering to these embeddings. The analysis uses the RTVE2022 dataset, which poses significant challenges for both traditional cascaded and end-to-end models. The algorithm makes it possible to use end-to-end models even if they have not been trained on datasets with as many speakers, leveraging their strengths on shorter segments. To test its efficacy, the VBx, DiaPer, and Pyannote models were compared. Both VBx and DiaPer performed better on short segments: after matching, the relative improvement in DER was 13.2% for VBx and 48.75% for DiaPer. For Pyannote, the algorithm did not improve performance, as the model already implements a similar process.
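
The matching step can be illustrated with a minimal sketch. It is not the implementation from the talk: the abstract does not say which constraint is applied, so the sketch assumes a cannot-link rule (two speakers found in the same segment are never merged), together with cosine distance, average linkage, and a distance threshold of 0.5. The function names are hypothetical, and the x-vectors are assumed to have been extracted already.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """Cosine distance between two embedding vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def match_speakers(items, threshold=0.5):
    """Map per-segment speaker labels to global speaker ids.

    items: list of (segment_id, local_label, embedding) tuples, one per
           speaker per segment.
    Returns a dict {(segment_id, local_label): global_id}.

    Assumed constraint (not stated in the abstract): speakers that come
    from the same segment are distinct, so their clusters never merge.
    """
    # Start with one cluster per (segment, speaker) pair.
    clusters = [[i] for i in range(len(items))]

    def average_linkage(c1, c2):
        dists = [cosine_distance(items[i][2], items[j][2])
                 for i in c1 for j in c2]
        return sum(dists) / len(dists)

    def can_merge(c1, c2):
        segments_1 = {items[i][0] for i in c1}
        segments_2 = {items[j][0] for j in c2}
        return segments_1.isdisjoint(segments_2)

    while True:
        best = None  # (distance, index_a, index_b) of the closest valid pair
        for a, b in combinations(range(len(clusters)), 2):
            if not can_merge(clusters[a], clusters[b]):
                continue
            d = average_linkage(clusters[a], clusters[b])
            if d < threshold and (best is None or d < best[0]):
                best = (d, a, b)
        if best is None:
            break  # no valid pair is closer than the threshold
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # a < b, so index a stays valid

    return {(items[i][0], items[i][1]): global_id
            for global_id, members in enumerate(clusters)
            for i in members}
```

Called with, for example, `match_speakers([(0, "A", [1.0, 0.0]), (0, "B", [0.0, 1.0]), (1, "A", [0.1, 1.0]), (1, "B", [1.0, 0.05])])`, the sketch gives segment 1's local label "A" the same global id as segment 0's "B", because local labels carry no meaning across segments and only the embeddings decide the match. The toy 2-D vectors stand in for real x-vectors, which have far more dimensions.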