Speaker: Laura Herrera Alarcón.

Abstract: This Master's Thesis studies the use of pre-trained models for the diarization task, with the aim of
exploiting their ability to extract robust and discriminative features. In particular, the WavLM model,
which has already proven effective in other speech technology applications such as speech recognition,
has been combined with an end-to-end diarization model representing the current state of the art.
For this purpose, a system has been implemented that combines the outputs of different hidden layers
of WavLM by means of a back-end. The combined representation is then used as input to the
diarization model, providing the system with the information most appropriate for the task.
A lightweight back-end was chosen, rather than jointly retraining both models, because of the high
computational cost that joint training would require.
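
As an illustration, the following is a minimal sketch (not the thesis code) of extracting the per-layer representations from a frozen WavLM model with the HuggingFace Transformers library; the checkpoint name and input shapes are assumptions chosen for illustration only.

```python
import torch
from transformers import WavLMModel

# Load WavLM and freeze it: only the lightweight back-end would be trained.
wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")
wavlm.eval()
for p in wavlm.parameters():
    p.requires_grad = False

waveform = torch.randn(1, 16000)  # dummy 1 s of 16 kHz audio
with torch.no_grad():
    out = wavlm(waveform, output_hidden_states=True)

# Tuple of (num_transformer_layers + 1) tensors, each (batch, frames, hidden_dim);
# these are the per-layer representations that the back-end combines.
hidden_states = torch.stack(out.hidden_states, dim=0)
print(hidden_states.shape)  # e.g. torch.Size([13, 1, 49, 768]) for the base model
```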

To this end, two approaches have been implemented, which differ in how the layers of the WavLM
model are combined. In the first, a back-end performs a weighted sum of the layer outputs, with
trainable weights that allow the system to select the most relevant layers. In the second, a
Multi-Head Factorized Attentive (MHFA) approach uses an attention mechanism to take into account
the information contained in the different representation subspaces.
The experiments carried out allow the effectiveness of the implemented system to be analyzed,
with the weighted sum being the approach that obtains the best results.
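
A hedged sketch of the first approach: a learnable weighted sum over the stacked WavLM layers, with softmax-normalized weights so that training can emphasize the most informative layers. Class and variable names are illustrative, not taken from the thesis.

```python
import torch
import torch.nn as nn

class WeightedLayerSum(nn.Module):
    """Combines stacked WavLM hidden states with one trainable weight per layer."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, frames, hidden_dim)
        w = torch.softmax(self.layer_weights, dim=0)
        return torch.einsum("l,lbfd->bfd", w, hidden_states)

# Usage: the combined features stand in for the acoustic features normally fed
# to the end-to-end diarization model.
backend = WeightedLayerSum(num_layers=13)
dummy = torch.randn(13, 2, 49, 768)   # stacked WavLM layers (dummy values)
features = backend(dummy)             # (batch, frames, hidden_dim)
```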
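For the second approach, the sketch below loosely follows Multi-Head Factorized Attentive (MHFA) pooling as proposed for speaker verification: separate learnable layer weights form keys and values, and per-head attention over frames pools the values so that each head can attend to a different subspace. All dimensions and names are assumptions, and the thesis may adapt this mechanism differently to produce frame-level input for the diarization model.

```python
import torch
import torch.nn as nn

class MHFA(nn.Module):
    """MHFA-style back-end: layer-weighted keys/values and per-head attentive pooling."""

    def __init__(self, num_layers=13, hidden_dim=768, compressed_dim=128,
                 num_heads=8, out_dim=256):
        super().__init__()
        self.w_key = nn.Parameter(torch.zeros(num_layers))    # layer weights for keys
        self.w_value = nn.Parameter(torch.zeros(num_layers))  # layer weights for values
        self.compress_k = nn.Linear(hidden_dim, compressed_dim)
        self.compress_v = nn.Linear(hidden_dim, compressed_dim)
        self.att_head = nn.Linear(compressed_dim, num_heads)  # per-head attention logits
        self.out = nn.Linear(num_heads * compressed_dim, out_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, frames, hidden_dim)
        k = torch.einsum("l,lbfd->bfd", torch.softmax(self.w_key, 0), hidden_states)
        v = torch.einsum("l,lbfd->bfd", torch.softmax(self.w_value, 0), hidden_states)
        k, v = self.compress_k(k), self.compress_v(v)     # (batch, frames, compressed_dim)
        att = torch.softmax(self.att_head(k), dim=1)      # per-head weights over frames
        heads = torch.einsum("bfh,bfc->bhc", att, v)      # one pooled vector per head
        return self.out(heads.flatten(1))                 # (batch, out_dim)

backend = MHFA()
dummy = torch.randn(13, 2, 49, 768)      # stacked WavLM layers (dummy values)
print(backend(dummy).shape)              # torch.Size([2, 256])
```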