Self-supervised Wav2Vec2 audio representations for speaker recognition

Speaker: Laura Herrera.

Abstract: In this Final Degree Project, different speech representations, learned without supervision, have been used to train a speaker recognition system. The novelty lies in the use of Wav2Vec2.0 and WavLM features. Wav2Vec2.0 features are designed specifically for automatic speech recognition, whereas WavLM features are more general, as they are designed for multiple speech technology tasks.
First, a neural network has been used for unsupervised feature extraction. Unsupervised or semi-supervised learning is a significant advance over previous models because it mitigates the shortage of labelled data: the network can be trained on a large amount of unlabelled audio.
Second, another neural network has been designed for automatic speaker recognition; instead of Mel spectrograms, it takes Wav2Vec2.0 or WavLM features as input during training.
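The abstract does not describe the architecture of the speaker recognition network. As a minimal sketch of how a variable-length sequence of frame-level features can feed a speaker classifier, the NumPy example below assumes statistics pooling (mean and standard deviation over time) followed by a single linear layer and a softmax. The function names, the 768-dimensional feature size (that of the base Wav2Vec2.0 model), the four speakers and the random weights are illustrative assumptions, not the project's actual design.

```python
import numpy as np


def statistics_pooling(frames):
    """Collapse a (num_frames, feat_dim) matrix into one utterance-level vector.

    Assumption: mean and standard deviation over time are concatenated,
    giving a fixed-size vector of length 2 * feat_dim.
    """
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])


def speaker_posteriors(frames, weights, bias):
    """Hypothetical classifier head: pooling, linear layer, softmax."""
    pooled = statistics_pooling(frames)
    logits = pooled @ weights + bias
    logits -= logits.max()  # subtract the maximum for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


rng = np.random.default_rng(0)
feat_dim, num_speakers = 768, 4  # illustrative sizes

# Random stand-in for frame-level features of one utterance.
frames = rng.standard_normal((149, feat_dim))

# Untrained random parameters, used here only to exercise the shapes.
weights = rng.standard_normal((2 * feat_dim, num_speakers)) * 0.01
bias = np.zeros(num_speakers)

posteriors = speaker_posteriors(frames, weights, bias)
print(posteriors.shape)
```

Because pooling removes the time axis, utterances of different durations map to vectors of the same size, which is what lets a fixed-size classifier follow a frame-level feature extractor.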
Finally, the performance of several models has been compared: models trained on classical features such as Mel spectrograms, on Wav2Vec2.0 or WavLM features, and on a combination of Mel spectrograms and Wav2Vec2.0 features.
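The abstract does not say how the Mel spectrogram and Wav2Vec2.0 features were combined. One plausible approach, sketched below under stated assumptions, is to align the two streams in time and concatenate them frame by frame. The sketch assumes 80 Mel bands computed with a 10 ms hop and Wav2Vec2.0 frames of dimension 768 produced roughly every 20 ms, so each Wav2Vec2.0 frame is repeated twice. The function name `combine_features` and all sizes are hypothetical.

```python
import numpy as np


def combine_features(mel, w2v, repeat=2):
    """Concatenate two frame-level feature streams along the feature axis.

    mel: (num_mel_frames, mel_dim) array, assumed 10 ms hop.
    w2v: (num_w2v_frames, w2v_dim) array, assumed about 20 ms stride.
    repeat: how many times each w2v frame is repeated to match the mel rate.
    """
    # Repeat each slower-rate frame so both streams share one frame rate.
    w2v_upsampled = np.repeat(w2v, repeat, axis=0)
    # Trim both streams to the common length before concatenating.
    num_frames = min(len(mel), len(w2v_upsampled))
    return np.concatenate(
        [mel[:num_frames], w2v_upsampled[:num_frames]], axis=1
    )


# Constant stand-ins for real features, chosen so the layout is easy to check.
mel = np.zeros((300, 80))
w2v = np.ones((149, 768))

combined = combine_features(mel, w2v)
print(combined.shape)
```

Frame repetition is the simplest alignment; interpolating the slower stream or averaging pairs of Mel frames would be equally valid alternatives.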