Speaker: Arturo Domínguez Santos.

Abstract: This Master’s thesis investigates how emotions affect speaker
verification and proposes a system that incorporates this emotional variability
to improve accuracy. The focus is on the speaker’s emotional state, which
conventional approaches have traditionally overlooked. Emotions are known to
cause noticeable changes in vocal characteristics such as pitch, speech rate,
and intonation, degrading the accuracy and reliability of speaker verification
systems that do not account for these variations. The study proposes a method
that combines speaker recognition embeddings with emotional information in
order to improve the robustness and accuracy of the system in contexts where
emotional variability is significant.

The research builds on advanced deep learning models, namely ECAPA-TDNN and
DistilHuBERT, which have proven effective in speaker recognition and emotion
recognition tasks, respectively. These models, each with a different objective,
have been trained and evaluated on datasets such as VoxCeleb, known for its
wide variety of voices and recording conditions for speaker recognition, and
RAVDESS, which contains recordings of actors portraying the emotions under
study. The goal is to fine-tune the VoxCeleb-trained model on VoxMovies, a
dataset with much greater emotional variability and diversity, and to enrich
the input with emotional information extracted by the RAVDESS-trained model,
in order to analyze how this affects speaker verification.
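
As a rough illustration of the kind of pipeline described above, the Python
sketch below scores a verification trial by concatenating a speaker embedding
from a pretrained ECAPA-TDNN with a mean-pooled DistilHuBERT representation
standing in for the emotion branch. The checkpoints used here
(speechbrain/spkrec-ecapa-voxceleb, ntu-spml/distilhubert), the
normalize-and-concatenate fusion, and the decision threshold are all
assumptions for the sake of a minimal example; the thesis does not specify
the exact fusion strategy.

    import torch
    import torch.nn.functional as F
    from speechbrain.pretrained import EncoderClassifier        # ECAPA-TDNN
    from transformers import AutoFeatureExtractor, AutoModel    # DistilHuBERT

    # Pretrained encoders (stand-ins: the thesis models are additionally
    # fine-tuned on VoxMovies and RAVDESS, respectively).
    spk_encoder = EncoderClassifier.from_hparams(
        source="speechbrain/spkrec-ecapa-voxceleb")
    extractor = AutoFeatureExtractor.from_pretrained("ntu-spml/distilhubert")
    emo_encoder = AutoModel.from_pretrained("ntu-spml/distilhubert")

    def embed(wav: torch.Tensor) -> torch.Tensor:
        """Joint embedding for a mono 16 kHz waveform of shape (num_samples,)."""
        spk = spk_encoder.encode_batch(wav.unsqueeze(0)).squeeze()  # (192,)
        feats = extractor(wav.numpy(), sampling_rate=16000,
                          return_tensors="pt")
        # Mean-pool the frame-level hidden states into one utterance vector.
        emo = emo_encoder(**feats).last_hidden_state.mean(dim=1).squeeze()  # (768,)
        # Fusion by concatenating L2-normalised vectors (an assumed strategy).
        return torch.cat([F.normalize(spk, dim=0), F.normalize(emo, dim=0)])

    def verify(enrol_wav: torch.Tensor, test_wav: torch.Tensor,
               threshold: float = 0.5):
        """Return the cosine score of a trial and an accept/reject decision."""
        score = F.cosine_similarity(embed(enrol_wav), embed(test_wav),
                                    dim=0).item()
        return score, score >= threshold

Normalising each branch before concatenation keeps the speaker and emotion
components on a comparable scale, so neither dominates the cosine score; in
practice the threshold would be calibrated on a development set.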
The results of the study show that emotions play an important role in modeling
the voice, and several analyses of this effect are presented. However, the
results do not surpass the state of the art, owing to a number of limitations,
chief among them the lack of more complex databases representing scenarios
closer to real-world conditions.