Speaker: Guillermo Recio.
Abstract: Based on the paper https://www.isca-speech.org/archive/pdfs/interspeech_2022/vaessen22_interspeech.pdf. This work considers training neural networks for speaker recognition with smaller datasets than contemporary work. For this purpose, the authors propose three subsets of the VoxCeleb2 dataset. Each subset contains 50k audio files, compared to roughly 1M in the original dataset. The number of speakers, sessions, and utterances per session varies across the subsets. Three speaker recognition systems are trained on these subsets: the X-vector, ECAPA-TDNN, and wav2vec2 network architectures. Finally, they show that the self-supervised, pre-trained weights of wav2vec2 improve performance when training data is limited.
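To illustrate the idea of fixed-size subsets with varying composition, one way to build them is to keep the product of speakers, sessions per speaker, and utterances per session constant at 50k while trading these factors against each other. The sketch below is a hypothetical illustration, not the paper's actual sampling procedure; the helper name `sample_subset` and the example configurations are assumptions.

```python
import random

def sample_subset(files_by_speaker, n_speakers, n_sessions, n_utt, seed=0):
    """Sample a fixed-size subset of n_speakers * n_sessions * n_utt files.

    files_by_speaker maps speaker -> {session: [utterance files]}.
    Speakers, sessions, and utterances are drawn uniformly at random.
    (Hypothetical helper, not the paper's actual procedure.)
    """
    rng = random.Random(seed)
    speakers = rng.sample(sorted(files_by_speaker), n_speakers)
    subset = []
    for spk in speakers:
        sessions = rng.sample(sorted(files_by_speaker[spk]), n_sessions)
        for ses in sessions:
            subset.extend(rng.sample(files_by_speaker[spk][ses], n_utt))
    return subset

# Hypothetical configurations that all yield 50,000 files:
# many speakers with few utterances each, or few speakers with many.
configs = [(5000, 2, 5), (1000, 5, 10), (500, 10, 10)]
assert all(spk * ses * utt == 50_000 for spk, ses, utt in configs)
```

Holding the total file count fixed in this way isolates the effect of dataset *composition* (speaker diversity vs. per-speaker depth) from the effect of raw dataset *size*.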