Speaker: Koral Tubia.

Abstract: Rural Spanish preserves a rich dialectal diversity that has received little attention from a computational point of view, partly because most speech processing systems are trained on standard, urban speech. In this work, we use the COSER corpus (Corpus Oral y Sonoro del Español Rural) to ask a simple question: do audio embeddings contain enough information to automatically detect and group accents from nearby regions? We evaluate seven widely used embedding extractors, including Whisper, ECAPA-TDNN, Lang-ID, WavLM, XLS-R and CommonAccent, through unsupervised clustering and supervised classification. Results show that embeddings do capture some dialectal information, although it is heavily entangled with the speaker’s own identity; XLS-R achieves the best performance overall. As an additional finding, we show that speaker embeddings can automatically separate informants from interviewers, which allows the training corpus to be expanded without manual annotation.