Automatic Speech Recognition and Speaker Diarization in Spanish: Dialectal Varieties and Rural Speech

Speaker: Laura Herrera Alarcón

Abstract: Audio databases provide essential linguistic information, capturing verbal content as well as sociolinguistic features. This talk introduces the COSER corpus (Audible Corpus of Spoken Rural Spanish), together with training and test splits created to facilitate its use in Automatic Speech Recognition (ASR) and speaker diarization. The existing segments need to be validated and refined to ensure reliable evaluation and effective model training; to this end, an annotation interface based on the Potato Annotation framework has been developed. Initial results from ASR models (Whisper and WhisperX) and a diarization model (Pyannote.audio) reveal the challenges associated with non-standard speech. Continuing this work is essential to advance research on underrepresented rural varieties of Spanish and to further expand the corpus.