Automatic Speech Recognition in Dialectal Data (COSER)

Speaker: Clara Adsuar Ávila.

Abstract:

In this project, we aim to make Deep Learning technologies more accessible and useful for speakers of non-standard varieties. From a linguistic perspective, the rural areas of Spain are rich in dialectal variety. However, most speech technology is designed for the standard language and for unrealistic recording scenarios. To bridge this gap, we are developing an Automatic Speech Recognition (ASR) tool based on OpenAI’s Whisper model and adapted to dialectal speakers using the COSER corpus. Data analysis and pre-processing are crucial steps in creating effective datasets for this purpose. We have also reviewed the work of other universities and research groups that have developed oral dialectal corpora, and we explore various approaches to this challenge.

Two datasets were created for this project: one comprising 68 manually transcribed and manually synchronized interviews, and the other containing 111 manually transcribed interviews that were synchronized both automatically and manually. The first dataset was used to train Whisper-small, while the second was used to train three Whisper versions: small, medium, and large. The lowest Word Error Rate (WER) was achieved by Whisper-large on the second dataset (WER: 63.53). The experiments also reveal that poor audio quality negatively impacts transcription accuracy. On a positive note, Whisper has shown promising results in learning audible markers and new vocabulary. Finally, we present ongoing approaches aimed at improving ASR performance with dialectal data.
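
For readers unfamiliar with the metric, WER is conventionally defined as the number of word-level substitutions, deletions, and insertions needed to turn the system output into the reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration of that standard definition in Python; it is not the evaluation code used in this project. The function name `word_error_rate` and the example sentences are hypothetical, and any text normalization (casing, punctuation, treatment of dialectal markers) is left out. Note that the function returns a fraction, whereas the figure reported above appears to be on a percentage scale, which is our assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: splits on whitespace and applies no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution and one deletion over four reference words.
example_wer = word_error_rate("el perro come pan", "el gato come")
print(f"WER: {example_wer:.2f}")
```

Because insertions are counted against the reference length, WER can exceed 1.0 (100%) when the system outputs many spurious words, which is worth keeping in mind when reading scores on noisy field recordings.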