Speaker: W. Fernando López Gavilánez.
Abstract: High-quality data labeling for specific domains is costly and time-consuming for human annotators. In this work, we propose an iterative pseudo-forced alignment algorithm for long audio files with low-quality transcriptions. Alignments are computed iteratively, either by reducing the amount of text to be aligned or by expanding the alignment window, until the best possible match is found. The algorithm relies on temporal anchors that are derived solely from the confidence score of the last aligned utterance. The resulting alignments can be filtered by confidence score and used for any speech-related task.
Paper: https://arxiv.org/abs/2210.15226
HuggingFace: https://huggingface.co/Voyager1/asr-wav2vec2-commonvoice-es-finetuned-rtve
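
Illustrative sketch: the loop described in the abstract (shrink the text or widen the window until a confident match is found, then move the temporal anchor to the end of the last aligned utterance) might look roughly like the Python below. This is not the authors' implementation. The names align_iteratively, Alignment, toy_aligner and TRUTH, the threshold and window values, and the choice to try shorter text before a wider window are all assumptions made for illustration. A real system would replace toy_aligner with a scorer built on an acoustic model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical aligner signature: (words, window_start, window_end) ->
# (confidence in [0, 1], start_time, end_time) of the match in that window.
Aligner = Callable[[List[str], float, float], Tuple[float, float, float]]


@dataclass
class Alignment:
    words: List[str]
    start: float
    end: float
    confidence: float


def align_iteratively(
    utterances: List[List[str]],
    aligner: Aligner,
    audio_length: float,
    threshold: float = 0.8,     # assumed confidence cut-off
    window: float = 10.0,       # assumed initial window size, in seconds
    window_step: float = 10.0,  # assumed window growth per retry
    max_window: float = 60.0,   # assumed upper bound on the window size
) -> List[Alignment]:
    anchor = 0.0  # temporal anchor: where the next search window begins
    results: List[Alignment] = []
    for words in utterances:
        best = None
        size = window
        while best is None and size <= max_window:
            window_end = min(anchor + size, audio_length)
            # First reduce the amount of text, dropping trailing words.
            for n in range(len(words), 0, -1):
                conf, start, end = aligner(words[:n], anchor, window_end)
                if conf >= threshold:
                    best = Alignment(words[:n], start, end, conf)
                    break
            # If no prefix matched, expand the alignment window and retry.
            size += window_step
        if best is not None:
            results.append(best)
            # Only a confident alignment is allowed to move the anchor.
            anchor = best.end
    return results


# Toy stand-in for an acoustic scorer: known word timings, in seconds.
TRUTH = {"hola": (0.0, 1.0), "mundo": (1.0, 2.0),
         "buenas": (12.0, 13.0), "tardes": (13.0, 14.0)}


def toy_aligner(words, window_start, window_end):
    spans = [TRUTH.get(w) for w in words]
    if any(s is None for s in spans):
        # A word that is not in the audio gives zero confidence.
        return 0.0, window_start, window_start
    start, end = spans[0][0], spans[-1][1]
    inside = window_start <= start and end <= window_end
    return (1.0 if inside else 0.0), start, end


# The second transcript line ends with a word that is not in the audio.
transcript = [["hola", "mundo"], ["buenas", "tardes", "ruido"]]
alignments = align_iteratively(transcript, toy_aligner, audio_length=20.0)
# Filter by confidence before using the segments downstream.
confident = [a for a in alignments if a.confidence >= 0.9]
```

In this toy run the second utterance fails in the initial window, and only after the window is expanded does the shortened text (without the spurious last word) align. This mirrors the two fallback moves mentioned in the abstract.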