Speaker: William Fernando López

Abstract: Wake-up-word (WuW) detection is a fundamental component of voice assistants. Undesired activation of the device is often caused by external noise such as background conversations, TV, or music. At Telefónica we have been working on using a Speech Enhancement (SE) system in conjunction with the on-device “Ok Aura” WuW detector, aiming to reduce the number of false positives. The work we present uses convolutional autoencoders at the waveform level, some fully convolutional and others with RNN layers. SE models are trained using spectrum and waveform reconstruction losses along with the classifier loss (a task-aware loss). We experiment with a wide variety of classifiers and show that concatenating the SE with a WuW detector does not harm the recognition rate in noise-free environments, while the improvement is noticeable in very noisy conditions, even with very simple classifiers. Additionally, we found that training the SE and WuW detector models end-to-end provides the best results.
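
As a rough illustration of the task-aware loss mentioned in the abstract, the sketch below combines a waveform reconstruction term, a spectrum reconstruction term, and a classifier term into one training objective. The abstract does not give the exact loss definitions, so everything specific here is an assumption made for illustration: the L1 distances, the binary cross-entropy classifier term, the equal default weights, and all function and parameter names (`task_aware_loss`, `w_wave`, `w_spec`, `w_cls`). A naive DFT in pure Python stands in for whatever spectral transform is actually used.

```python
import cmath
import math


def waveform_loss(enhanced, clean):
    """Mean absolute error between two equal-length waveforms (assumed L1)."""
    return sum(abs(e - c) for e, c in zip(enhanced, clean)) / len(clean)


def magnitude_spectrum(signal):
    """Magnitudes of a naive O(n^2) DFT; adequate for a toy example only."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]


def spectrum_loss(enhanced, clean):
    """Mean absolute error between magnitude spectra (assumed L1)."""
    mag_e = magnitude_spectrum(enhanced)
    mag_c = magnitude_spectrum(clean)
    return sum(abs(a - b) for a, b in zip(mag_e, mag_c)) / len(mag_c)


def classifier_loss(wuw_probability, is_wuw):
    """Binary cross-entropy on the WuW detector output (assumed form)."""
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(wuw_probability, eps), 1.0 - eps)
    return -math.log(p) if is_wuw else -math.log(1.0 - p)


def task_aware_loss(enhanced, clean, wuw_probability, is_wuw,
                    w_wave=1.0, w_spec=1.0, w_cls=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (w_wave * waveform_loss(enhanced, clean)
            + w_spec * spectrum_loss(enhanced, clean)
            + w_cls * classifier_loss(wuw_probability, is_wuw))


# Toy usage: a 4-sample "clean" signal and an imperfect enhanced version.
clean = [0.0, 1.0, 0.0, -1.0]
enhanced = [0.0, 0.8, 0.1, -0.9]
loss = task_aware_loss(enhanced, clean, wuw_probability=0.7, is_wuw=True)
```

In an end-to-end setup, the gradient of the classifier term would flow back through the WuW detector into the SE model, so the enhancement is shaped by what helps detection and not only by reconstruction quality.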