Speaker: Rosa María Hornero Romera.
Abstract:
Presentation of the paper https://arxiv.org/abs/2109.00962.
Audio segmentation and sound event detection are essential tasks in machine listening, focused on identifying acoustic classes and their temporal boundaries. They play a key role in applications such as audio-content analysis, speech recognition, audio indexing, and music information retrieval.
In recent years, most research has relied on segmentation-by-classification, a technique that divides audio into small frames and classifies each frame independently. In this paper, we introduce a novel approach called You Only Hear Once (YOHO), inspired by the YOLO algorithm widely used in computer vision. Unlike frame-based classification, YOHO formulates acoustic boundary detection as a regression problem. It achieves this by using separate output neurons to both detect the presence of an audio class and predict its start and end points.
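To make the regression formulation concrete, below is a minimal sketch (Python/NumPy, not the authors' implementation) of how such an output could be decoded into events. The (T, C, 3) output layout, the decode_yoho_output function, the window length step_seconds, and the detection threshold are illustrative assumptions; the idea is simply that each output step carries, per acoustic class, a presence score plus normalized start and end offsets within its window.

```python
import numpy as np

def decode_yoho_output(pred, step_seconds=0.5, threshold=0.5):
    """Decode a YOHO-style output array of shape (T, C, 3) into events.

    Hypothetical layout (an assumption, not the paper's exact code):
    pred[t, c] = [presence, relative_start, relative_end], where the
    relative values are fractions of the window covered by step t.
    Returns a list of (class_index, onset_seconds, offset_seconds).
    """
    events = []
    T, C, _ = pred.shape
    for t in range(T):
        window_start = t * step_seconds
        for c in range(C):
            presence, rel_start, rel_end = pred[t, c]
            if presence >= threshold:
                onset = window_start + rel_start * step_seconds
                offset = window_start + rel_end * step_seconds
                events.append((c, onset, offset))
    return events

# Example: 2 output steps, 1 class; an event covering most of step 0.
example = np.zeros((2, 1, 3))
example[0, 0] = [0.9, 0.1, 0.8]   # present; starts at 10% and ends at 80% of the window
print(decode_yoho_output(example))  # -> [(0, 0.05, 0.4)]
```

In practice, consecutive windows that predict the same class would then be merged into a single event, which is the lightweight post-processing the abstract refers to.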
YOHO achieves a relative F-measure improvement of 1% to 6% over the state-of-the-art Convolutional Recurrent Neural Network across multiple datasets for audio segmentation and sound event detection. In addition, because YOHO operates in a more end-to-end manner with fewer neurons, inference is at least six times faster than with segmentation-by-classification. Finally, since it predicts acoustic boundaries directly, post-processing and smoothing are approximately seven times faster.