Speaker: Sara Barahona Quirós.

Abstract: Sound event detection is the task that aims to automate the human ability to recognize sound events in the environment from their characteristic acoustic information. For this purpose, deep learning techniques are employed to build systems capable of detecting and temporally localizing relevant sound events in an audio clip.

This Master’s Thesis focuses on building a system and evaluating it on Task 4 of the DCASE (Detection and Classification of Acoustic Scenes and Events) challenge, in which the DESED (Domestic Environment Sound Event Detection) dataset is employed to detect domestic sound events. For this task, it is crucial to exploit both local and global context to identify all types of sound events. Therefore, the proposed approach is a Conformer-based system, which is able to extract global and local features within a single architecture by combining attention mechanisms and convolutional networks.
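For illustration, below is a minimal sketch of one Conformer block in PyTorch (the framework is an assumption, as are the model dimension, number of heads, and kernel size), following the structure introduced by Gulati et al. (2020): macaron-style half-step feed-forward modules wrapped around multi-head self-attention, which captures global context, and a depthwise convolution module, which captures local context.

```python
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    """One Conformer block: half-step feed-forwards around self-attention
    (global context) and a depthwise convolution module (local context)."""

    def __init__(self, dim=144, heads=4, kernel=31, ff_mult=4, p=0.1):
        super().__init__()
        def feed_forward():
            return nn.Sequential(
                nn.LayerNorm(dim), nn.Linear(dim, ff_mult * dim), nn.SiLU(),
                nn.Dropout(p), nn.Linear(ff_mult * dim, dim), nn.Dropout(p))
        self.ff1, self.ff2 = feed_forward(), feed_forward()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, dropout=p,
                                          batch_first=True)
        self.conv_norm = nn.LayerNorm(dim)
        self.conv = nn.Sequential(
            nn.Conv1d(dim, 2 * dim, 1),                  # pointwise expansion
            nn.GLU(dim=1),                               # gating
            nn.Conv1d(dim, dim, kernel, padding=kernel // 2,
                      groups=dim),                       # depthwise conv (local)
            nn.BatchNorm1d(dim), nn.SiLU(),
            nn.Conv1d(dim, dim, 1), nn.Dropout(p))       # pointwise projection
        self.out_norm = nn.LayerNorm(dim)

    def forward(self, x):                  # x: (batch, frames, dim)
        x = x + 0.5 * self.ff1(x)          # half-step feed-forward
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]  # global context
        c = self.conv_norm(x).transpose(1, 2)              # (batch, dim, frames)
        x = x + self.conv(c).transpose(1, 2)               # local context
        return self.out_norm(x + 0.5 * self.ff2(x))
```

Stacking several such blocks on top of a convolutional front-end yields the kind of encoder that can attend to an entire clip while still modeling short acoustic patterns.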

In addition, models with different time-frequency resolutions are defined by modifying their feature extraction stage, with the aim of studying the effect of resolution on the detection of diverse sound event categories. These single-resolution models are then combined to build a multi-resolution system, which enhances the results since certain classes benefit from different resolution operating points. Performance is further improved by defining a class-wise post-processing, based on thresholding and median filtering, in which the final scores are transformed into activation sequences. Moreover, different objective metrics are studied for tuning this class-wise post-processing.
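As an illustration of the feature extraction stage, the sketch below defines mel spectrogram front-ends with different time-frequency trade-offs using torchaudio (an assumed tooling choice; the window, hop, and mel settings are placeholders, not the thesis configurations):

```python
import torchaudio

# Hypothetical single-resolution front-ends: longer analysis windows give
# finer frequency resolution, while shorter windows and hops give finer
# time resolution. Each one would feed its own single-resolution model.
RESOLUTIONS = {
    "high_time": dict(n_fft=512,  hop_length=128, n_mels=64),
    "balanced":  dict(n_fft=1024, hop_length=256, n_mels=128),
    "high_freq": dict(n_fft=2048, hop_length=256, n_mels=128),
}
frontends = {
    name: torchaudio.transforms.MelSpectrogram(sample_rate=16000, **params)
    for name, params in RESOLUTIONS.items()
}
```

The class-wise post-processing can likewise be sketched in a few lines: each class gets its own decision threshold and median filter length (the values below are hypothetical; in the thesis they are tuned against objective metrics):

```python
import numpy as np
from scipy.ndimage import median_filter

def class_wise_postprocess(scores, thresholds, filter_lengths):
    """Convert frame-level scores of shape (n_frames, n_classes), with
    values in [0, 1], into binary activation sequences per class."""
    activations = np.zeros(scores.shape, dtype=bool)
    for c in range(scores.shape[1]):
        binary = scores[:, c] > thresholds[c]      # class-wise threshold
        # Median filtering removes isolated false positives and fills
        # short gaps; the filter length is chosen per class.
        activations[:, c] = median_filter(binary.astype(float),
                                          size=filter_lengths[c]) > 0.5
    return activations

# Illustrative settings: short filters suit brief events (e.g. Dog,
# Dishes), long ones suit sustained events (e.g. Vacuum_cleaner).
scores = np.random.rand(626, 10)          # e.g. one clip, 10 DESED classes
thresholds = np.full(10, 0.5)
filter_lengths = [7, 7, 21, 11, 31, 7, 41, 11, 7, 27]
events = class_wise_postprocess(scores, thresholds, filter_lengths)
```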

The performance of the different models trained for this task is assessed by evaluating them on the two PSDS (Polyphonic Sound Detection Score) scenarios proposed for the DCASE challenge. The results show the effectiveness of employing a Conformer-based model over other architectures, as well as the benefits of the multi-resolution approach, which significantly outperforms the models trained with a single resolution.
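For reference, PSDS values are typically computed with the psds_eval package; the sketch below assumes its documented interface (PSDSEval, add_operating_point, psds) and the DCASE 2021 Task 4 scenario settings, where scenario 1 uses strict temporal collars (dtc = gtc = 0.7) to reward accurate localization and scenario 2 uses loose collars (0.1) but penalizes cross-class confusions (alpha_ct = 0.5).

```python
from psds_eval import PSDSEval

def evaluate_psds(detections_per_threshold, ground_truth, metadata,
                  dtc=0.7, gtc=0.7, cttc=0.3, alpha_ct=0.0, alpha_st=1.0):
    """Compute a PSDS value from a set of operating points.

    detections_per_threshold: list of dataframes, one per decision
        threshold, with columns [filename, onset, offset, event_label].
    ground_truth: dataframe with the same columns for reference events.
    metadata: dataframe with columns [filename, duration].
    """
    evaluator = PSDSEval(dtc_threshold=dtc, gtc_threshold=gtc,
                         cttc_threshold=cttc,
                         ground_truth=ground_truth, metadata=metadata)
    for det in detections_per_threshold:   # one ROC point per threshold
        evaluator.add_operating_point(det)
    return evaluator.psds(alpha_ct=alpha_ct, alpha_st=alpha_st,
                          max_efpr=100).value

# Scenario 1 (localization): evaluate_psds(dets, gt, meta)
# Scenario 2 (confusion):    evaluate_psds(dets, gt, meta,
#                                          dtc=0.1, gtc=0.1, alpha_ct=0.5)
```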