Speaker: Sara Barahona Quirós.

Abstract: This paper presents a Conformer-based sound event detection (SED) method, which uses semi-supervised learning and data augmentation. The proposed method employs Conformer, a convolution-augmented Transformer that exploits local features of audio data more effectively with convolution, while global features are captured by self-attention. For SED, both global information on background sound and local information on foreground sound events are essential for modeling and identifying various types of sounds. Since Conformer can capture both global and local features within a single architecture, our proposed method models the varied characteristics of sound events effectively. In addition to this architecture, we further improve performance by utilizing a semi-supervised learning technique, data augmentation, and post-processing optimized for each sound event class. We demonstrate the performance of our proposed method through experimental evaluation on the DCASE2020 Task4 dataset. Our experimental results show that the proposed method achieves an event-based macro F1 score of 50.6% on the validation set, significantly outperforming the baseline score of 34.8%. Our system achieved a score of 51.1% on the DCASE2020 challenge's evaluation set, the best result among the 72 submissions.
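To illustrate the core idea of the abstract, the sketch below shows how a Conformer-style block combines self-attention (global context across all time frames) with a depthwise convolution (local context from neighboring frames). This is a minimal toy illustration, not the authors' actual model: the shapes, kernel size, and omission of the feed-forward and normalization modules are assumptions made for brevity.

```python
# Toy sketch of the Conformer idea: self-attention mixes information
# globally across all frames, while a depthwise convolution mixes it
# locally over neighboring frames. Sizes here are illustrative only.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Global context: every frame attends to every other frame.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)        # (T, T) attention scores
    return softmax(scores, axis=-1) @ x  # (T, d) context-mixed features

def depthwise_conv(x, kernel_size=3):
    # Local context: each channel is averaged over nearby frames.
    pad = kernel_size // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        out[t] = xp[t:t + kernel_size].mean(axis=0)
    return out

def conformer_block(x):
    # Conformer-style ordering: attention then convolution, each with a
    # residual connection (feed-forward modules and norms omitted).
    x = x + self_attention(x)
    x = x + depthwise_conv(x)
    return x

frames = np.random.default_rng(0).standard_normal((8, 4))  # (time, features)
out = conformer_block(frames)
print(out.shape)  # (8, 4)
```

In the real model, attention captures long-range cues such as background acoustics, while the convolution module sharpens short foreground events; stacking such blocks lets both scales inform each frame-level SED prediction.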