Speaker: Sara Barahona Quirós.

Abstract:

The Conformer architecture classifies sound events accurately but lacks temporal precision when predicting their time boundaries. Increasing the length of the input sequences can mitigate this issue, but it also increases model complexity and the risk of overfitting. To address this challenge, we propose leveraging the progressive downsampling and grouped attention mechanisms of the Efficient Conformer, which allow longer input sequences while maintaining efficiency. Additionally, we incorporate Squeeze-and-Excitation modules that enhance the CNN-based feature extraction by applying attention along the frequency and channel dimensions. This strategy improves input quality for our Efficient Conformer system without significantly increasing the number of parameters. We evaluated our method using the setup proposed for the DCASE Challenge 2023 Task 4A, which addresses the lack of strong annotations in audio recordings by exploiting both real and synthetic data, as well as weak labels and unlabeled data. Our proposed models outperform previous Conformer-based systems for sound event detection while achieving promising results in terms of efficiency.
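
As a rough illustration of why progressive downsampling and grouped attention make longer inputs affordable, the sketch below counts the multiply-accumulate operations needed for the attention score matrix alone. It assumes the usual description of grouped attention, in which g neighbouring frames are concatenated along the feature axis so that n frames of dimension d become n/g tokens of dimension g*d. The function names, strides, sequence length, and feature dimension are hypothetical and are not the configuration used in this work.

```python
def stage_lengths(input_len, strides):
    """Sequence length after each progressive downsampling stage.

    `strides` is a hypothetical list of per-stage temporal strides.
    """
    lengths = [input_len]
    for stride in strides:
        # Ceiling division: a strided layer keeps a final partial window.
        lengths.append(-(-lengths[-1] // stride))
    return lengths


def attention_score_cost(seq_len, dim, group_size=1):
    """Approximate multiply-accumulates for the Q*K^T score matrix.

    Grouping g frames gives (n/g) tokens of size g*d, so the cost is
    (n/g)^2 * (g*d) = n^2 * d / g. Only the score computation is counted;
    projections and the value product are ignored (an assumption made
    to keep the sketch short).
    """
    if seq_len % group_size != 0:
        raise ValueError("seq_len must be divisible by group_size")
    grouped_len = seq_len // group_size
    grouped_dim = dim * group_size
    return grouped_len ** 2 * grouped_dim


# Hypothetical example: 1000 input frames, two stride-2 stages, d = 144.
lengths = stage_lengths(1000, [2, 2])
plain_cost = attention_score_cost(lengths[0], 144)
grouped_cost = attention_score_cost(lengths[0], 144, group_size=2)
```

Under these assumptions, a group size of 2 halves the score-matrix cost at the first stage, and each stride-2 downsampling stage cuts the cost of the following attention layers by a further factor of four.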