Speaker: Sergio Segovia González.

Abstract:

Traditional sound event detection (SED) approaches rely either on specialized models or on such models combined with general-purpose audio embedding extractors. In this article we propose to reframe SED as an object detection task in the time–frequency plane and introduce a direct adaptation of modern YOLO detectors to audio. To our knowledge, this is among the first works to employ YOLOv8 and YOLOv11 not merely as feature extractors but as end-to-end models that localize and classify sound events on mel-spectrograms. Methodologically, our approach (i) generates mel-spectrograms on the fly from raw audio to streamline the pipeline and enable transfer learning from vision models; (ii) applies curriculum learning that exposes the detector to progressively more complex mixtures, improving robustness to overlaps; and (iii) augments training with synthetic audio constructed under DCASE 2023 guidelines to enrich rare classes and challenging scenarios. Comprehensive experiments on a DCASE-style setting compare our YOLO-based framework against strong CRNN and Conformer baselines: the method achieves competitive detection accuracy, with gains in some overlapping/noisy conditions and shortcomings for several short-duration classes. These results suggest that adapting modern object detectors to audio can be effective in this setting, while broader generalization and encoder-augmented comparisons remain open.
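To make the core idea concrete, the sketch below shows how a raw waveform could be turned into a mel-spectrogram image and fed to an off-the-shelf YOLOv8 detector. This is a minimal illustration, not the authors' code: the weights file, mel parameters, file name, and helper function are assumptions, and the actual pipeline fine-tunes the detector on time–frequency bounding boxes of sound events.

```python
# Minimal sketch (assumed parameters, not the authors' implementation):
# waveform -> mel-spectrogram image -> YOLOv8 detection.
import numpy as np
import torch
import torchaudio
from ultralytics import YOLO

def waveform_to_mel_image(waveform: torch.Tensor, sample_rate: int) -> np.ndarray:
    """Convert a mono waveform of shape (1, T) into an 8-bit, 3-channel mel-spectrogram image."""
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=2048, hop_length=256, n_mels=128
    )(waveform)                                          # (1, n_mels, frames)
    mel_db = torchaudio.transforms.AmplitudeToDB()(mel).squeeze(0)
    norm = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)
    img = (norm.numpy() * 255).astype(np.uint8)          # scale to 0..255
    return np.stack([img] * 3, axis=-1)                  # replicate to 3 channels (H, W, 3)

# Hypothetical usage: in practice, weights fine-tuned on spectrogram boxes
# would replace the generic "yolov8n.pt" checkpoint.
waveform, sr = torchaudio.load("clip.wav")               # placeholder audio file
image = waveform_to_mel_image(waveform.mean(dim=0, keepdim=True), sr)
detector = YOLO("yolov8n.pt")
results = detector(image)
for box in results[0].boxes:
    # Each box spans time (x axis) and frequency (y axis) with a class and confidence.
    print(box.cls, box.conf, box.xyxy)
```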