Seeing Sound: From Computer Vision to Sound Event Detection

Speaker: Sergio Segovia González.

Abstract: This talk traces the trajectory of my PhD from image- and video-based AI to its later transfer to audio, specifically Sound Event Detection (SED). The central idea is that visual perception methods can inspire audio event localization by treating spectrograms as structured detection spaces. I will discuss the YOLO-based SED framework, the role of curriculum learning and event morphology, and the limits of vision-to-audio transfer when compared with audio-native models.
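
To make the idea of a spectrogram as a detection space concrete, here is a minimal sketch of how a YOLO-style bounding box on a spectrogram image could be read back as a sound event. It is an illustration under stated assumptions, not the framework presented in the talk: the box uses the common normalized (center x, center y, width, height) convention, the horizontal axis is time, the vertical axis is a linear frequency axis with the highest frequency at the top of the image, and the names `SoundEvent` and `box_to_event` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SoundEvent:
    onset_s: float    # event start time in seconds
    offset_s: float   # event end time in seconds
    f_low_hz: float   # lower frequency bound in Hz
    f_high_hz: float  # upper frequency bound in Hz


def _clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def box_to_event(cx: float, cy: float, w: float, h: float,
                 clip_duration_s: float, max_freq_hz: float) -> SoundEvent:
    """Map a normalized box on a spectrogram image to time and frequency bounds.

    Assumptions (not taken from the talk): coordinates are in [0, 1],
    x grows with time, y grows downward as in image coordinates, and the
    frequency axis is linear from 0 Hz (bottom) to max_freq_hz (top).
    """
    # The horizontal extent of the box gives the onset and offset.
    onset = _clamp((cx - w / 2) * clip_duration_s, 0.0, clip_duration_s)
    offset = _clamp((cx + w / 2) * clip_duration_s, 0.0, clip_duration_s)

    # The vertical extent gives the frequency band; the image y axis is
    # flipped because y = 0 is the top row (highest frequency).
    f_high = _clamp((1.0 - (cy - h / 2)) * max_freq_hz, 0.0, max_freq_hz)
    f_low = _clamp((1.0 - (cy + h / 2)) * max_freq_hz, 0.0, max_freq_hz)

    return SoundEvent(onset, offset, f_low, f_high)


# Example: a box centered in a 10-second clip with an 8 kHz frequency ceiling.
event = box_to_event(0.5, 0.5, 0.2, 0.5, clip_duration_s=10.0, max_freq_hz=8000.0)
print(event)
```

In this example the box spans roughly 4 s to 6 s in time and 2000 Hz to 6000 Hz in frequency. A mel or logarithmic frequency axis would need a different vertical mapping.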