Language-Based Audio Retrieval (DCASE Evaluations)

Speaker: Doroteo Torre Toledanos.

Abstract:

Language-based audio retrieval is the task of retrieving audio segments whose content matches a natural-language description. The task was first proposed in the DCASE Challenge in 2022 as a subtask of the audio captioning task, and has since been run as a standalone task. Typical systems use pretrained text and audio embedding extractors, which are fine-tuned via contrastive learning so that audio and text embeddings are aligned in a shared embedding space. In this talk we will review the state of the art in this task and the details of the current challenge.
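
As a rough illustration of the contrastive alignment mentioned in the abstract, the following sketch computes a symmetric cross-entropy loss over a batch of paired audio and text embeddings, then ranks audio clips for a text query by cosine similarity. It is not taken from any particular challenge system: the function names, the toy two-dimensional embeddings, and the temperature value of 0.07 are all assumptions chosen for illustration, and a real system would obtain the embeddings from trained audio and text encoders.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(audio_embs, text_embs):
    """Cosine similarity between every audio embedding (rows) and text embedding (columns)."""
    a = [l2_normalize(v) for v in audio_embs]
    t = [l2_normalize(v) for v in text_embs]
    return [[sum(x * y for x, y in zip(ai, tj)) for tj in t] for ai in a]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Average of audio-to-text and text-to-audio losses over paired embeddings.

    The temperature is an assumed, illustrative value.
    """
    sim = similarity_matrix(audio_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-audio direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


def rank_audio(query_emb, audio_embs):
    """Return audio indices sorted from most to least similar to the text query."""
    sims = similarity_matrix(audio_embs, [query_emb])
    scores = [row[0] for row in sims]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy embeddings (hypothetical): item i of each list describes the same clip.
audio = [[1.0, 0.0], [0.0, 1.0]]
text = [[0.9, 0.1], [0.1, 0.9]]

loss = symmetric_contrastive_loss(audio, text)
ranking = rank_audio(text[0], audio)
```

Training lowers this loss by pulling each caption toward its own clip and pushing it away from the other clips in the batch; at retrieval time only the ranking step is needed.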