Speaker: Laura Herrera Alarcón
Abstract: Based on https://arxiv.org/pdf/2301.11325.pdf. This paper presents MusicLM, a new model for generating high-fidelity music from text descriptions. It combines three models, SoundStream, w2v-BERT, and MuLan, to produce high-quality audio that stays temporally coherent over several minutes. The results surpass previous state-of-the-art systems. In addition, the authors propose an extension that accepts both a descriptive text and a whistled or hummed melody as input to the model. They have also released the dataset created for evaluation, MusicCaps, which contains 5,500 music clips, each paired with a text description.
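
As a rough illustration of how the three components fit together, the sketch below traces the hierarchical pipeline described in the paper: MuLan embeds the text prompt, a semantic stage predicts w2v-BERT-style tokens that capture long-term structure, and an acoustic stage predicts SoundStream tokens that the SoundStream decoder turns back into audio. MusicLM is not publicly released, so every class and function here is a hypothetical stub standing in for the real models, not an actual API.

# Hypothetical sketch of MusicLM's pipeline (https://arxiv.org/abs/2301.11325).
# All classes are toy stubs: they mimic the data flow, not the real models.
import random

class MuLanEncoder:
    """Stub: maps a text prompt into MuLan's joint music/text embedding space."""
    def embed_text(self, prompt: str) -> list[float]:
        random.seed(hash(prompt) % 2**32)  # deterministic toy embedding
        return [random.random() for _ in range(128)]

class SemanticModel:
    """Stub: autoregressively predicts w2v-BERT semantic tokens
    conditioned on the MuLan embedding (long-term structure)."""
    def generate(self, conditioning: list[float], n_tokens: int) -> list[int]:
        return [random.randrange(1024) for _ in range(n_tokens)]

class AcousticModel:
    """Stub: predicts SoundStream acoustic tokens conditioned on the
    MuLan embedding and the semantic tokens (fine acoustic detail)."""
    def generate(self, conditioning: list[float],
                 semantic_tokens: list[int]) -> list[int]:
        return [random.randrange(1024) for _ in range(len(semantic_tokens) * 4)]

class SoundStreamDecoder:
    """Stub: SoundStream's decoder reconstructs a waveform from acoustic tokens."""
    def decode(self, acoustic_tokens: list[int]) -> list[float]:
        return [t / 1024.0 for t in acoustic_tokens]  # fake audio samples

def generate_music(prompt: str) -> list[float]:
    mulan = MuLanEncoder().embed_text(prompt)                # 1. text -> embedding
    semantic = SemanticModel().generate(mulan, n_tokens=50)  # 2. coarse structure
    acoustic = AcousticModel().generate(mulan, semantic)     # 3. fine acoustics
    return SoundStreamDecoder().decode(acoustic)             # 4. tokens -> audio

audio = generate_music("relaxing jazz with a whistled melody")
print(f"generated {len(audio)} audio samples")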