Speaker: María Pilar Fernández Rodríguez

Abstract: Speech applications dealing with conversations require not only recognizing the spoken words, but also determining who spoke when, the language, punctuation, capitalization, and so on. This is typically addressed by merging the outputs of several separate systems, each trained independently with a different objective function. Thanks to recent advances in sequence-to-sequence learning, it is increasingly common to find works that tackle several of these tasks jointly. This presentation reviews several works that combine diarization with ASR, or ASR with language identification, opening up many possibilities in the speech recognition field.