Exploring sequence-to-sequence transformer-transducer models for keyword spotting

Speaker: Beltrán Labrador Serrano.

Abstract: Beltrán’s final Google research internship presentation. The talk introduces a transformer-transducer keyword spotting (TT-KWS) system that simultaneously optimizes ASR and keyword spotting losses using a sequence-to-sequence RNN-T loss. The losses are further balanced using sequence-discriminative minimum Bayes risk (MBR) training. At inference, we take inspiration from the efficiency of end-to-end keyword spotting systems and use simple 1-best decoding with a decision function that depends exclusively on the instantaneous softmax output at a given input frame. Overall, the proposed TT-KWS system outperforms the classic ASR system and performs similarly to the classic keyword spotting system while bringing the advantages of sequence-to-sequence training. Furthermore, when combined with an end-to-end keyword spotting system, the TT-KWS system can reduce FNR by 15% at a low 0.5 FP rate. Paper: https://arxiv.org/abs/2211.06478.
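
As a rough illustration of the inference step described in the abstract, the sketch below shows what a decision function that looks only at the softmax output of the current frame could look like. It is not the paper's implementation: the function names, the per-frame logit layout, the `keyword_index` parameter, and the fixed `threshold` are all assumptions made for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one frame's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def detect_keyword(frame_logits, keyword_index, threshold=0.5):
    """Return the index of the first frame that fires, or None.

    Hypothetical decision rule: a frame fires when the softmax
    probability of the keyword token at that frame alone reaches
    the threshold. No history or lattice is consulted.
    """
    for t, logits in enumerate(frame_logits):
        if softmax(logits)[keyword_index] >= threshold:
            return t
    return None


# Toy logits for three frames over a three-token vocabulary
# (made-up numbers; index 2 plays the role of the keyword token).
frames = [
    [2.0, 0.1, 0.0],
    [1.0, 0.5, 1.5],
    [0.0, 0.2, 4.0],
]
hit = detect_keyword(frames, keyword_index=2, threshold=0.8)
print(hit)
```

Because the rule uses only the current frame, it can run in a streaming fashion with constant memory; in this sketch the threshold is the single knob trading false positives against false negatives.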