Speaker: Sergio Izquierdo.

Abstract: The recently proposed End-to-End Neural Diarization (EEND) framework handles speech overlap and speech activity detection natively. While extensions of this work have reported remarkable results in both two-speaker and multi-speaker diarization scenarios, these come at the cost of a long training process that requires considerable memory and computational power. In this work, we explore the integration of efficient transformer variants into the Self-Attentive EEND with Encoder-Decoder-based Attractors (SA-EEND EDA) architecture. Since SA-EEND EDA is built on Transformers, its training cost is dominated by the quadratic time and memory complexity of the self-attention mechanism. We verify that replacing it with a linear attention mechanism in SA-EEND EDA decreases GPU memory usage by 22%. We conduct experiments to measure how the increased efficiency of the training process translates into the two-speaker diarization error rate (DER) on CALLHOME, quantifying the impact of increasing the batch size, model size, or sequence length on training time and diarization performance. In addition, we propose an architecture combining linear and softmax attention that achieves a 12% training speed-up with a small relative DER degradation of 2%, while using the same GPU memory as the softmax attention baseline.
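As background for the mechanism the abstract refers to, the sketch below contrasts standard softmax attention, whose n x n score matrix drives the quadratic cost, with a kernelized linear attention in the style the talk discusses. This is a minimal NumPy illustration under assumed conventions (single head, elu(x)+1 feature map), not the implementation used in the work.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard attention: materializes an n x n score matrix,
    # so time and memory grow quadratically with sequence length n.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, eps=1e-6):
    # Kernelized attention: replace the softmax with a feature map phi
    # (here elu(x) + 1, a common choice) and reassociate the product as
    # phi(Q) (phi(K)^T V), so the n x n matrix is never formed and the
    # cost is linear in sequence length.
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                      # (d, d_v) summary, independent of n
    Z = Qf @ Kf.sum(axis=0) + eps      # per-query normalizer
    return (Qf @ KV) / Z[:, None]

# Toy sizes: sequence length 8, feature dimension 4.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8, 4))
out = linear_attention(Q, K, V)
print(out.shape)  # (8, 4)
```

The memory saving reported in the abstract comes from this reassociation: the (d, d_v) summary replaces the (n, n) attention matrix, which matters most at the long sequence lengths used for diarization.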