End-to-end deep learning models for air traffic control speech recognition

Speaker: Ana Belén Fernández Cordero.

Abstract: For many years, air traffic controllers have had to manually type the information they received from and transmitted to pilots into electronic flight strip systems. This time-consuming activity significantly increased the controllers' workload, leaving them overwhelmed in periods of high air transport demand such as the one expected in the coming years. To ease and speed up this task, different Automatic Speech Recognition techniques began to be implemented to fully automate the transcription of conversations between controllers and pilots. This Master's thesis investigates the results of applying current Deep Learning systems as one of these techniques. The project focuses on the study and application of neural networks based on the most advanced state-of-the-art architectures, such as Transformers and Conformers, which convert spoken words into machine-readable text with high accuracy. To this end, a set of such models, pre-trained on the Librispeech corpus of read audiobooks, is evaluated on the HIWIRE database, which is specific to aeronautical cockpit communication. In addition, to provide a fair evaluation, a series of modifications is applied to the HIWIRE sentences, which are also classified according to whether or not their words appear in the Librispeech sentences. The resulting error rates are around 25% for clean audio, approximately six times higher than those obtained with classical models, and around 80% for high-noise audio, equal to or even lower than those obtained with classical models.
These results demonstrate, on the one hand, that models trained on generic databases perform poorly when applied, without any adaptation, to a highly specific context such as the aeronautical one and, on the other, that current architectures are effective at this task, since the errors obtained are relatively small given the experimental conditions described. In conclusion, the outcomes of this project support the argument that training and evaluation should use similar databases when working with highly specific data, and that Automatic Speech Recognition models should be built on modern Deep Learning techniques to achieve highly accurate transcriptions.
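
As an illustration of the evaluation described in the abstract, the following is a minimal sketch of two steps it mentions: scoring a transcription and classifying test sentences by whether all of their words appear in the training text. It assumes that the reported "levels of error" are word error rates (WER), which the abstract does not state explicitly. The function names and the example sentences are hypothetical and are not taken from the thesis, HIWIRE, or Librispeech.

```python
from typing import Iterable, List, Set, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def build_vocabulary(sentences: Iterable[str]) -> Set[str]:
    """Collect the set of lowercased words seen in the training sentences."""
    return {word for sentence in sentences for word in sentence.lower().split()}


def split_by_vocabulary(
    sentences: Iterable[str], vocabulary: Set[str]
) -> Tuple[List[str], List[str]]:
    """Separate sentences whose words are all in the vocabulary from the rest."""
    in_vocab: List[str] = []
    out_of_vocab: List[str] = []
    for sentence in sentences:
        words = sentence.lower().split()
        if all(word in vocabulary for word in words):
            in_vocab.append(sentence)
        else:
            out_of_vocab.append(sentence)
    return in_vocab, out_of_vocab


# Hypothetical data, for illustration only.
training_text = ["he climbed to the next level", "select the flight"]
test_sentences = ["select the next level", "select squawk ident"]
vocabulary = build_vocabulary(training_text)
seen, unseen = split_by_vocabulary(test_sentences, vocabulary)
```

Reporting the error separately for the two groups would show how much of it is attributable to domain vocabulary that the pre-trained model never encountered.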