This is a list of open Master’s Thesis Proposals in our group.

If you’re interested in a Master’s Thesis in our group different from those listed here, send an e-mail to audias@uam.es.

If you’re looking for a Bachelor’s Thesis (TFG), send an e-mail to audias@uam.es including a record of your grades.

List of Open Master’s Thesis Proposals

Automatic Language Recognition based on Embeddings

CONTACT: Alicia Lozano Díez (alicia.lozano@uam.es)

The task of automatic language recognition (or language identification, LID) aims to determine which language is spoken in a given audio recording. The technology developed for this task has evolved in recent years from classical modeling based on factor analysis and i-vectors over acoustic features to the use of bottleneck (BN) features, deep neural network embeddings (DNN embeddings or x-vectors) and end-to-end DNN classifiers. As with speaker recognition, research on LID has been driven by the series of NIST technology evaluations, which provide a common framework to test different approaches under challenging conditions (similar languages, short utterances, etc.).

The aim of this Master’s Thesis is to build different LID systems, mainly based on DNNs, analyzing and comparing the strengths and weaknesses of each method. The systems will be evaluated on benchmarks provided by the well-known NIST Language Recognition Evaluations (LREs).
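As a rough illustration of the embedding-based approach (not the exact architecture to be used in the Thesis), the following PyTorch sketch shows a minimal x-vector-style language classifier: frame-level 1-D convolutions (TDNN layers), statistics pooling over time, and a linear embedding layer followed by a language classifier. The layer sizes, the number of target languages and the input features are placeholders.

```python
import torch
import torch.nn as nn

class XVectorLID(nn.Module):
    """Minimal x-vector-style language classifier (illustrative sizes only)."""
    def __init__(self, n_feats=40, n_langs=14, emb_dim=256):
        super().__init__()
        # Frame-level TDNN layers implemented as 1-D convolutions over time
        self.frame_net = nn.Sequential(
            nn.Conv1d(n_feats, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        self.embedding = nn.Linear(2 * 1500, emb_dim)   # after mean+std pooling
        self.classifier = nn.Linear(emb_dim, n_langs)

    def forward(self, x):
        # x: (batch, n_feats, n_frames), e.g. MFCC or filterbank features
        h = self.frame_net(x)
        # Statistics pooling: concatenate mean and standard deviation over time
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        emb = self.embedding(stats)          # utterance-level embedding (x-vector)
        return self.classifier(emb), emb

model = XVectorLID()
feats = torch.randn(8, 40, 300)              # dummy batch: 8 utterances, 300 frames
logits, xvectors = model(feats)
print(logits.shape, xvectors.shape)          # torch.Size([8, 14]) torch.Size([8, 256])
```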


Sound Event Detection in a large-scale audio dataset with multi-resolution neural networks

CONTACT: Diego de Benito Gorrón (diego.benito@uam.es)

Sound Event Detection (SED) is the task of finding the temporal boundaries of certain acoustic events (e.g., Speech or Dog) in audio recordings. SED has gained relevance in recent years due to the creation of specific datasets such as Google AudioSet or DESED (Domestic Environment Sound Event Detection) and the introduction of competitive evaluations like the DCASE Challenge (Detection and Classification of Acoustic Scenes and Events). The objective of this Master’s Thesis is to improve the performance of Sound Event Detection systems based on convolutional neural networks through the use of acoustic features extracted at several time-frequency resolutions, thus obtaining multi-resolution systems. This approach helps to exploit the different temporal and spectral characteristics exhibited by each sound event category.
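As an illustration of the multi-resolution idea (window sizes and feature settings are only assumptions, not the configuration of the final systems), the following sketch extracts mel spectrograms of the same recording at several time-frequency resolutions using torchaudio:

```python
import torch
import torchaudio

# Illustrative analysis windows (in samples at 16 kHz); the Thesis may use other values.
resolutions = {
    "fine_time": {"n_fft": 512,  "hop_length": 128},   # better temporal resolution
    "balanced":  {"n_fft": 1024, "hop_length": 256},
    "fine_freq": {"n_fft": 2048, "hop_length": 512},   # better spectral resolution
}

waveform = torch.randn(1, 16000 * 10)   # placeholder: 10 s of audio at 16 kHz

features = {}
for name, cfg in resolutions.items():
    melspec = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64, **cfg)
    features[name] = melspec(waveform)   # (channels, n_mels, n_frames)
    print(name, features[name].shape)

# Each resolution yields a different time-frequency trade-off; a multi-resolution
# SED system feeds these representations (or a combination) to convolutional networks.
```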


Calibration of Language Recognition Systems

CONTACT: Daniel Ramos (daniel.ramos@uam.es)

Calibration is paramount for critical decision-making and reliable probabilistic classifiers. Language recognition has pioneered the calibration of multiclass classifiers through the use of Gaussian back-ends, multiclass logistic regression or linear transformations. Recently, several techniques have been proposed to calibrate multiclass probabilistic classifiers based on deep neural networks, such as temperature scaling, Gaussian processes or Bayesian neural networks, with potential improvements over classical techniques in any probabilistic classifier, including a language recognizer. However, the experimental set-ups for these proposals are typically based on image classification tasks where the classes are equally represented and there are plenty of training data for all of them. In contrast, the language recognition task is challenging due to poorly represented languages, since data scarcity in some of the classes is fairly typical, and due to the extreme variability of the speech signal. These difficulties are a serious burden when trying to adapt the aforementioned recent proposals to the language recognition task, an issue that remains unexplored.

In this Master’s Thesis, the student will explore different calibration algorithms in language recognition, from classical techniques to more recent proposals. If any of these algorithms have not been used before for this task (for instance, temperature scaling), they will be adapted to cope with the typically unbalanced data available for training each language. We will analyze the performance of the different calibration algorithms and their behavior, strengths and limitations. Finally, we will base our experimental set-ups on widely known benchmarks such as the NIST Language Recognition Evaluations.
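As a minimal illustration of one of the techniques mentioned above, the following sketch fits temperature scaling on held-out validation logits in plain PyTorch. The data are random placeholders, and adapting the loss to unbalanced language classes (e.g., per-class weights) is precisely the kind of modification the Thesis would study.

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, n_iters=200, lr=0.01):
    """Learn a single temperature T that minimizes the NLL on a validation set."""
    log_T = torch.zeros(1, requires_grad=True)          # optimize log(T) so that T > 0
    optimizer = torch.optim.Adam([log_T], lr=lr)
    for _ in range(n_iters):
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_T.exp(), val_labels)
        loss.backward()
        optimizer.step()
    return log_T.exp().item()

# Dummy example: 1000 validation trials over 10 languages
val_logits = torch.randn(1000, 10) * 3.0                 # over-confident logits
val_labels = torch.randint(0, 10, (1000,))
T = fit_temperature(val_logits, val_labels)
calibrated_probs = F.softmax(val_logits / T, dim=1)      # calibrated posteriors
print("Learned temperature:", T)
```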

This work is expected to lead to a scientific publication.


Hierarchical Bayesian Models for Forensic Comparison

CONTACT: Daniel Ramos (daniel.ramos@uam.es)

The goal of forensic science is to use scientific procedures to aid decision making in judicial trials. The general approach is that a decision maker (typically a judge or jury) needs the information conveyed by a certain piece of evidence in order to make their decision in the best possible way. For example, in a hit-and-run case in which the driver fled, there may be traces of glass on the victim’s clothing. If the police later arrest a suspect who owns a car with a broken windshield, a glass sample from that windshield can be compared with the glass recovered from the victim’s clothing in order to relate the two. Thus, the decision maker can obtain information on whether the glass on the victim comes from the suspect’s vehicle or not.


In this Master’s Thesis, the problem of forensic glass comparison is explored using a Bayesian decision framework, following current recommendations from forensic institutions in Europe. The comparison of the glass samples will be carried out using hierarchical Bayesian models, which model both samples probabilistically while taking into account all the uncertainty of the problem. The starting point is to normalize and transform the glass features, extracted using the analytical chemistry technique known as Laser Ablation Inductively Coupled Plasma Mass Spectrometry (LA-ICP-MS). The objective is that the incorporation of uncertainty through fully Bayesian models will allow the results of the comparison to be expressed as well-calibrated Bayes factors, leading to adequate decisions by the judge or jury. Although the Thesis is intended to start with a comparison of normalization methods for the glass features and with the use of Gaussian models, we do not rule out more complex models that require sampling approximations (Markov chain Monte Carlo, Hamiltonian Monte Carlo) or even deep probabilistic models such as variational autoencoders.
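As a simplified starting point (not the hierarchical models to be developed in the Thesis), the following sketch computes a feature-based log likelihood ratio under a two-level Gaussian model, with within-source covariance W and between-source parameters (mu, B) assumed to be estimated from a background database of glass measurements; all data below are synthetic placeholders.

```python
import numpy as np
from scipy.stats import multivariate_normal

def glass_log_lr(control, recovered, mu, B, W):
    """Log-LR under a simplified two-level Gaussian model.

    control, recovered : (n_i, d) arrays of LA-ICP-MS features per fragment
    mu : (d,) between-source mean, B : (d, d) between-source covariance,
    W : (d, d) within-source covariance (estimated from a background database).
    """
    xc, xr = control.mean(axis=0), recovered.mean(axis=0)
    Wc, Wr = W / len(control), W / len(recovered)

    # Same-source hypothesis: both means share a latent source mean theta ~ N(mu, B)
    joint_mean = np.concatenate([mu, mu])
    joint_cov = np.block([[B + Wc, B], [B, B + Wr]])
    log_num = multivariate_normal(joint_mean, joint_cov).logpdf(np.concatenate([xc, xr]))

    # Different-source hypothesis: independent draws from the background population
    log_den = (multivariate_normal(mu, B + Wc).logpdf(xc)
               + multivariate_normal(mu, B + Wr).logpdf(xr))
    return log_num - log_den

# Synthetic placeholder data: 3-dimensional features, 5 fragments per sample
rng = np.random.default_rng(0)
mu, B, W = np.zeros(3), np.eye(3), 0.1 * np.eye(3)
source = rng.multivariate_normal(mu, B)
control = rng.multivariate_normal(source, W, size=5)
recovered = rng.multivariate_normal(source, W, size=5)
print("log LR (same source):", glass_log_lr(control, recovered, mu, B, W))
```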


This work is expected to lead to a scientific publication.


Linguistically-constrained formant-based embeddings for automatic speaker recognition from observable meaningful features

CONTACT: Joaquín González Rodríguez (joaquin.gonzalez@uam.es)

Formant frequencies and bandwidths are highly distinguishing speaker features, and phoneticians strongly rely on their analysis for speaker discrimination, especially in forensic speaker recognition. Based on the supervisor’s previous work (“Linguistically-constrained formant-based I-vectors for automatic speaker recognition”, available at http://dx.doi.org/10.1016/j.specom.2015.11.002), the objective of this project is to develop compact and robust DNN-based utterance embeddings of frame-based formant information from massive amounts of audio data from thousands of speakers, in challenging state-of-the-art evaluation frameworks such as VoxCeleb.
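As a minimal sketch of the frame-level feature side of the problem (not the embedding network itself), the following code estimates formant frequencies and bandwidths from the LPC poles of a speech frame using librosa; the model order and filtering thresholds are illustrative assumptions.

```python
import numpy as np
import librosa

def frame_formants(frame, sr, order=12, max_formants=4):
    """Estimate formant frequencies/bandwidths (Hz) from the LPC poles of one frame."""
    frame = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])   # pre-emphasis
    frame = frame * np.hamming(len(frame))
    a = librosa.lpc(frame, order=order)
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]                # keep one pole per conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)       # pole angle -> frequency in Hz
    bws = -np.log(np.abs(roots)) * sr / np.pi        # pole radius -> 3 dB bandwidth
    keep = (freqs > 90) & (bws < 400)                # crude stability filter
    order_idx = np.argsort(freqs[keep])
    return freqs[keep][order_idx][:max_formants], bws[keep][order_idx][:max_formants]

# Dummy frame: 25 ms of noise at 16 kHz (real usage would take voiced speech frames)
sr = 16000
frame = np.random.randn(int(0.025 * sr))
f, b = frame_formants(frame, sr)
print("Formant frequencies (Hz):", f)
```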


Accurate impulse response estimation of actual acoustic devices and rooms for DNN-based data augmentation

CONTACT: Joaquín González Rodríguez (joaquin.gonzalez@uam.es)

A major challenge of speech, music and audio applications is how to deal with the huge variability and diversity of acoustic conditions present in application-domain audio recordings. Even though DNNs are capable of learning from massive amounts of data, it is impossible to have enough audio examples in every condition. Domain adaptation techniques can help the DNN learning process adapt to new conditions, but the larger benefit has been obtained from data augmentation techniques, where new synthetic data is obtained from clean audio through acoustic simulation, feeding the DNN training with both real and synthetic data. In audio applications, data augmentation is performed by simulating the acoustic propagation through discrete-time convolution with point-to-point room impulse responses (RIRs) estimated from real rooms, adding different amounts and types of background noise. However, the publicly available collections of actual RIRs are very limited, and new in-domain applications require accurate estimation of the impulse responses of both actual devices and rooms. In this project, a whole set-up will be developed for the accurate acquisition of actual impulse responses in new environments, which will later be tested as channel information for data augmentation purposes in DNN-based learning.
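As a minimal sketch of the data augmentation step that such impulse responses would feed (all signals below are synthetic placeholders), clean audio is convolved with an RIR and mixed with noise at a target signal-to-noise ratio:

```python
import numpy as np
from scipy.signal import fftconvolve

def augment(clean, rir, noise, snr_db):
    """Convolve clean audio with a measured RIR and add noise at a target SNR (dB)."""
    reverberant = fftconvolve(clean, rir, mode="full")[: len(clean)]
    # Scale the noise so that the reverberant-speech-to-noise ratio matches snr_db
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise[: len(reverberant)] ** 2) + 1e-12
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return reverberant + gain * noise[: len(reverberant)]

# Placeholder signals; in practice clean, rir and noise would be loaded from files
sr = 16000
clean = np.random.randn(sr * 3)                                        # 3 s of "clean" audio
rir = np.exp(-np.linspace(0, 8, sr // 2)) * np.random.randn(sr // 2)   # toy decaying RIR
noise = np.random.randn(sr * 3)
augmented = augment(clean, rir, noise, snr_db=10)
```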


Modelling phone boundaries in neural end-to-end speech recognition

CONTACT: Doroteo Torre Toledano (doroteo.torre@uam.es)

The current state of the art in speech recognition is dominated by neural end-to-end systems that are trained to predict phones, characters or word fragments directly from the acoustic features. Models such as CTC, attention-based encoder-decoders, Transformers or Conformers are capable of transforming sequences of acoustic feature vectors into sequences of output symbols with high accuracy. All these models produce an approximate alignment between the output symbols and the input features; however, those alignments are not very accurate. This Master’s Thesis proposes to explore the possibility of explicitly modeling phone boundaries in neural end-to-end speech recognition with two goals: to increase the precision of the alignments (which are useful for several applications beyond speech-to-text) and to try to obtain better speech-to-text accuracy by taking advantage of this more detailed modeling.
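As a minimal illustration of the kind of coarse alignment these models provide (and which the Thesis aims to improve), the following sketch reads approximate segment boundaries off per-frame CTC posteriors by greedy decoding; the posteriors, symbol inventory and frame shift below are placeholders.

```python
import torch

def greedy_ctc_segments(log_probs, blank=0, frame_shift=0.01):
    """Turn per-frame CTC posteriors (T, n_symbols) into (symbol, start_s, end_s) segments."""
    best = log_probs.argmax(dim=-1).tolist()     # greedy symbol decision per frame
    segments, prev, start = [], None, 0
    for t, sym in enumerate(best + [None]):      # sentinel flushes the last run
        if sym != prev:
            if prev not in (None, blank):        # close a run of non-blank symbols
                segments.append((prev, start * frame_shift, t * frame_shift))
            prev, start = sym, t
    return segments

# Dummy frame posteriors: 50 frames, 5 output symbols (index 0 is the CTC blank)
log_probs = torch.log_softmax(torch.randn(50, 5), dim=-1)
for symbol, start, end in greedy_ctc_segments(log_probs):
    print(f"symbol {symbol}: {start:.2f}s - {end:.2f}s")
```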


An Evaluation of Bayesian Methods to Standardize Dataset Metrics

CONTACT: Juan Maroñas Molano (jmaronas@prhlt.upv.es)

Note: This MSc Thesis will be co-advised with Daniel Ramos.

The UCI datasets are a common way to evaluate many Bayesian models. However, in contrast to other datasets (MNIST, CIFAR10, ImageNet, CIFAR100), their evaluation procedure is not standardized. The purpose of this MSc Thesis is to code several models from scratch and to standardize a procedure to evaluate the UCI datasets. The main motivation is to provide baseline results for many standard methods and to fix a training-test split procedure, so that future works can be easily compared without the need to re-implement many baselines, which is always a hard task when doing research. These standard methods include linear/logistic regression, deep neural networks, (possibly hierarchical) Bayesian neural networks with different inference methods, Gaussian processes, and deep Gaussian processes (possibly with different inference methods).

The work can be done in PyTorch, TensorFlow or JAX, and the candidate is free to choose (the recommendation is PyTorch, although JAX is also a great option). The candidate is expected to be highly motivated and willing to learn about different aspects of (probabilistic) machine learning. The results from this work will be published in a journal or conference, and a GitHub repository will be set up to provide access to the machine learning community. Familiarity with basic ML concepts such as stochastic gradient descent, dropout, batch normalization, data augmentation, stochastic optimization and standard Gaussian process regression is expected (although this will not be a limitation for being selected). Interested candidates should send an email with a document (no more than half an A4 page) explaining their motivation and why they are doing a Master’s in computer science.
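As a minimal sketch of the kind of fixed, reproducible split protocol the Thesis would standardize (the actual number of splits, test fraction and preprocessing are to be defined), the following code generates seeded train/test splits with per-split feature standardization using scikit-learn; the data are random placeholders standing in for a UCI table.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def make_splits(X, y, n_splits=20, test_size=0.1):
    """Generate reproducible train/test splits with per-split feature standardization."""
    for seed in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        scaler = StandardScaler().fit(X_tr)          # statistics from training data only
        yield scaler.transform(X_tr), scaler.transform(X_te), y_tr, y_te, seed

# Placeholder regression data standing in for a UCI table
rng = np.random.default_rng(0)
X = rng.normal(size=(506, 13))
y = rng.normal(size=506)
for X_tr, X_te, y_tr, y_te, seed in make_splits(X, y, n_splits=3):
    print(f"split {seed}: train {X_tr.shape}, test {X_te.shape}")
```

Fixing the seeds, the split generator and the preprocessing in a shared repository is what would let future works compare against the published baselines without re-running them.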