• Diarization Introduction & EEND Perceiver-based Diarization
    Speaker: Alicia Lozano Díez. Abstract: In this talk, I will present an introduction to the speaker diarization task as well as the latest approaches based on neural networks, such as self-attention end-to-end neural diarization (EEND) with encoder-decoder attractors (EDA) as opposed… Read More
  • Introduction to Reinforcement Learning.
    Speaker: Tamas Endrei. Abstract: Reinforcement learning (RL) has emerged as one of the most fascinating fields of machine learning, providing solutions to challenging problems ranging from complex robotics behaviors to optimizing neural network architectures. Despite its immense potential, RL’s complex… Read More
  • GPU Parallel Computing for Deep Learning
    Speaker: Beltrán Labrador Serrano. Abstract: Large Language Models (LLMs) are transforming natural language processing and are now impacting speech processing. This talk addresses the challenge of training these massive neural networks required to follow this trend. I will present GPU… Read More
  • Rotary Position Embeddings (RoPE) in Transformers.
    Speaker: Doroteo Torre Toledano. Abstract: Since Transformers were proposed in 2017, they have dominated the state-of-the-art in several domains including language modelling, speech processing, and even image processing. Although the main ideas of the original Transformers are essentially kept, there… Read More
  • Large Language Models in Protein Engineering
    Speaker: Natalia Pinto Estéban. Abstract: The intersection of artificial intelligence and protein engineering represents an innovative frontier in scientific exploration. In this presentation, titled ‘Large Language Models in Protein Engineering,’ we delve into the field of advanced language models, focusing… Read More
  • Lute and vihuela in the Renaissance period: instruments and music
    Speaker: Joaquín González Rodríguez. Abstract: In this talk we will present an overview of two extremely popular plucked musical instruments in 16th-century Europe, the lute and its Spanish version, the vihuela. Sharing a common tuning and playing characteristics… Read More
  • DiarizationLM: speaker diarization post-processing with large language models
    Speaker: Laura Herrera Alarcón. Abstract: This paper presents a framework designed to post-process the outputs of speaker diarization systems using large language models (LLM). The framework aims to enhance the readability of the diarized transcripts and reduce the WDER. For… Read More
  • Fairness in Modern ASR Systems
    Speaker: Pilar Fernández Gallego. Abstract: Nowadays ASR (Automatic Speech Recognition) systems have dramatically improved, due both to advances in deep learning and to the collection of large datasets used to train the systems. However, it has been demonstrated in studies… Read More
  • Explainable Machine Learning
    Speaker: Sara Barahona Quirós. Abstract: Explainable Machine Learning (XAI) refers to the development of machine learning models and algorithms that not only make accurate decisions but also provide understandable and interpretable explanations for those predictions. In traditional machine learning, particularly… Read More
  • Generative Artificial Intelligence: A Global Overview
    Speaker: Diego de Benito Gorrón. Abstract: Generative Artificial Intelligence (GenAI) has made a strong impact on the technological landscape, redefining paradigms and possibilities. This talk offers a panoramic view of GenAI, with a specific focus on Large Language Models (LLMs)… Read More
  • Robust Wake-up Word by Two-stage Multi-resolution Ensembles
    Speaker: William Fernando López Gavilánez. Abstract: Voice-based interfaces rely on a wake-up word mechanism to initiate communication with devices. However, achieving a robust, energy-efficient, and fast detection remains a challenge. This paper addresses these real production needs by enhancing data… Read More
  • Towards automatic inspection of nuclear fuel elements in spent fuel storage with AI tools.
    Speaker: Sergio Segovia González. Abstract: A new way to automate the inspection of nuclear fuel elements in spent fuel storage by processing video and audio signals. For the video signal, a custom database has been developed, including images from several nuclear facilities… Read More
  • FLIP (Fitness Landscape Inference for Proteins)
    Speaker: Natalia Pinto Estéban. Abstract: Machine learning is growing in significance across various research domains. One of these domains is biology, specifically focusing on protein engineering and directed evolution techniques. This presentation is grounded in the FLIP paper (Fitness Landscape… Read More
  • Knowledge Distillation to Compress and Accelerate Large Models
    Speaker: Laura Herrera Alarcón. Abstract: These papers present the idea of Knowledge Distillation, a method to compress and accelerate large models with high computational and storage cost. Thanks to this, these models can be used for real-time applications or in… Read More
  • An introduction to Spiking Neural Networks (SNNs) and neuromorphic computing
    Speaker: Doroteo Torre Toledano. Abstract: This talk is an overview of Spiking Neural Networks, a biologically inspired type of neural network that outputs digital spikes over continuous time in an asynchronous way, instead of continuous values at frame-by-frame synchronous times… Read More
  • A Systematic Study on the Use of the Log-Likelihood Ratio Cost in Forensic Science
    Speaker: Daniel Ramos Castro. Abstract: It is increasingly common in forensic science to report evidential findings in terms of a likelihood ratio (LR). Such analyses are often supported by (semi-)automated LR systems based on statistical methods, which allows for validation… Read More
  • Language Models in Protein Engineering
    Speaker: Joaquín González Rodríguez. Abstract: The sequences of amino acids describing a protein can be efficiently handled by language models. In this talk, present and future applications of Transformer-based protein Language Models are surveyed, focusing on databases, benchmarks and models already… Read More
  • Automatic Wheeze Segmentation Using Harmonic-Percussive Source Separation and Empirical Mode Decomposition
    Speaker: Miguel Ángel Martínez Pay. Abstract: Based on https://ieeexplore.ieee.org/document/10051156. Wheezes, a respiratory anomaly in patients with respiratory conditions, are significant for clinical assessment, particularly in gauging bronchial obstruction. While conventional auscultation is the norm for wheeze analysis, recent years emphasize… Read More
  • Personalized keyword spotting detection: Research internship @ Google
    Speaker: Beltrán Labrador Serrano. Abstract: Keyword spotting systems are used in a variety of applications, such as smart speakers and voice assistants. However, these systems can be challenged by diverse accents, age groups, and speaking conditions. In this talk, I will… Read More
  • Sound Event Detection with Conformer: the AUDIAS system for DCASE 2023
    Speaker: Sara Barahona Quirós. Abstract: The Conformer architecture has achieved state-of-the-art results in several tasks, including automatic speech recognition and automatic speaker verification. However, its utilization in sound event detection and in particular in the DCASE Challenge Task 4 has… Read More
  • Deployment of KWS models: audio features optimization and streaming mode
    Speaker: William Fernando López Gavilánez. Abstract: The deployment process of Keyword Spotting (KWS) models depends on the target hardware; it normally includes merging components in a black box, binarization, quantization, and/or mobile optimization. In addition, while processing a continuous stream… Read More
  • Lines of research in the field of acoustic events detection
    Speaker: Sergio Segovia González. Abstract: Within the development of the doctoral thesis, whose objective is to work in the field of acoustic event detection, several lines of research have been pursued, such as using the… Read More
  • Fairness in the most popular ASR systems
    Speaker: Pilar Fernández Gallego. Abstract: Nowadays ASR (Automatic Speech Recognition) systems have dramatically improved, due both to advances in deep learning and to the collection of large datasets used to train the systems. However, it has been demonstrated that some… Read More
  • VoxCeleb-Spain: design, acquisition and evaluation with deep neural networks of a database of Spanish celebrity voices
    Speaker: Manuel Otero González. Abstract: This work presents a new database, VoxCeleb-Spain, captured following protocols analogous to those of the well-known VoxCeleb database, but using YouTube(TM) videos of celebrities of Spanish nationality. The evaluation of the database through various experiments is also… Read More
  • GuitarSet: A Dataset for Guitar Transcription
    Speaker: Diego de Benito Gorrón. Abstract: Based on https://guitarset.weebly.com/uploads/1/2/1/6/121620128/xi_ismir_2018.pdf. The guitar is a popular instrument for a variety of reasons, including its ability to produce polyphonic sound and its musical versatility. The resulting variability of sounds, however, poses significant challenges… Read More
  • Representing evidence for Bayesian updating: compositional evidence, privacy and calibration
    Speaker: Paul-Gauthier Noé. Abstract: Attribute privacy in multimedia technology aims to hide only one or a few personal characteristics, or attributes, of an individual rather than the full identity. To give a few examples, these attributes can be the sex,… Read More
  • Detection of abnormalities in electrocardiograms with 2 sensors using machine learning
    Speaker: Ana Molina Conesa. Abstract: This talk is based on the Physionet Challenge 2021, in which participants aim to design and implement an algorithm capable of automatically identifying any cardiac abnormalities present in electrocardiogram (ECG) recordings with 12, 6, 4,… Read More
  • Anomaly detection in 12-lead electrocardiograms using machine learning
    Speaker: Miguel González Rodríguez. Abstract: The Physionet Challenge 2021 is presented. The goal is to classify 27 types of cardiac anomalies from electrocardiograms using convolutional neural networks (CNN). The challenge database consists of over 30,000 patient records, making it one… Read More
  • Precomputed Sound Propagation for Virtual Reality & Gaming
    Speaker: Joaquín González Rodríguez. Abstract: This talk is based on: Parametric Wave Field Coding for Precomputed Sound Propagation (ACM Transactions on Graphics, Vol. 33, No. 4, Article 38, Publication Date: July 2014) Parametric Directional Coding for Precomputed Sound Propagation (ACM… Read More
  • Breath cycle detection in respiratory audios
    Speaker: Miguel Ángel Martínez Pay. Abstract: Neural networks applied to the detection of acoustic events in respiratory audios. Introduction to the ICBHI 2017 database dedicated to the classification of respiratory cycles into “normal”, “with crackles”, “with wheezes”, “with both”. Main… Read More
  • PhysioNet Challenge 2016: Classification of Heart Sound Recordings
    Speaker: Javier Galán Fernández. Abstract: Cardiovascular diseases are the leading cause of death in the world, accounting for 32% of all deaths recorded throughout the year. The 2016 PhysioNet challenge aimed to encourage the development of algorithms to classify heart… Read More
  • How speaker diarization evolved recently: from clustering to end-to-end approaches
    Speaker: Alicia Lozano Díez. Abstract: Speaker diarization systems aim to segment a multi-speaker audio recording according to speaker changes, providing the time stamps of the activity of each speaker, including segments where nobody speaks and those where more than one… Read More
  • VoxCeleb-Spain: Design, Acquisition and Preliminary Evaluation
    Speaker: Manuel Otero González. Abstract: Description of VoxCeleb and its latest Challenges (2019-2022), elaboration and capture of an audio database of celebrities of Spanish nationality, and preliminary evaluation of a pre-trained system with the acquired data.
  • MusicLM: Generating music from text
    Speaker: Laura Herrera Alarcón. Abstract: Based on https://arxiv.org/pdf/2301.11325.pdf. This paper presents a new model for generating high-fidelity music from text descriptions. It combines SoundStream, w2v-BERT and MuLan, three models that make it possible to obtain temporal coherence and high-quality audio of… Read More
  • Iterative pseudo-forced alignment tool
    Speaker: W. Fernando López Gavilánez. Abstract: High-quality data labeling from specific domains is costly and human time-consuming. In this work, we propose an iterative pseudo-forced alignment algorithm for long audio files with low-quality transcriptions. The alignments are iteratively done by… Read More
  • Differentially Private Fine-Tuning for Language Models
    Speaker: Beltrán Labrador Serrano. Abstract: Based on https://arxiv.org/abs/2110.06500. In this talk we will discuss the paper Differentially Private Fine-Tuning for Language Models, where the authors give simpler, sparser, and faster algorithms for differentially private fine-tuning of large-scale pre-trained language models,… Read More
  • Conformer Architecture for Sound Event Detection (DCASE) 
    Speaker: Sara Barahona Quirós. Abstract: Sound Event Detection is the task focused on automating the human ability to recognize sound events in the environment. Over the last years, the creation of evaluations such as the Detection and Classification… Read More
  • MixMatch: A Holistic Approach to Semi-Supervised Learning
    Speaker: Diego de Benito Gorrón. Abstract: This talk is an overview of a NeurIPS 2019 paper by David Berthelot et al. (Google Research) that proposes a novel method for semi-supervised learning: MixMatch. “Semi-supervised learning has proven to be a powerful… Read More
  • Highly accurate protein structure prediction with AlphaFold
    Speaker: Juan Ignacio Álvarez Trejos. Abstract: Based on https://www.nature.com/articles/s41586-021-03819-2. Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort, the structures of around 100,000 unique proteins have been… Read More
  • Whisper: Robust Speech Recognition via Large-Scale Weak Supervision
    Speaker: Doroteo Torre Toledano. Abstract: Very recently (in Sept 2022) OpenAI has made freely available a speech recognition neural network called Whisper. One of the main differences with respect to the current state of the art is the use of… Read More
  • Dynamic Bayesian Networks for Temporal Prediction of Chemical Radioisotope Levels in Nuclear Power Plant Reactors
    Speaker: Daniel Ramos Castro. Abstract: Radiation dose in nuclear power plant reactors is known to be dominated by the presence of radioisotopes in the primary loop of the reactor. In order to strictly control it in normal operation (e.g., cleaning… Read More
  • Automatic adventitious respiratory sound analysis: A systematic review
    Speaker: Miguel Ángel Martínez Pay. Abstract: Based on https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0177926. Automatic detection or classification of adventitious sounds is useful to assist physicians in diagnosing or monitoring diseases such as asthma, Chronic Obstructive Pulmonary Disease, and pneumonia. This article contains a compilation… Read More
  • Training Speaker Recognition Systems with Limited Data
    Speaker: Guillermo Recio. Abstract: Based on paper https://www.isca-speech.org/archive/pdfs/interspeech_2022/vaessen22_interspeech.pdf. This work considers training neural networks for speaker recognition with smaller datasets compared to contemporary work. For this purpose, they propose three subsets of the VoxCeleb2 dataset. Each of these subsets contains… Read More
  • Exploring sequence-to-sequence transformer-transducer models for keyword spotting
    Speaker: Beltrán Labrador Serrano. Abstract: Beltrán’s final Google research internship presentation. This presentation introduces a transformer-transducer keyword spotting system that simultaneously optimizes ASR and keyword spotting losses using a sequence to sequence RNN-T loss. Each loss is further balanced using… Read More
  • Perceiver: General Perception with Iterative Attention
    Speaker: Juan Ignacio Álvarez Trejos. Abstract: Biological systems perceive the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, proprioception, etc. The perception models used in deep learning on the other hand are designed for… Read More
  • Continual learning for recurrent neural networks
    Speaker: Doroteo Torre Toledano. Abstract: The current trend in machine learning assumes that there is a fixed distribution of incoming data, so that a fixed model can be learned to map incoming data to output classes. However, real applications in… Read More
  • Source Separation for Sound Event Detection in Domestic Environments Using Jointly Trained Models
    Speaker: Diego de Benito Gorrón. Abstract: Sound Event Detection and Source Separation are closely related tasks: whereas the first aims to find the time boundaries of acoustic events inside a recording, the goal of the latter is to isolate each… Read More
  • Self-supervised Wav2Vec2 audio representations for speaker recognition
    Speaker: Laura Herrera. Abstract: In this Final Degree Project, different speech representations, extracted by unsupervised learning, have been used to train a speaker recognition system. In particular, Wav2Vec2.0 and WavLM features have been used as a novelty. The Wav2Vec2.0 features… Read More
  • End-to-end deep learning models for air traffic control speech recognition
    Speaker: Ana Belén Fernández Cordero. Abstract: For many years, Air Traffic Controllers have had to manually type the information they received and transmitted to pilots into the electronic flight strip systems. This time consuming activity contributed to a significant increase… Read More
  • Efficient Transformers for End-to-End Neural Speaker Diarization
    Speaker: Sergio Izquierdo. Abstract: The recently proposed End-to-End Neural speaker Diarization framework (EEND) handles speech overlap and speech activity detection natively. While extensions of this work have reported remarkable results in both two-speaker and multi-speaker diarization scenarios, these come at… Read More
  • Sound Event Detection in a large-scale audio dataset with multi-resolution neural networks
    Speaker: Sara Barahona Quirós. Abstract: Sound event detection is the task that aims to automate the human ability to recognize sound events in the environment by their particular acoustic information. For this purpose, deep learning techniques are employed to build… Read More
  • A Speaker Verification Backend with Robust Performance across Conditions
    Speaker: Joaquin Gonzalez-Rodriguez. Abstract: Presentation of the paper in https://arxiv.org/abs/2102.01760: L. Ferrer et al. “A Speaker Verification Backend with Robust Performance across Conditions”, 2021. Abstract of the paper (reproduced from the preprint): In this paper, we address the problem of… Read More
  • Linear-Gaussian Bayesian Network Applications to Forensic Chemistry
    Speaker: Elías Hernandis Prieto. Abstract: Forensic evidence evaluation using the likelihood ratio framework requires knowledge about the probability distribution of the data. For evaluating samples of glass remains, this translates to obtaining the joint probability distribution of the relative concentrations… Read More
  • Improvements in deep learning semi-supervised model selection for the optimization of different Sound Event Detection metrics
    Speaker: Cristina Moratilla. Abstract: Sound Event Detection is one of the most developed fields in the area of audio signal processing in the last decades. The objective of such detection is to locate the start and end instants of audio… Read More
  • Bias analysis in speaker recognition systems based on DNN embeddings
    Speaker: Almudena Aguilera. Abstract: In this study we will evaluate the discriminatory behaviours that are generated in speaker recognition systems, specifically those that verify whether two audios belong to the same speaker or not. These systems work by extracting the… Read More
  • MetaAudio: A Few-Shot Audio Classification Benchmark
    Speaker: David Martín Gutiérrez. Abstract: Currently available benchmarks for few-shot learning (machine learning with few training examples) are limited in the domains they cover, primarily focusing on image classification. This work aims to alleviate this reliance on image-based benchmarks by… Read More
  • Speaker Diarization, X-vectors with Encoder-Decoder based attractors
    Speaker: Juan Ignacio Álvarez Trejos. Abstract: X-Vectors are speaker embeddings that emerge to address the speaker recognition task, surprisingly outperforming i-vectors in most speaker tasks. It is proposed to take advantage of the information contained in these embeddings by using… Read More
  • Gaussianization of LA-ICP-MS Features to Improve Calibration in Forensic Glass Comparison
    Speaker: Pablo Ramírez Hereza. Abstract: The forensic comparison of glass task aims to compare a glass sample of unknown source with a control glass sample of known source. In this work, we use multielemental features from laser ablation inductively coupled… Read More
  • Article review: “Objectifying evidence evaluation for gunshot residue comparisons using machine learning on criminal case data”
    Speaker: Daniel Ramos Castro. Abstract: Based on https://doi.org/10.1016/j.forsciint.2022.111293. “Comparative gunshot residue analysis addresses relevant forensic questions such as ‘did suspect X fire shot Y?’. More formally, it weighs the evidence for hypotheses of the form H1: gunshot residue particles found… Read More
  • Assessing Calibration in the regression setting
    Speaker: Sergio Álvarez Balanya. Abstract: Calibration is a desirable property of pattern recognition systems, especially when their predictions are going to be used to make decisions. In our group, we are used to dealing with calibration in classification tasks such… Read More
  • Call-sign recognition and understanding for noisy air-traffic transcripts using surveillance information
    Speaker: Ana Belén Fernández Cordero. Abstract: Air traffic control (ATC) relies on communication via speech between pilot and air-traffic controller (ATCO). The call-sign, as unique identifier for each flight, is used to address a specific pilot by the ATCO. Extracting… Read More
  • AVASpeech-SMAD: A speech and music activity detection database with label co-occurrence
    Speaker: Guillermo Recio Martín. Abstract: AVASpeech is a publicly available dataset created in 2018 to contribute to the task of speech activity detection (SAD). This dataset contains three different types of audio segments: clean speech, speech co-occurring with music… Read More
  • Conformer-based sound event detection with semi-supervised learning and data augmentation
    Speaker: Sara Barahona Quirós. Abstract: This paper presents a Conformer-based sound event detection (SED) method, which uses semi-supervised learning and data augmentation. The proposed method employs Conformer, a convolution-augmented Transformer that is able to exploit local features of audio data… Read More
  • Speaker Diarization with Region Proposal Network
    Speaker: Sergio Izquierdo del Álamo. Abstract: Speaker diarization is an important pre-processing step for many speech applications, and it aims to solve the “who spoke when” problem. Although the standard diarization systems can achieve satisfactory results in various scenarios, they… Read More
  • Conversational Agents for Health Care
    Speaker: Giuliano Lazzara. Abstract: Brief that focuses on people’s perception of Conversational Agents and proposes these technologies as a tool to deal with underestimated mental issues such as depression and anxiety. Referring to experiments done with “Woebot”, an automated conversational… Read More
  • data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
    Speaker: Sergio Segovia. Abstract: The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup using a standard Transformer architecture. Instead of predicting modality-specific targets such… Read More
  • Data Augmentation for Decoupled Calibration of Deep Neural Network Classifiers
    Speaker: Sergio Márquez Carrero. Abstract: Modern Deep Neural Networks (DNN) have significantly outperformed those employed over a decade ago in terms of accuracy. Nonetheless, the outputs generated by these models are poorly calibrated, causing substantial issues in a variety of… Read More
  • Connectionist Temporal Classification (CTC) Speech Segmentation
    Speaker: W. Fernando López Gavilánez. Abstract: Motivated by the lack of high-quality labeled data for specific scenarios, such as emergencies in the home environment, we explored a CTC-segmentation method to generate a specific-purpose speech dataset. The project seeks the quality improvement of… Read More
  • BigSSL: Large-Scale Semi-Supervised Learning for ASR
    Speaker: Laura Herrera Abstract: This paper deals with results obtained on very large automatic speech recognition models. A large amount of labelled data is not always available and sometimes they do not generalize enough. Consequently, the authors propose to use pre-trained… Read More
  • Efficient Neural Approaches for Automatic Speech Recognition
    Speaker: Doroteo Torre Toledano Abstract: Many different end-to-end neural approaches have been proposed in the last years in the field of automatic speech recognition (ASR). However, most of the research available compares systems only in terms of accuracy (word error… Read More
  • Structured Output Learning
    Speaker: María Pilar Fernández Rodríguez Abstract: Speech applications dealing with conversations require not only recognizing the spoken words, but also determining who spoke when, the language, punctuation, capitalization… This is typically addressed by merging the outputs… Read More
  • Voxceleb Experiment: fairness
    Speaker: Almudena Aguilera Abstract: The experiment is based on the dataset from Voxceleb [1], using the two pre-trained models. The main idea of these experiments was to study the fairness problems in different demographic groups present in the database… Read More
  • Semi-Supervised Music Tagging Transformer
    Speaker: David Martín Abstract: Music Tagging Transformer (MTT) was recently released at the ISMIR 2021 Conference as one of the fastest-emerging deep learning approaches for Music Information Retrieval. It consists of a semi-supervised approach where the model captures… Read More
  • Encoder-Decoder Based Attractor Calculation for End-to-End Neural Diarization
    Speaker: Alicia Lozano Díez Abstract: In this talk, we will deeply review the algorithms behind end-to-end systems for speaker diarization based on neural networks. In particular, we will describe how the encoder-decoder part of the model calculates “attractors” that capture… Read More
  • Unsupervised Sound Separation Using Mixture Invariant Training
    Speaker: Diego de Benito Gorrón Abstract: In recent years, rapid progress has been made on the problem of single-channel sound separation using supervised training of deep neural networks. In such supervised approaches, a model is trained to predict the component… Read More
  • relMix: An open source software for DNA mixtures with related contributors
    Speaker: Elías Hernández Abstract: DNA testing has been a major breakthrough in the judicial context and is often regarded as the definitive evidence for convicting or acquitting a defendant. The results of a DNA test… Read More
  • Improving Fairness in Speaker Recognition
    Speaker: Almudena Aguilera Abstract: Speaker Recognition Systems aim to automatically recognize the identity of an individual from a recording of his/her speech or voice. Despite the progress of these systems in terms of accuracy, we must ask ourselves: “What happen… Read More
  • Speech Enhancement for Wake-up Word detection in Voice Assistants
    Speaker: William Fernando López Abstract: Wake-up-word (WuW) detection is a fundamental component in voice assistants. Undesired activation of the device is often due to external noises such as background conversations, TV or music. In Telefónica we have been working on… Read More
  • Unsupervised pre-training for learning speech representations: Wav2Vec and Wav2Vec2.0
    Speaker: Laura Herrera Abstract: These papers (https://arxiv.org/pdf/1904.05862.pdf and https://arxiv.org/pdf/2006.11477.pdf) explore unsupervised learning from raw audio for speech recognition. A large amount of labelled data is not always available, consequently wav2vec uses a causal convolutional network trained with large amounts of unlabelled… Read More
  • Large-scale pre-training of End-to-End Multi-Talker ASR for meeting Transcription with Single Distant Microphone
    Speaker: María Pilar Fernández Gallego Abstract: Transcribing meetings containing overlapped speech with only a single distant microphone (SDM) has been one of the most challenging problems for automatic speech recognition (ASR). While various approaches have been proposed, all previous studies… Read More
  • Selective Kernel Networks
    Speaker: Sergio Segovia Abstract: It is well-known in the neuroscience community that the receptive field sizes of visual cortical neurons are modulated by the stimulus, which has been rarely considered in constructing CNNs. We propose a dynamic selection mechanism in… Read More
  • Calibration of Multiclass Probabilistic Classifiers
    Speaker: Sergio Márquez Abstract: Today’s Deep Neural Networks (DNNs) are used for numerous classification tasks, achieving high performance in terms of accuracy. In some cases, probabilistic classifiers, which assign a confidence value to each of the predictions made, are used.… Read More
  • Deep Learning Models with Self-Attention for the Detection of Audio Events
    Speaker: Julio González Abstract: This talk is a presentation of the BSc Thesis “Modelos de aprendizaje profundo con auto-atención para detección de eventos de audio”. It describes the implementation of the Transformer and Conformer neural networks and presents the results of the test… Read More
  • End-to-end Speaker Diarization
    Speaker: Alicia Lozano Díez Abstract: In this talk, I will describe new approaches to the task of speaker diarization based on end-to-end neural networks, which present several advantages with respect to traditional systems based on clustering of speaker embeddings. I… Read More
  • Normalizing Flows for calibration of multiclass probabilistic classifiers
    Speaker: Sergio Márquez Abstract: Today’s Deep Neural Networks (DNNs) have achieved high performance in accuracy, far exceeding the ones used ten years ago. Nevertheless, the outputs provided by these modern networks are less well calibrated, becoming a major problem in… Read More
  • Transfer Learning from computer vision to audio event detection
    Speaker: Sergio Segovia Abstract: A brief summary of my lecture: as part of my doctorate, we are exploring the idea of applying transfer learning from the computer vision domain to the detection of acoustic events. The… Read More
  • Modeling Uncertainty with Bayesian Neural Networks
    Speaker: Sergio Álvarez Abstract: Deep Neural Networks (DNNs) have revolutionized many fields in pattern recognition like speech recognition and object detection. There are, however, some applications in which Neural Networks struggle to offer competitive performance, mainly sensitive ones. These applications… Read More
  • New loss function to improve calibration with mixup
    Speaker: Juan Maroñas Molano Abstract: Deep Neural Networks (DNN) represent the state of the art in many tasks. However, due to their overparameterization, their generalization capabilities are in doubt and still a field under study. Consequently, DNN can overfit and… Read More
  • Self-supervised deep learning approaches for speaker recognition
    Speaker: Joaquín González Abstract: In this talk I will review the thesis “Self-supervised deep learning approaches for speaker recognition” presented by Umair Khan at the UPC (Universidad Politecnica de Cataluña) in January 2021, directed by Javier Hernando. In this thesis… Read More
  • Data augmentation for improved robustness against packet losses in ASR
    Speaker: María Pilar Fernández Gallego Abstract: Nowadays a large number of companies record conversations, calls, sales or even meetings, in many cases to comply with the current legislation. Apart from the legal need, these recordings constitute an invaluable source of… Read More
  • End-to-end Query-by-example Spoken Term Detection
    Speaker: Juan Ignacio Álvarez Trejos Abstract: Query-by-example Spoken Term Detection (QbE-STD) is a key technology to harness the large amount of audiovisual content that is being stored and generated nowadays. Using audio example queries for STD has several advantages such as… Read More
  • AUDIAS-UAM System for the Albayzin 2020 Speech to Text Challenge
    Speaker: Beltrán Labrador Serrano Abstract: This presentation describes the system submitted by the AUDIAS-UAM team for the Albayzin 2020 Speech to Text Challenge. Our system is an end to end Transformer-based system built using ESPnet Toolkit. The acoustic model is… Read More
  • Multi-resolution Sound Event Detection
    Speaker: Diego de Benito Gorrón Abstract: The Sound Event Detection task aims to determine the temporal locations of acoustic events in audio clips. Over recent years, this field has gained increasing relevance due to the introduction of datasets… Read More
  • BUT system for the Short-duration Speaker Verification challenge 2020
    Speaker: Alicia Lozano Díez Abstract: In this talk, I present the Brno University of Technology (BUT) system submitted for the text-dependent task of the Short-duration Speaker Verification challenge 2020, which was the best performing system for this task. We explored… Read More
  • Measuring Calibration in Deep Learning
    Speaker: Daniel Ramos Castro Abstract: In this talk, we will present the article Nixon et al. 2020, “Measuring Calibration in Deep Learning”, published in CVPR Workshops 2020. In this paper, the current most popular measure of calibration for deep learning,… Read More