- Automatic Speech Recognition in Dialectal Data (COSER). Speaker: Clara Adsuar Ávila. Abstract: In this project, we address the importance of enhancing the accessibility and usefulness of Deep Learning technologies for non-standard speakers. From a linguistic perspective, rural Spanish areas are rich in dialectal variety. However, most technology…
- Emotion recognition in Spanish audio. Speaker: Manuel Otero González. Abstract: This talk will explain the task of emotion recognition in Spanish audio, presenting the most advanced state-of-the-art approaches, such as Wav2Vec2 and W2V-Bert. In addition, it will introduce the EmoSPeech challenge, whose…
- State of the Art in Sound Event Detection and DCASE Evaluations. Speaker: Doroteo Torre Toledano. Abstract: In this talk I will review the most recent trajectory of the AUDIAS group in the field of Sound Event Detection (SED), highlighting our participations in DCASE evaluations (Task 4) from 2020 to 2023. Then,…
- Large Language Models: From Theory to Practice in Text Classification. Speaker: Miguel Ángel Martínez Pay. Abstract: This work presents a comprehensive overview of Large Language Models (LLMs), from their theoretical framework to practical applications in text classification. It compares the effectiveness of two key approaches: fine-tuning embeddings of smaller models…
- Integration of Emotional Information in Speaker Recognition Systems. Speaker: Arturo Domínguez Santos. Abstract: This Master’s thesis addresses the challenge of investigating how emotions affect speaker verification and proposes a system that integrates this emotional variability to try to improve accuracy. The focus is on the speaker’s emotions, which has traditionally…
- Exploring Speech Foundation Models for End-to-End Speaker Diarization. Speaker: Laura Herrera Alarcón. Abstract: In this Master’s Thesis the use of pre-trained models for the diarization task has been studied in order to exploit their ability to extract robust and discriminative features. In particular, the WavLM model has been combined with…
- Interpretation of fingerprint evidence with likelihood ratios (LRs – Likelihood ratios). Speaker: Joaquín González Rodríguez. Abstract: The forensic fingerprint identification process based on the ACE-V method, widely implemented, makes absolute identification or exclusion decisions that depend on opinions that vary from expert to expert (for example, whether we consider an observed…
- SGE & CCC Architecture – Introduction for Beginners. Speaker: Adrián Aranda Marcos. Abstract: A simple technical introduction to using SGE (Son of Grid Engine) with the AUDIAS servers and the CCC (UAM’s Central Computing Center).
- Stabilising Reinforcement Learning with Past Action-State Representation Learning. Speaker: Tamas Endrei. Abstract: Although deep reinforcement learning (DRL) deals with sequential decision-making problems, temporal information representation is absent from state-of-the-art actor-critic algorithms. The reliance on only the current timestep information causes instability in concurrent actions. Furthermore, the over-reliance on…
- Study of the predictive capacity of the efficacy of platelet-rich plasma (PRP) treatments in joint injuries. Speaker: Berta Caunedo Castro. Abstract: This Final Degree Project evaluates Platelet-Rich Plasma (PRP) therapy as an alternative to traditional treatments for knee osteoarthritis, a prevalent joint condition. PRP uses regenerative growth factors from the patient’s blood, but its variability complicates…
- Enhancing Sound Event Detection and Speaker Verification employing weak supervision. Speaker: Sara Barahona Quirós. Abstract: In this seminar, we will explore approaches for training acoustic event detection and speaker verification systems employing limited labels. Specifically, for the first task, we will explain the optimization process of a system based on…
- Leveraging Speaker Embeddings in End-to-End Neural Diarization for Two-Speaker Scenarios. Speaker: Juan Ignacio Álvarez Trejos. Abstract: This presentation covers the work presented at Odyssey 2024, focusing on speaker diarization in two-speaker scenarios. End-to-end neural speaker diarization systems are designed to handle overlapping speech while accurately distinguishing between speakers. In this…
- Transformers for Binding Prediction of Hypoxia-Induced Factors. Speaker: Manuel Fernando Mollón Laorca. Abstract: Hypoxia-inducible factors (HIFs) are proteins that play a crucial role in the cellular response to low oxygen levels. Accurate prediction of the binding of these factors to their target DNA is essential for understanding…
- Whisper-based spoken term detection systems for search on speech ALBAYZIN evaluation challenge. Speaker: Javier Tejedor Noguerales. Abstract: The vast amount of information stored in audio repositories makes necessary the development of efficient and automatic methods to search on audio content. In that direction, search on speech (SoS) has received much attention in…
- Road map for Albayzin Diarization Challenge 2024. Speaker: Jérémie Touati. Abstract: The diarization challenge of the 2024 Albayzin evaluation stands out for its various difficulties. The recordings, which come from databases of Spanish radio and television programs, can last up to several hours; they contain an undetermined and…
- Introduction to the Language-Based Audio Retrieval task. Speaker: Manuel Otero. Abstract: Language-Based Audio Retrieval is a task of the DCASE Challenge, which is based on the retrieval of audio information from natural language descriptions. Two of the best performing approaches in the state of the art will…
- Data Augmentation for Respiratory Cycle Classification. Speaker: Miguel Ángel. Abstract: Analysing respiratory audios in order to detect and classify adventitious respiratory sounds is of vital importance for the development of continuous monitoring tools for patients with respiratory diseases. The ICBHI 2017 database is the most widely…
- Diarization Introduction & EEND Perceiver-based Diarization. Speaker: Alicia Lozano Díez. Abstract: In this talk, I will present an introduction to the speaker diarization task as well as the latest approaches based on neural networks, such as self-attention end-to-end neural diarization (EEND) with encoder-decoder attractors (EDA) as opposed…
- Introduction to Reinforcement Learning. Speaker: Tamas Endrei. Abstract: Reinforcement learning (RL) has emerged as one of the most fascinating fields of machine learning, providing solutions to challenging problems ranging from complex robotics behaviors to optimizing neural network architectures. Despite its immense potential, RL’s complex…
- GPU Parallel Computing for Deep Learning. Speaker: Beltrán Labrador Serrano. Abstract: Large Language Models (LLMs) are transforming natural language processing and are now impacting speech processing. This talk addresses the challenge of training these massive neural networks required to follow this trend. I will present GPU…
- Rotary Position Embeddings (RoPE) in Transformers. Speaker: Doroteo Torre Toledano. Abstract: Since Transformers were proposed in 2017, they have dominated the state-of-the-art in several domains including language modelling, speech processing, and even image processing. Although the main ideas of the original Transformers are essentially kept, there…
- Large Language Models in Protein Engineering. Speaker: Natalia Pinto Estéban. Abstract: The intersection of artificial intelligence and protein engineering represents an innovative frontier in scientific exploration. In this presentation, titled ‘Large Language Models in Protein Engineering,’ we delve into the field of advanced language models, focusing…
- Lute and vihuela in the Renaissance period: instruments and music. Speaker: Joaquín González Rodríguez. Abstract: In this talk we will present an overview of two extremely popular plucked musical instruments in 16th-century Europe, the Lute and its Spanish version, the Vihuela. Sharing a common tuning and playing characteristics…
- DiarizationLM: speaker diarization post-processing with large language models. Speaker: Laura Herrera Alarcón. Abstract: This paper presents a framework designed to post-process the outputs of speaker diarization systems using large language models (LLM). The framework aims to enhance the readability of the diarized transcripts and reduce the WDER. For…
- Fairness in Modern ASR Systems. Speaker: Pilar Fernández Gallego. Abstract: Nowadays ASR (Automatic Speech Recognition) systems have dramatically improved, due both to advances in deep learning and to the collection of large datasets used to train the systems. However, it has been demonstrated in studies…
- Explainable Machine Learning. Speaker: Sara Barahona Quirós. Abstract: Explainable Machine Learning (XAI) refers to the development of machine learning models and algorithms that not only make accurate decisions but also provide understandable and interpretable explanations for those predictions. In traditional machine learning, particularly…
- Generative Artificial Intelligence: A Global Overview. Speaker: Diego de Benito Gorrón. Abstract: Generative Artificial Intelligence (GenAI) has made a strong impact on the technological landscape, redefining paradigms and possibilities. This talk offers a panoramic view of GenAI, with a specific focus on Large Language Models (LLMs)…
- Robust Wake-up Word by Two-stage Multi-resolution Ensembles. Speaker: William Fernando López Gavilánez. Abstract: Voice-based interfaces rely on a wake-up word mechanism to initiate communication with devices. However, achieving a robust, energy-efficient, and fast detection remains a challenge. This paper addresses these real production needs by enhancing data…
- Towards automatic inspection of nuclear fuel elements in spent fuel storage with AI tools. Speaker: Sergio Segovia González. Abstract: A new way to automate the inspection of nuclear fuel elements in spent fuel storage by processing video and audio signals. For the video signal, a custom database has been developed, including images from several nuclear facilities…
- FLIP (Fitness Landscape Inference for Proteins). Speaker: Natalia Pinto Estéban. Abstract: Machine learning is growing in significance across various research domains. One of these domains is biology, specifically focusing on protein engineering and directed evolution techniques. This presentation is grounded in the FLIP paper (Fitness Landscape…
- Knowledge Distillation to Compress and Accelerate Large Models. Speaker: Laura Herrera Alarcón. Abstract: These papers present the idea of Knowledge Distillation, a method to compress and accelerate large models with high computational and storage cost. Thanks to this, these models can be used for real-time applications or in…
- An introduction to Spiking Neural Networks (SNNs) and neuromorphic computing. Speaker: Doroteo Torre Toledano. Abstract: This talk is an overview of Spiking Neural Networks, a biologically inspired type of neural network that outputs digital spikes over continuous time in an asynchronous way, instead of continuous values at frame-by-frame synchronous times…
- A Systematic Study on the Use of the Log-Likelihood Ratio Cost in Forensic Science. Speaker: Daniel Ramos Castro. Abstract: It is increasingly common in forensic science to report evidential findings in terms of a likelihood ratio (LR). Such analyses are often supported by (semi-)automated LR systems based on statistical methods, which allows for validation…
- Language Models in Protein Engineering. Speaker: Joaquín González Rodríguez. Abstract: The sequences of amino acids describing a protein can be efficiently handled by language models. In this talk, present and future applications of Transformer-based protein Language Models are surveyed, focusing on databases, benchmarks and models already…
- Automatic Wheeze Segmentation Using Harmonic-Percussive Source Separation and Empirical Mode Decomposition. Speaker: Miguel Ángel Martínez Pay. Abstract: Based on https://ieeexplore.ieee.org/document/10051156. Wheezes, a respiratory anomaly in patients with respiratory conditions, are significant for clinical assessment, particularly in gauging bronchial obstruction. While conventional auscultation is the norm for wheeze analysis, recent years emphasize…
- Personalized keyword spotting detection: Research internship @ Google. Speaker: Beltrán Labrador Serrano. Abstract: Keyword spotting systems are used in a variety of applications, such as smart speakers and voice assistants. However, these systems can be challenged by diverse accents, age groups, and speaking conditions. In this talk, I will…
- Sound Event Detection with Conformer: the AUDIAS system for DCASE 2023. Speaker: Sara Barahona Quirós. Abstract: The Conformer architecture has achieved state-of-the-art results in several tasks, including automatic speech recognition and automatic speaker verification. However, its utilization in sound event detection and in particular in the DCASE Challenge Task 4 has…
- Deployment of KWS models: audio features optimization and streaming mode. Speaker: William Fernando López Gavilánez. Abstract: The deployment process of Keyword Spotting (KWS) models depends on the target hardware; it normally includes merging components in a black box, binarization, quantization, and/or mobile optimization. In addition, while processing a continuous stream…
- Lines of research in the field of acoustic events detection. Speaker: Sergio Segovia González. Abstract: Within the development of the doctoral thesis, whose objective is to work in the field of acoustic event detection, several lines of research have been pursued, such as using the…
- Fairness in the most popular ASR systems. Speaker: Pilar Fernández Gallego. Abstract: Nowadays ASR (Automatic Speech Recognition) systems have dramatically improved, due both to advances in deep learning and to the collection of large datasets used to train the systems. However, it has been demonstrated that some…
- VoxCeleb-Spain: design, acquisition and evaluation with deep neural networks of a database of Spanish celebrity voices. Speaker: Manuel Otero González. Abstract: This work presents a new database, VoxCeleb-Spain, captured following protocols analogous to those of the well-known VoxCeleb database, but using YouTube(TM) videos of celebrities of Spanish nationality. The evaluation of the database through various experiments is also…
- GuitarSet: A Dataset for Guitar Transcription. Speaker: Diego de Benito Gorrón. Abstract: Based on https://guitarset.weebly.com/uploads/1/2/1/6/121620128/xi_ismir_2018.pdf. The guitar is a popular instrument for a variety of reasons, including its ability to produce polyphonic sound and its musical versatility. The resulting variability of sounds, however, poses significant challenges…
- Representing evidence for Bayesian updating: compositional evidence, privacy and calibration. Speaker: Paul-Gauthier Noé. Abstract: Attribute privacy in multimedia technology aims to hide only one or a few personal characteristics, or attributes, of an individual rather than the full identity. To give a few examples, these attributes can be the sex,…
- Detection of abnormalities in electrocardiograms with 2 sensors using machine learning. Speaker: Ana Molina Conesa. Abstract: This talk is based on the Physionet Challenge 2021, in which participants aim to design and implement an algorithm capable of automatically identifying any cardiac abnormalities present in electrocardiogram (ECG) recordings with 12, 6, 4,…
- Anomaly detection in 12-lead electrocardiograms using machine learning. Speaker: Miguel González Rodríguez. Abstract: The Physionet Challenge 2021 is presented. The goal is to classify 27 types of cardiac anomalies from electrocardiograms using convolutional neural networks (CNN). The challenge database consists of over 30,000 patient records, making it one…
- Precomputed Sound Propagation for Virtual Reality & Gaming. Speaker: Joaquín González Rodríguez. Abstract: This talk is based on: Parametric Wave Field Coding for Precomputed Sound Propagation (ACM Transactions on Graphics, Vol. 33, No. 4, Article 38, Publication Date: July 2014); Parametric Directional Coding for Precomputed Sound Propagation (ACM…
- Breath cycle detection in respiratory audios. Speaker: Miguel Ángel Martínez Pay. Abstract: Neural networks applied to the detection of acoustic events in respiratory audios. Introduction to the ICBHI 2017 database dedicated to the classification of respiratory cycles into “normal”, “with crackles”, “with wheezes”, “with both”. Main…
- PhysioNet Challenge 2016: Classification of Heart Sound Recordings. Speaker: Javier Galán Fernández. Abstract: Cardiovascular diseases are the leading cause of death in the world, accounting for 32% of all deaths recorded throughout the year. The 2016 PhysioNet challenge aimed to encourage the development of algorithms to classify heart…
- How speaker diarization evolved recently: from clustering to end-to-end approaches. Speaker: Alicia Lozano Díez. Abstract: Speaker diarization systems aim to segment a multi-speaker audio recording according to speaker changes, providing the time stamps of the activity of each speaker, including segments where nobody speaks and those where more than one…
- VoxCeleb-Spain: Design, Acquisition and Preliminary Evaluation. Speaker: Manuel Otero González. Abstract: Description of VoxCeleb and its latest Challenges (2019-2022), elaboration and capture of an audio database of celebrities of Spanish nationality, and preliminary evaluation of a pre-trained system with the acquired data.
- MusicLM: Generating music from text. Speaker: Laura Herrera Alarcón. Abstract: Based on https://arxiv.org/pdf/2301.11325.pdf. This paper presents a new model for generating high-fidelity music from text descriptions. It combines SoundStream, w2v-BERT and MuLan, three models that make it possible to obtain temporal coherence and high-quality audio of…
- Iterative pseudo-forced alignment tool. Speaker: W. Fernando López Gavilánez. Abstract: High-quality data labeling from specific domains is costly and time-consuming for humans. In this work, we propose an iterative pseudo-forced alignment algorithm for long audio files with low-quality transcriptions. The alignments are iteratively done by…
- Differentially Private Fine-Tuning for Language Models. Speaker: Beltrán Labrador Serrano. Abstract: Based on https://arxiv.org/abs/2110.06500. In this talk we will discuss the paper Differentially Private Fine-Tuning for Language Models, where the authors give simpler, sparser, and faster algorithms for differentially private fine-tuning of large-scale pre-trained language models,…
- Conformer Architecture for Sound Event Detection (DCASE). Speaker: Sara Barahona Quirós. Abstract: Sound Event Detection is the task focused on automating the human ability to recognize sound events in the environment. Over the last years, the creation of evaluations such as the Detection and Classification…
- MixMatch: A Holistic Approach to Semi-Supervised Learning. Speaker: Diego de Benito Gorrón. Abstract: This talk is an overview of a NeurIPS 2019 paper by David Berthelot et al. (Google Research) that proposes a novel method for Semi-supervised learning: MixMatch. “Semi-supervised learning has proven to be a powerful…
- Highly accurate protein structure prediction with AlphaFold. Speaker: Juan Ignacio Álvarez Trejos. Abstract: Based on https://www.nature.com/articles/s41586-021-03819-2. Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort, the structures of around 100,000 unique proteins have been…
- Whisper: Robust Speech Recognition via Large-Scale Weak Supervision. Speaker: Doroteo Torre Toledano. Abstract: Very recently (in Sept 2022) OpenAI has made freely available a speech recognition neural network called Whisper. One of the main differences with respect to the current state of the art is the use of…
- Dynamic Bayesian Networks for Temporal Prediction of Chemical Radioisotope Levels in Nuclear Power Plant Reactors. Speaker: Daniel Ramos Castro. Abstract: Radiation dose in nuclear power plant reactors is known to be dominated by the presence of radioisotopes in the primary loop of the reactor. In order to strictly control it in normal operation (e.g., cleaning…
- Automatic adventitious respiratory sound analysis: A systematic review. Speaker: Miguel Ángel Martínez Pay. Abstract: Based on https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0177926. Automatic detection or classification of adventitious sounds is useful to assist physicians in diagnosing or monitoring diseases such as asthma, Chronic Obstructive Pulmonary Disease, and pneumonia. This article contains a compilation…
- Training Speaker Recognition Systems with Limited Data. Speaker: Guillermo Recio. Abstract: Based on paper https://www.isca-speech.org/archive/pdfs/interspeech_2022/vaessen22_interspeech.pdf. This work considers training neural networks for speaker recognition with smaller datasets compared to contemporary work. For this purpose, they propose three subsets of the VoxCeleb2 dataset. Each of these subsets contains…
- Exploring sequence-to-sequence transformer-transducer models for keyword spotting. Speaker: Beltrán Labrador Serrano. Abstract: Beltrán’s final Google research internship presentation. This presentation introduces a transformer-transducer keyword spotting system that simultaneously optimizes ASR and keyword spotting losses using a sequence to sequence RNN-T loss. Each loss is further balanced using…
- Perceiver: General Perception with Iterative Attention. Speaker: Juan Ignacio Álvarez Trejos. Abstract: Biological systems perceive the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, proprioception, etc. The perception models used in deep learning on the other hand are designed for…
- Continual learning for recurrent neural networks. Speaker: Doroteo Torre Toledano. Abstract: The current trend in machine learning assumes that there is a fixed distribution of incoming data, so that a fixed model can be learned to map incoming data to output classes. However, real applications in…
- Source Separation for Sound Event Detection in Domestic Environments Using Jointly Trained Models. Speaker: Diego de Benito Gorrón. Abstract: Sound Event Detection and Source Separation are closely related tasks: whereas the first aims to find the time boundaries of acoustic events inside a recording, the goal of the latter is to isolate each…
- Self-supervised Wav2Vec2 audio representations for speaker recognition. Speaker: Laura Herrera. Abstract: In this Final Degree Project, different speech representations, extracted by unsupervised learning, have been used to train a speaker recognition system. In particular, Wav2Vec2.0 and WavLM features have been used as a novelty. The Wav2Vec2.0 features…
- End-to-end deep learning models for air traffic control speech recognition. Speaker: Ana Belén Fernández Cordero. Abstract: For many years, Air Traffic Controllers have had to manually type the information they received and transmitted to pilots into the electronic flight strip systems. This time-consuming activity contributed to a significant increase…
- Efficient Transformers for End-to-End Neural Speaker Diarization. Speaker: Sergio Izquierdo. Abstract: The recently proposed End-to-End Neural speaker Diarization framework (EEND) handles speech overlap and speech activity detection natively. While extensions of this work have reported remarkable results in both two-speaker and multi-speaker diarization scenarios, these come at…
- Sound Event Detection in a large-scale audio dataset with multi-resolution neural networks. Speaker: Sara Barahona Quirós. Abstract: Sound event detection is the task that aims to automate the human ability to recognize sound events in the environment by their particular acoustic information. For this purpose, deep learning techniques are employed to build…
- A Speaker Verification Backend with Robust Performance across Conditions. Speaker: Joaquin Gonzalez-Rodriguez. Abstract: Presentation of the paper in https://arxiv.org/abs/2102.01760: L. Ferrer et al. “A Speaker Verification Backend with Robust Performance across Conditions”, 2021. Abstract of the paper (reproduced from the preprint): In this paper, we address the problem of…
- Linear-Gaussian Bayesian Network Applications to Forensic Chemistry. Speaker: Elías Hernandis Prieto. Abstract: Forensic evidence evaluation using the likelihood ratio framework requires knowledge about the probability distribution of the data. For evaluating samples of glass remains, this translates to obtaining the joint probability distribution of the relative concentrations…
- Improvements in deep learning semi-supervised model selection for the optimization of different Sound Event Detection metrics. Speaker: Cristina Moratilla. Abstract: Sound Event Detection is one of the most developed fields in the area of audio signal processing in the last decades. The objective of such detection is to locate the start and end instants of audio…
- Bias analysis in speaker recognition systems based on DNN embeddings. Speaker: Almudena Aguilera. Abstract: In this study we will evaluate the discriminatory behaviours that are generated in speaker recognition systems, specifically those that verify whether two audios belong to the same speaker or not. These systems work by extracting the…
- MetaAudio: A Few-Shot Audio Classification Benchmark. Speaker: David Martín Gutiérrez. Abstract: Currently available benchmarks for few-shot learning (machine learning with few training examples) are limited in the domains they cover, primarily focusing on image classification. This work aims to alleviate this reliance on image-based benchmarks by…
- Speaker Diarization, X-vectors with Encoder-Decoder based attractors. Speaker: Juan Ignacio Álvarez Trejos. Abstract: X-Vectors are speaker embeddings that emerge to address the speaker recognition task, surprisingly outperforming i-vectors in most speaker tasks. It is proposed to take advantage of the information contained in these embeddings by using…
- Gaussianization of LA-ICP-MS Features to Improve Calibration in Forensic Glass Comparison. Speaker: Pablo Ramírez Hereza. Abstract: The forensic comparison of glass task aims to compare a glass sample of unknown source with a control glass sample of known source. In this work, we use multielemental features from laser ablation inductively coupled…
- Article review: “Objectifying evidence evaluation for gunshot residue comparisons using machine learning on criminal case data”. Speaker: Daniel Ramos Castro. Abstract: Based on https://doi.org/10.1016/j.forsciint.2022.111293. “Comparative gunshot residue analysis addresses relevant forensic questions such as ‘did suspect X fire shot Y?’. More formally, it weighs the evidence for hypotheses of the form H1: gunshot residue particles found…
- Assessing Calibration in the regression setting. Speaker: Sergio Álvarez Balanya. Abstract: Calibration is a desirable property of pattern recognition systems, especially when their predictions are going to be used to make decisions. In our group, we are used to dealing with calibration in classification tasks such…
- Call-sign recognition and understanding for noisy air-traffic transcripts using surveillance information. Speaker: Ana Belén Fernández Cordero. Abstract: Air traffic control (ATC) relies on communication via speech between pilot and air-traffic controller (ATCO). The call-sign, as unique identifier for each flight, is used to address a specific pilot by the ATCO. Extracting…
- AVASpeech-SMAD: A speech and music activity detection database with label co-occurrence. Speaker: Guillermo Recio Martín. Abstract: AVASpeech is a publicly available dataset created in 2018 to contribute to the task of speech activity detection (SAD). This dataset contains three different types of audio segments: clean speech, speech co-occurring with music…
- Conformer-based sound event detection with semi-supervised learning and data augmentation. Speaker: Sara Barahona Quirós. Abstract: This paper presents a Conformer-based sound event detection (SED) method, which uses semi-supervised learning and data augmentation. The proposed method employs Conformer, a convolution-augmented Transformer that is able to exploit local features of audio data…
- Speaker Diarization with Region Proposal Network. Speaker: Sergio Izquierdo del Álamo. Abstract: Speaker diarization is an important pre-processing step for many speech applications, and it aims to solve the “who spoke when” problem. Although the standard diarization systems can achieve satisfactory results in various scenarios, they…
- Conversational Agents for Health Care. Speaker: Giuliano Lazzara. Abstract: Brief that focuses on people’s perception of Conversational Agents and proposes these technologies as a tool to deal with underestimated mental issues such as depression and anxiety. Referring to experiments done with “Woebot”, an automated conversational…
- data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language. Speaker: Sergio Segovia. Abstract: The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup using a standard Transformer architecture. Instead of predicting modality-specific targets such…
- Data Augmentation for Decoupled Calibration of Deep Neural Network Classifiers. Speaker: Sergio Márquez Carrero. Abstract: Modern Deep Neural Networks (DNN) have significantly outperformed those employed over a decade ago in terms of accuracy. Nonetheless, the outputs generated by these models are poorly calibrated, causing substantial issues in a variety of…
- Connectionist Temporal Classification (CTC) Speech Segmentation. Speaker: W. Fernando López Gavilánez. Abstract: Motivated by the lack of high-quality labeled data for specific scenarios, such as emergencies in the home environment, we explored a CTC-segmentation method to generate a specific-purpose speech dataset. The project seeks the quality improvement of…
- BigSSL: Large-Scale Semi-Supervised Learning for ASR. Speaker: Laura Herrera. Abstract: This paper deals with results obtained on very large automatic speech recognition models. A large amount of labelled data is not always available and sometimes they do not generalize enough. Consequently, the authors propose to use pre-trained…
- Efficient Neural Approaches for Automatic Speech RecognitionSpeaker: Doroteo Torre Toledano Abstract: Many different end-to-end neural approaches have been proposed in the last years in the field of automatic speech recognition (ASR). However, most of the research available compares systems only in terms of accuracy (word error… Read More
- Structured Output LearningSpeaker: María Pilar Fernández Rodríguez Abstract: Speech applications dealing with conversations require not only recognizing the spoken words, but also determining who spoke when, the language, punctuation, capitalization… To deal with it, it is typically addressed by merging the outputs… Read More
- VoxCeleb Experiment: fairness. Speaker: Almudena Aguilera. Abstract: The experiment is based on the VoxCeleb dataset [1], using the two pre-trained models. The main idea of these experiments was to study the fairness problems in different demographic groups present in the database…
- Semi-Supervised Music Tagging Transformer. Speaker: David Martín. Abstract: Music Tagging Transformer (MTT) was recently presented at the ISMIR 2021 conference as one of the most groundbreaking deep learning approaches for Music Information Retrieval. It consists of a semi-supervised approach where the model captures…
- Encoder-Decoder Based Attractor Calculation for End-to-End Neural Diarization. Speaker: Alicia Lozano Díez. Abstract: In this talk, we will review in depth the algorithms behind end-to-end systems for speaker diarization based on neural networks. In particular, we will describe how the encoder-decoder part of the model calculates “attractors” that capture…
- Unsupervised Sound Separation Using Mixture Invariant Training. Speaker: Diego de Benito Gorrón. Abstract: In recent years, rapid progress has been made on the problem of single-channel sound separation using supervised training of deep neural networks. In such supervised approaches, a model is trained to predict the component…
- relMix: An open source software for DNA mixtures with related contributors. Speaker: Elías Hernández. Abstract: DNA testing has been a major advance in the judicial context and is often regarded as the definitive evidence for convicting or acquitting a defendant. The results of a DNA test…
- Improving Fairness in Speaker Recognition. Speaker: Almudena Aguilera. Abstract: Speaker Recognition Systems aim to automatically recognize the identity of an individual from a recording of his/her speech or voice. Despite the progress of these systems in terms of accuracy, we must ask ourselves: “What happens…
- Speech Enhancement for Wake-up Word detection in Voice Assistants. Speaker: William Fernando López. Abstract: Wake-up-word (WuW) detection is a fundamental component in voice assistants. Undesired activation of the device is often due to external noises such as background conversations, TV or music. At Telefónica we have been working on…
- Unsupervised pre-training for learning speech representations: Wav2Vec and Wav2Vec2.0. Speaker: Laura Herrera. Abstract: These papers (https://arxiv.org/pdf/1904.05862.pdf and https://arxiv.org/pdf/2006.11477.pdf) explore unsupervised learning from raw audio for speech recognition. A large amount of labelled data is not always available; consequently, wav2vec uses a causal convolutional network trained with large amounts of unlabelled…
- Large-scale pre-training of End-to-End Multi-Talker ASR for meeting Transcription with Single Distant Microphone. Speaker: María Pilar Fernández Gallego. Abstract: Transcribing meetings containing overlapped speech with only a single distant microphone (SDM) has been one of the most challenging problems for automatic speech recognition (ASR). While various approaches have been proposed, all previous studies…
- Selective Kernel Networks. Speaker: Sergio Segovia. Abstract: It is well-known in the neuroscience community that the receptive field size of visual cortical neurons is modulated by the stimulus, which has been rarely considered in constructing CNNs. We propose a dynamic selection mechanism in…
- Calibration of Multiclass Probabilistic Classifiers. Speaker: Sergio Márquez. Abstract: Today’s Deep Neural Networks (DNNs) are used for numerous classification tasks, achieving high performance in terms of accuracy. In some cases, probabilistic classifiers, which assign a confidence value to each of the predictions made, are used…
- Deep Learning Models with Self-Attention for the Detection of Audio Events. Speaker: Julio González. Abstract: This talk is a presentation of the BSc Thesis “Modelos de aprendizaje profundo con auto-atención para detección de eventos de audio”. It describes the implementation of the Transformer and Conformer neural networks and presents the results of the test…