Tutorial 1: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Speech Processing

Presenters: Yu Zhang, Bo Li, Daniel Park, Google

Semi-supervised learning (SSL), which uses unlabeled data to enhance the performance of labeled tasks, has recently played a crucial part in improving public automatic speech recognition (ASR) benchmarks. A combination of pre-training [1]–[13] and self-training [14]–[25] methods has been used to enable deep networks to push the state-of-the-art (SoTA) performance on public ASR datasets [12], [23], [26]. Despite the success and exciting developments in this domain, this setting for semi-supervised learning is limited in a few respects. First, the unsupervised data is tailored to the supervised task, and models pretrained on Libri-Light have shown limited capacity to generalize to different domains in some instances. Second, the Libri-Light dataset is not much bigger than industrial-scale labeled datasets. Third, the supervised tasks considered are much smaller than the practical tasks on which the performance of the network needs to be improved.

In this tutorial, we will explore how to build a universal speech understanding model that is capable of transcribing speech from many languages and many domains, as well as obtaining superior performance on many non-ASR speech understanding tasks such as speech translation and non-semantic speech classification tasks (such as language-ID detection, speaker-ID detection, etc.). More precisely, we present:

  • Exploring SSL on an industrial model scale: from 600M to 8B.
  • Exploring SSL on an industrial data scale: millions of hours of massively multilingual data.
  • A new framework for both speech and text pretraining that is suitable for industrial scale.

Yu Zhang is currently a research scientist at Google Brain. He received his Ph.D. degree in computer science from the Massachusetts Institute of Technology in 2017. During his Ph.D., he worked on improving speech recognition performance. He is a fan of open source projects and has contributed to or been involved in the development of CNTK, MXNet and ESPNet to facilitate ASR research. Currently, his research interests are improving ML model performance for various speech processing applications, with a focus on sequence-to-sequence modeling. Yu is a main contributor to Google’s next-generation RNNT ASR model and Tacotron-based text-to-speech system.

Bo Li received his Ph.D. degree in computer science from the School of Computing, National University of Singapore, in 2014 and his B.E. degree in computer engineering from the School of Computer, Northwestern Polytechnical University, China, in 2008. He is currently a staff research scientist at Google. His research interests are mainly in massively multilingual end-to-end automatic speech recognition using semi-supervised learning, lifelong learning and transfer learning.

Daniel Park is a research scientist at Google Brain, where he started working as an AI resident in 2018. His research interests include automatic speech recognition, semi-supervised learning, neural architecture search, multimodal learning and audio generation. Daniel received his PhD in Physics from MIT in 2012 and has held postdoctoral positions for high energy theoretical physics research at Stony Brook University and Rutgers University.

Tutorial 2: TorchAudio Tutorial

Presenters: Xiaohui Zhang, Zhaoheng Ni, Jeff Hwang, Caroline Chen, Meta

This session will give an overview of the TorchAudio library and three tutorials covering advanced usage of its components. In the Source Separation and Speech Enhancement tutorial, we will demonstrate how to 1) perform speech separation by using a ConvTasNet model trained on the Libri2Mix dataset, 2) perform music source separation by using a Hybrid Demucs model trained on the MUSDB18-HQ dataset, and 3) build and run an end-to-end multi-channel speech enhancement model training pipeline. In the Streaming Automatic Speech Recognition (ASR) tutorial, we will walk participants through loading speech streams, applying transforms, extracting features, and passing features to a pre-trained streaming ASR model to generate real-time transcriptions. Finally, in the Self-Supervised Learning (SSL) Pipeline tutorial, we will demonstrate TorchAudio’s SSL support by showcasing pre-trained SSL models (wav2vec 2.0, HuBERT, VoxPopuli), end-to-end training recipes (HuBERT pre-training and fine-tuning), downstream datasets used in the SUPERB benchmark, and a highly efficient CTC decoder for evaluating fine-tuned SSL models.

Xiaohui Zhang is currently a research scientist in the PyTorch Audio team of Meta. He obtained his PhD in Electrical Engineering and his Master's in Applied Math and Stats from the Center for Language and Speech Processing (CLSP) at Johns Hopkins University (JHU), supervised by Dan Povey and Sanjeev Khudanpur, and then joined Meta as a research scientist in 2018. His contributions to the ASR community span acoustic modeling, discriminative training, optimization, pronunciation learning, OOV recovery, etc. He was one of the main contributors to Kaldi.

Zhaoheng Ni is currently a research scientist in the PyTorch Audio team of Meta. He graduated from the City University of New York, where he was supervised by Michael Mandel, and then joined Meta AI as a research scientist in 2021. His research interests are single-channel and multi-channel speech enhancement, speech separation, and robust ASR.

Caroline Chen is currently a software engineer in the PyTorch Audio team of Meta. She graduated from MIT with a Bachelor’s degree in Computer Science and Engineering, and joined Meta AI in 2021.

Jeff Hwang is an engineer on Meta’s PyTorch Audio team.

Tutorial 3: Towards Solving Cocktail Party Problem with Artificial Intelligence

Presenter: Dr. Chenglin Xu, Kuaishou Technology

Humans have the remarkable ability to focus on an attended voice at a cocktail party. How to give machines this ability has been studied for decades. With the recent artificial intelligence revolution, speech separation and extraction techniques have achieved breakthroughs towards solving the cocktail party problem. Solving it would make many speech tasks in human communication and human-machine interaction possible in a cocktail party environment.

This tutorial will cover speech separation and extraction, from the basic concepts up to recent developments. Blind source separation methods, which mimic the human bottom-up process, will be summarized first. Speaker extraction techniques, which mimic the human top-down process, are then introduced to relax some of the limitations of blind source separation. Audio and visual cues used as references to assist the extraction process will be studied. Downstream applications in a cocktail party environment will be further reviewed. Finally, challenges and opportunities will be discussed.

Chenglin Xu is currently with the Audio and Video Technology Group at Kuaishou Technology, China. Dr. Xu received his B.Sc. and M.Sc. from Northwestern Polytechnical University, China, in 2012 and 2015, and his PhD degree from Nanyang Technological University, Singapore, in 2020. After that, he worked at the National University of Singapore (2020-2021) as a research fellow. His research interests include source separation, speaker extraction, speech enhancement, speaker verification, speech recognition and deep learning. He has published over 40 papers in prestigious journals and conferences, including IEEE TPAMI, IEEE/ACM TASLP, Neural Networks, ICASSP and INTERSPEECH. He served as an Area Chair and Technical Program Chair for O-COCOSDA 2020 and 2021.

Tutorial 4: Quantum Machine Learning for Speech Processing: from Theoretical Foundations to Practices

Presenters: Prof. Jun Qi, Fudan University, Shanghai, China; Huck Yang, Ph.D. candidate, Georgia Institute of Technology, Atlanta, GA, USA

State-of-the-art machine learning (ML), particularly based on deep neural networks (DNNs), has enabled a wide spectrum of successful applications ranging from the everyday deployment of speech recognition and computer vision to the frontier of scientific research in synthetic biology. Despite rapid theoretical and empirical progress in DNN-based regression and classification, DNN training algorithms are computationally expensive, even beyond the physical limits of classical hardware. The imminent advent of quantum computing devices opens up new possibilities for exploiting quantum machine learning (QML) to improve the computational efficiency of ML algorithms in new domains. In particular, advances in quantum hardware enable QML algorithms to run on noisy intermediate-scale quantum (NISQ) devices. Furthermore, we could employ hybrid quantum-classical models that rely on optimizing parametric quantum circuits, which are resilient to quantum noise errors and admit many practical QML implementations on NISQ devices. In this tutorial, we discuss how to set up quantum neural networks and put forth the related applications in speech and acoustic processing. The tutorial includes sections on an introduction to quantum machine learning, optimizing quantum neural networks, and the use of variational quantum circuits for speech and acoustic processing.

Dr. Jun Qi is now an Assistant Professor in Electronic Engineering at Fudan University, Shanghai, China. He received his Ph.D. from the School of Electrical and Computer Engineering at the Georgia Institute of Technology, Atlanta, GA, in 2022, advised by Prof. Chin-Hui Lee and Prof. Xiaoli Ma. Previously, he obtained two Master's degrees in Electrical Engineering from the University of Washington, Seattle, and Tsinghua University, Beijing, in 2013 and 2017, respectively. In addition, he was a research intern in the Deep Learning Technology Center at Microsoft Research, Redmond, WA, Tencent AI Lab, WA, and MERL, MA, USA. Dr. Qi was the recipient of 1st prize in the Xanadu AI Quantum Machine Learning Competition 2019, and his ICASSP paper on quantum speech recognition was nominated as a best paper candidate in 2022. He also gave two tutorials on Quantum Neural Networks for Speech and Language Processing at IJCAI’21 and ICASSP’22.

Huck Yang is a final-year Ph.D. candidate working on robust and privacy-preserving speech recognition and sequence modeling, advised by Prof. Chin-Hui Lee at Georgia Tech, GA, USA. He received his B.Sc. from National Taiwan University in 2016. He worked on large-scale ASR-LM with Dr. Ivan Bulyko and Dr. Andreas Stolcke at Amazon Alexa AI, WA, USA, in 2020 and 2021, and on multilingual speech recognition with Dr. Bo Li and Dr. Yu Zhang at Google Research, CA, USA, in 2022. He received the Judges’ award at DCASE 2021, the best reproducible system award at DCASE 2020, the Xanadu AI Quantum ML Research award 1st Place in 2019, the EPFL summer research fellowship in 2018, and the Wallace H. Coulter Fellowship in 2017.

Tutorial 5: Recent Advances on Automatic Dialogue Evaluation

Presenters: Luis Fernando D’Haro, Universidad Politécnica de Madrid; Chen Zhang, National University of Singapore

In recent years, dialogue systems have attracted significant interest from both academia and industry. Open-domain dialogue systems, a.k.a. chatbots, have gained especially great momentum. Since 2016, data-driven generative models have become popular in open-domain dialogue systems research. Assessing the performance of such models involves extensive human evaluation, which is both time- and cost-intensive. Hence, during the model development phase, researchers and practitioners must rely on automatic evaluation metrics. Yet a long-standing challenge is the lack of meaningful metrics that correlate well with human evaluation. Over the past three years, there has been considerable progress towards meaningful automatic evaluation metrics for dialogue. Taxonomies of dialogue evaluation have been defined. More and more standard benchmarks for meta-evaluation of the metrics have been created. Various meaningful, reference-free, and model-based metrics have been proposed.

Our tutorial covers recent advances in automatic dialogue evaluation research, focusing on developments in the field from 2016 to 2022. We will discuss (1) common NLG metrics that are used in dialogue evaluation and the problems associated with them; (2) the taxonomy of dialogue evaluation; (3) the newly established dialogue evaluation benchmarks and metrics; and (4) future research directions.

Luis Fernando D’Haro received his Electronic Engineering degree (2000) from Universidad Autónoma de Occidente (Colombia) and his PhD (2009) from Universidad Politécnica de Madrid. During his PhD, he was a Visiting Researcher at the I6 HLT-PR Group in Aachen, Germany (2005) and AT&T Research labs in NJ, USA (2006). Later he completed a postdoctoral research stay at the Speech Processing Group in Brno, Czech Republic (2011). From 2014 to 2018 he worked at I2R, A*STAR in Singapore. Since 2018 he has been an Associate Professor at Universidad Politécnica de Madrid (Spain), where he is currently a member of the Speech Technology and Machine Learning group.

He has participated in more than 40 research projects (2 European, 18 National, 9 private, 12 institutional), has authored more than 23 top journal papers and more than 140 international conference papers, and is an editor for 2 books with Springer and an invited Editor for 2 special issues at Computer Speech and Language and IEEE-ACM TASLP. He is a regular reviewer for more than 5 top journals and 10 top conferences (including serving as area chair at ACL2020), and for National research programs such as PRELUDIUM (Poland) and RGC (Hong Kong). He co-organized DSTC challenges in 2015, 2016, 2021 and 2022, JSALT2020 (organized by Johns Hopkins University), WoChat2016-2018 and DBDC4-5, which have the common goal of advancing dialogue systems and their automatic evaluation. He also helped to organize Interspeech2014, HAI2016 (where he participated as a presenter for the tutorial “Natural Language in Human-Robot Interaction”) and IWSDS2018, and was general chair for IWSDS2020, which was held in Madrid with more than 180 registered participants. In 2021, he was the faculty advisor for the Spanish team Genuine2 at the Alexa Socialbot Grand Challenge (SGC4).

His current research focuses on spoken dialogue and NLP systems. This includes automatic evaluation and controlled multimodal and multilingual generation for open-domain dialogue systems, language and speaker recognition, as well as automatic evaluation for machine translation.

Chen Zhang is a final-year PhD candidate in the Electrical & Computer Engineering (ECE) department of the National University of Singapore (NUS). He is also associated with Robert Bosch (SEA) under the NUS-Bosch Industrial PhD Programme. His main research interests include dialogue systems, especially automatic dialogue evaluation and open-domain dialogue generation. His work on “Investigating the Impact of Pre-trained Language Models on Dialog Evaluation” received the best paper award at IWSDS-2021. He was one of the main organizers of the DSTC10 track 5 challenge on “Automatic Evaluation and Moderation of Open-domain Dialogue Systems”. Currently, he is part of the organizing committee of DSTC11.