Plenary Speaker 1
Speaker: Dr Jinyu Li, Partner Applied Science Manager, Microsoft
Title: Advancing end-to-end automatic speech recognition and beyond
Jinyu Li received the Ph.D. degree from Georgia Institute of Technology and joined Microsoft in 2008. Since 2012, he has led the development of deep-learning-based automatic speech recognition technologies for Microsoft products, including both hybrid models and the most recent end-to-end models, which has enabled Microsoft’s industry success with state-of-the-art speech recognition in products such as Cortana, Teams, Xbox, and Skype. Currently, he is a Partner Applied Science Manager at Microsoft, leading a team that designs and improves advanced speech modeling algorithms and technologies. His major research interests cover several topics in speech processing, including end-to-end modeling, deep learning, acoustic modeling, speech separation, and noise robustness. He is the lead author of the book “Robust Automatic Speech Recognition: A Bridge to Practical Applications”, Academic Press, 2015. He has been a member of the IEEE Speech and Language Processing Technical Committee and has served as an area chair of ICASSP since 2017. He also served as an associate editor of IEEE/ACM Transactions on Audio, Speech and Language Processing from 2015 to 2020. He was elected as an Industrial Distinguished Leader of the Asia-Pacific Signal and Information Processing Association in 2021.
The speech community is transitioning from hybrid modeling to end-to-end (E2E) modeling for automatic speech recognition (ASR). While E2E models have achieved state-of-the-art ASR accuracy on most benchmarks, many practical factors affect the decision to deploy a model in production, including low-latency streaming, leveraging text-only data, and handling overlapped speech. Without excellent solutions to all of these factors, it is hard for E2E models to be widely commercialized.
In this talk, I will give an overview of recent advances in E2E models, focusing on technologies that address those challenges from an industry perspective. To design a high-accuracy, low-latency E2E model, a masking strategy was introduced into the Transformer Transducer. I will discuss technologies that leverage text-only data, both for general model training via pretraining and for adaptation to a new domain via augmentation and factorization. Then, I will extend E2E modeling to streaming multi-talker ASR. I will also show how we go beyond ASR by extending what we have learned in E2E ASR to a new area, speech translation, and build high-quality E2E speech translation models even without any human-labeled speech translation data. Finally, I will conclude the talk with some new research opportunities we may work on.
Plenary Speaker 2
Speaker: Prof Eng Siong Chng, Associate Professor, Nanyang Technological University
Title: Recent progress in code-switch Singapore English+Mandarin large vocabulary continuous speech recognition
Dr Chng Eng Siong is currently an Associate Professor in the School of Computer Science and Engineering (SCSE), Nanyang Technological University (NTU), Singapore. Prior to joining NTU in 2003, he worked in several research centers and companies, namely Knowles Electronics (USA), Lernout and Hauspie (Belgium), Institute of Infocomm Research (I2R, Singapore), and RIKEN (Japan), with a focus on signal processing and speech research. He received both his PhD and BEng (Hons) from Edinburgh University, U.K., in 1996 and 1991 respectively.
His research currently focuses on speech recognition using DNN frameworks, including low-resource and noisy conditions and adaptation to target domains (accents, use cases). Additionally, he explores multilingual code-switch speech recognition such as English/Mandarin and English/Malay.
To date, he has been a Principal Investigator of research grants awarded by Alibaba, NTU-Rolls Royce, Mindef, MOE, and AStar, with total funding of over S$10 million under the “Speech and Language Technology Program (SLTP)” at SCSE. He has graduated 17 PhD students and 10 Master’s Engineering students. His publications include 2 edited books and over 100 journal/conference papers. He has served as the publication chair for 5 international conferences (Human-Agent Interaction 2016, INTERSPEECH 2014, APSIPA-2010, APSIPA-2011, ISCSLP-2006) and on the local organizing committee of ASRU 2019.
Modern speech recognition has a long history, stretching back to the 1970s. It received renewed interest and a significant improvement in recognition performance when DNN approaches were introduced into acoustic modeling in 2013, and it has lately transformed from the traditional Acoustic+Language+Decoder approach to an end-to-end system that has almost reached state-of-the-art performance.
In this talk, we will share our experience in developing code-switching English/Mandarin speech recognition as well as recent advances in this field.
Plenary Speaker 3
Speaker: Kate Knill, Principal Research Associate, University of Cambridge
Title: Automated Assessment and Feedback: the Role of Spoken Grammatical Error Correction
Dr. Kate Knill is a Principal Research Associate at the Department of Engineering and the Automatic Language Teaching and Assessment Institute (ALTA), Cambridge University. She is the Principal Investigator for the ALTA Spoken Language Processing (SLP) Technology Project. Kate was sponsored by Marconi Underwater Systems Ltd for her 1st class B.Eng. (Jt. Hons) degree in Electronic Engineering and Maths at Nottingham University and holds a PhD in Digital Signal Processing from Imperial College. She has worked for over 25 years on spoken language processing, developing automatic speech recognition and text-to-speech synthesis systems in industry and academia. As an individual researcher and as a leader of multi-disciplinary teams, in her roles as Languages Manager at Nuance Communications and Assistant Managing Director at Toshiba Research Europe Ltd, Cambridge Research Lab, she has developed speech systems for over 50 languages and dialects. Her current research focuses on applications for non-native spoken English language assessment and learning, and on the detection of speech and language disorders. She was Secretary of the International Speech Communication Association (ISCA) (2017-2021) and is a member of ISCA, the Institution of Engineering and Technology (IET) and the Institute of Electrical and Electronics Engineers (IEEE).
Automated assessment and feedback can support the studies of well over one billion learners of English as a second language (L2) worldwide. Their use for speaking skills is growing as deep learning and the rise of mobile devices make providing computer-assisted language learning (CALL) 24/7 increasingly feasible. One of the key elements in second language acquisition is grammatical construction; as a learner’s proficiency improves, so does the complexity of their grammar. Spoken Grammatical Error Correction (SGEC) is designed to detect and correct grammatical errors in free speech. These corrections can be used either to aid assessment of a candidate’s ability or, more directly, as feedback to learners on the errors they are making. Applying grammatical error correction to speech poses a number of challenges. Firstly, spoken grammar is not entirely the same as written grammar, and speech contains disfluencies that need to be identified and ignored. Secondly, whilst the number of text corpora labelled for GEC is increasing, the amount of labelled speech data is minimal. Additionally, SGEC must be run on transcriptions from ASR, which will contain errors. In this talk, we will discuss how these problems can be addressed and how deep learning systems can be built to use SGEC for assessment and feedback. The remaining challenges will also be presented.