Plenary Speaker 1
Speaker: Dr Jinyu Li, Partner Applied Science Manager, Microsoft
Title: Advancing end-to-end automatic speech recognition and beyond
Biography: Jinyu Li received the Ph.D. degree from the Georgia Institute of Technology and joined Microsoft in 2008. Since 2012, he has led the development of deep-learning-based automatic speech recognition technologies, covering both hybrid models and the most recent end-to-end models, enabling Microsoft to ship state-of-the-art speech recognition products in Cortana, Teams, Xbox, Skype, and more. Currently, he is a Partner Applied Science Manager at Microsoft, leading a team that designs and improves advanced speech modeling algorithms and technologies. His major research interests cover several topics in speech processing, including end-to-end modeling, deep learning, acoustic modeling, speech separation, and noise robustness. He is the lead author of the book “Robust Automatic Speech Recognition — A Bridge to Practical Applications” (Academic Press, 2015). He has been a member of the IEEE Speech and Language Processing Technical Committee and has served as an area chair of ICASSP since 2017. He also served as an associate editor of IEEE/ACM Transactions on Audio, Speech, and Language Processing from 2015 to 2020. He was elected an Industrial Distinguished Leader by the Asia-Pacific Signal and Information Processing Association in 2021.
The speech community is transitioning from hybrid modeling to end-to-end (E2E) modeling for automatic speech recognition (ASR). While E2E models have achieved state-of-the-art accuracy on most benchmarks, many practical factors affect production deployment decisions, including low-latency streaming, leveraging text-only data, and handling overlapped speech. Without excellent solutions to all of these factors, it is hard for E2E models to be widely commercialized.
In this talk, I will give an overview of recent advances in E2E models, with a focus on technologies that address these challenges from an industry perspective. To design a high-accuracy, low-latency E2E model, a masking strategy was introduced into the Transformer Transducer. I will discuss technologies that leverage text-only data for general model training via pretraining, and for adaptation to a new domain via augmentation and factorization. Then, I will extend E2E modeling to streaming multi-talker ASR. I will also show how we go beyond ASR by extending E2E ASR learning to a new area, speech translation, and build high-quality E2E speech translation models even without any human-labeled speech translation data. Finally, I will conclude the talk with some new research opportunities we may pursue.
Plenary Speaker 2
Speaker: Prof Chng Eng Siong, Associate Professor, Nanyang Technological University
Dr Chng Eng Siong is currently an Associate Professor in the School of Computer Science and Engineering (SCSE), Nanyang Technological University (NTU), Singapore. He joined NTU in 2003 and has been Assistant Chair of Graduate Students for SCSE since 2019. Prior to joining NTU, he worked at Knowles Electronics (2001-2002), Lernout & Hauspie (1999-2000, Belgium), the Institute for Infocomm Research (I2R, 1996-1999, Singapore), and RIKEN (1996, Japan).
Dr Chng received his BEng (Hons) and PhD degrees from Edinburgh University, U.K., in 1991 and 1996, respectively. His PhD thesis was supervised by Bernard Mulgrew, Peter Grant, and Chen Sheng.
His current research interests include NLP, machine learning, and speech and signal processing. To date, he has been Principal Investigator of several research grants awarded by Alibaba, Webank, AISG, NTU-Rolls Royce, Mindef, MOE, and A*STAR, with total funding of over S$10 million under the Speech and Language Laboratory at SCSE. He has supervised 17 PhD students and 10 MEng students. His publications include 2 edited books and over 100 journal and conference papers.
He served as publication chair for 5 international conferences (Human Agent Interaction 2016, INTERSPEECH 2014, APSIPA 2010, APSIPA 2011, ISCSLP 2006), and has been an associate editor for IEICE (special issue, 2012) and a reviewer for Speech Communication, Eupsico, IEEE Transactions on Systems, Man, and Cybernetics, Part B, Journal of Signal Processing Systems, ACM Multimedia Systems, IEEE Transactions on Neural Networks, IEEE Transactions on Circuits and Systems II, and Signal Processing.
He is a PhD supervisor and PI for 4 AIR projects (speech research) in the Alibaba-NTU Joint Research Institute. He was the recipient of the Tan Chin Tuan fellowship (2007) to visit Tsinghua University, the JSPS travel grant award (2008) to visit the Tokyo Institute of Technology, and the Merlion Singapore-France research collaboration award in 2009.