Keynote Speech 1

ISCA Distinguished Lecture on
Generative Adversarial Networks (GANs) for Speech Technology

Abstract

Adversarial training via Generative Adversarial Networks (GANs), pioneered by I. J. Goodfellow in 2014, is one of the most interesting and technologically challenging ideas in the field of machine learning. A GAN is a recent framework for estimating generative models through an adversarial training mechanism in which two models are trained simultaneously: a generator G that captures the (true) data distribution, and a discriminator D that estimates the probability that a sample came from the training data rather than from G. The training procedure for G (which is challenging with respect to convergence, mode collapse, etc.) is to maximize the probability of D making a mistake. This framework corresponds to a two-player minimax game (akin to a thief-and-police game). In the space of arbitrary differentiable functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to ½ everywhere (i.e., D is fooled by the generator). When G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation.
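To make the adversarial objective above concrete, the following is a minimal training-loop sketch (assuming PyTorch; the toy one-dimensional Gaussian data, network sizes, and hyperparameters are illustrative choices, not part of the talk). The discriminator D learns to separate real from generated samples, while the generator G is updated to make D misclassify its outputs, which is the minimax game described above.

```python
# Minimal GAN sketch (assumes PyTorch; a toy 1-D Gaussian stands in for the "true" data).
import torch
import torch.nn as nn

torch.manual_seed(0)

# Generator G: maps noise z to a sample; Discriminator D: outputs P(sample is real).
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    # Train D: push D(real) toward 1 and D(G(z)) toward 0.
    real = torch.randn(64, 1) * 0.5 + 2.0      # samples from the toy "true" distribution
    fake = G(torch.randn(64, 8)).detach()      # generator output, detached so G is not updated here
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Train G: make D label the generated samples as real (maximize D's mistake).
    fake = G(torch.randn(64, 8))
    loss_g = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```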

GANs are widely used in various applications, first in image processing and computer vision and more recently in speech. Notable examples include image (sample) generation, single-image super-resolution, text-to-image synthesis, and several speech technology applications (mostly after 2017), such as voice conversion, Non-Audible Murmur (NAM)-to-whisper conversion, whisper-to-normal speech conversion, voice imitation, speech enhancement, Text-to-Speech (TTS) synthesis, and, very recently, speaker recognition, natural language generation, data augmentation (for Automatic Speech Recognition (ASR) and low-resource languages), and domain adaptation. The objective of this talk is first to convey the fundamentals of GANs with respect to motivation, applications, and various GAN architectures, along with future research directions. The talk will then present a case study on how GANs could potentially improve the performance of cross-lingual speaker recognition for Indian and other Asian languages. Finally, the talk will bring out several open research problems (the relationship with variational autoencoders (VAEs), their asymptotic consistency, and the convergence of GANs) that need immediate attention to fully realize the potential of GANs in technological applications.

This talk focuses on the application of this technology to Asian language processing.

Biography of the Speaker:

Hemant A. Patil received the Ph.D. degree from the Indian Institute of Technology (IIT), Kharagpur, India, in July 2006. Since 2007, he has been a faculty member at DA-IICT Gandhinagar, India, where he developed the Speech Research Lab, recognized as one of the ISCA speech labs. Dr. Patil is a member of ISCA, IEEE, the IEEE Signal Processing Society, the IEEE Circuits and Systems Society, EURASIP, and APSIPA, and an affiliate member of the IEEE SLTC. He is a regular reviewer for ICASSP and INTERSPEECH, as well as for Speech Communication (Elsevier), Computer Speech and Language (Elsevier), the International Journal of Speech Technology (Springer), and Circuits, Systems, and Signal Processing (Springer). He has published more than 250 research papers in national and international conferences, journals, and book chapters. He visited the Department of ECE, University of Minnesota, Minneapolis, USA (May-July 2009) as a short-term scholar. He has been associated (as PI) with three MeitY-sponsored projects on ASR, TTS, and QbESTD, and was co-PI for the DST-sponsored project on India-Digital Heritage (IDH)-Hampi. His research interests include speech and speaker recognition, analysis of spoofing attacks, TTS, and infant cry analysis. He received the DST Fast Track Award for Young Scientists for infant cry analysis. He has co-edited four books with Dr. Amy Neustein (EIC, IJST, Springer): Forensic Speaker Recognition (Springer, 2011), Signal and Acoustic Modeling for Speech and Communication Disorders (DE GRUYTER, 2018), Voice Technologies for Speech Reconstruction and Enhancement (DE GRUYTER, 2020), and Acoustic Analysis of Pathologies from Infant to Young Adulthood (DE GRUYTER, 2020).

Dr. Patil has taken a lead role in organizing several ISCA-supported events, such as summer/winter schools and CEP workshops (on themes such as speaker and language recognition, speech source modeling, text-to-speech synthesis, the speech production-perception link, and advances in speech processing), as well as progress review meetings for two MeitY consortium projects, all at DA-IICT Gandhinagar. He has supervised five doctoral and 42 M.Tech. theses (all in the speech processing area), and is presently supervising three doctoral and three master's students. He offered joint tutorials with Prof. Haizhou Li at the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) 2017 and at INTERSPEECH 2018, and a joint tutorial with H. Kawahara on the topic, “Voice Conversion: Challenges and Opportunities,” at APSIPA ASC 2018, Honolulu, USA. He was selected as an APSIPA Distinguished Lecturer (DL) for 2018-2019 and has delivered 20 APSIPA DLs in four countries, namely, India, Singapore, China, and Canada. Recently, he was selected as an ISCA Distinguished Lecturer (DL) for 2020-2021 and has delivered five ISCA DLs in India. He has also been invited to deliver an ISCA DL during the overview session of APSIPA ASC 2020, New Zealand, December 7-10, 2020.

Homepage of Prof. Hemant A. Patil: https://sites.google.com/site/hemantpatildaiict/

Speech Research Lab @ DA-IICT Gandhinagar: https://sites.google.com/site/speechlabdaiict/