Sunday 14 September, 2014

The INTERSPEECH 2014 Organising Committee is pleased to announce the following eight tutorials, presented by distinguished speakers, which will be offered on Sunday, 14 September 2014. All tutorials will be three (3) hours in duration and require an additional registration fee (separate from the conference registration fee).

Tutorial Fees

The tutorial handouts will be provided electronically ahead of the tutorials. Please download and print them at your convenience, as hard copies will not be provided at the conference.

Morning tutorials

0930 – 1230


Non-speech acoustic event detection and classification
- Tuomas Virtanen and Jort F. Gemmeke


Contribution of MRI to Exploring and Modeling Speech Production
- Kiyoshi Honda and Jianwu Dang


Computational Models for Audiovisual Emotion Perception
- Emily Mower Provost and Carlos Busso


The Art and Science of Speech Feature Engineering
- Samuel Thomas and Sriram Ganapathy

Afternoon tutorials

1400 – 1700


Recent Advances in Speaker Diarization
- Hagai Aronowitz


Multimodal Speech Recognition with the AusTalk 3D Audio-Visual Corpus
- Roberto Togneri, Mohammed Bennamoun and Chao Sui


Semantic Web and Linked Big Data Resources for Spoken Language Processing
- Dilek Hakkani-Tur and Larry Heck


Speech and Audio for Multimedia Semantics
- Florian Metze and Koichi Shinoda


The ISCSLP 2014 Organising Committee welcomes INTERSPEECH 2014 delegates to join the four ISCSLP tutorials, which will be offered on Saturday, 13 September 2014.


1330 – 1530


Adaptation Techniques for Statistical Speech Recognition
- Kai Yu


Emotion and Mental State Recognition: Features, Models, System Applications and Beyond
- Chung-Hsien Wu, Hsin-Min Wang, Julien Epps and Vidhyasaharan Sethu


1600 – 1800


Unsupervised Speech and Language Processing via Topic Models
- Jen-Tzung Chien


Deep Learning for Speech Generation and Synthesis
- Yao Qian and Frank K. Soong


Title: Non-speech acoustic event detection and classification
Presenters: Tuomas Virtanen (Tampere University of Technology, Finland) and Jort F. Gemmeke (KU Leuven, Belgium)

Abstract: Research in audio signal processing has been dominated by speech research, yet most of the sounds in our real-life environments are actually non-speech events such as cars passing by, wind, warning beeps, and animal sounds. These acoustic events carry much information about the environment and the physical events that take place in it, enabling novel application areas such as safety, health monitoring and the investigation of biodiversity. But while recent years have seen widespread adoption of applications such as speech recognition and song recognition, generic computer audition is still in its infancy.

Non-speech acoustic events differ from speech in several fundamental ways, but many of the core algorithms used by speech researchers can be leveraged for generic audio analysis. This tutorial is a comprehensive review of the field of acoustic event detection as it currently stands. Its goal is to foster interest in the community, highlight the challenges and opportunities, and provide a starting point for new researchers. We will discuss what acoustic event detection entails and its commonalities with and differences from speech processing, such as the large variation in sounds and the possible overlap with other sounds. We will then cover basic experimental and algorithm design, including descriptions of available databases and machine learning methods, before turning to more advanced topics such as methods for dealing with temporally overlapping sounds and modelling the relations between sounds. We will finish with a discussion of avenues for future research.
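To make the detection task concrete, here is a minimal sketch of frame-based acoustic event detection. It uses a simple energy threshold as the detector and merges adjacent active frames into events; the frame sizes, threshold and the energy criterion are illustrative assumptions, and a real system would replace the threshold with a trained classifier (e.g. a GMM or neural network) over spectral features.

```python
def frame_signal(signal, frame_len, hop):
    """Split a 1-D signal into overlapping frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def detect_events(signal, frame_len=256, hop=128, threshold=0.01):
    """Return (start, end) sample indices of detected events: frames whose
    mean energy exceeds the threshold, with overlapping frames merged."""
    events = []
    for idx, frame in enumerate(frame_signal(signal, frame_len, hop)):
        energy = sum(x * x for x in frame) / frame_len
        if energy > threshold:
            start = idx * hop
            end = start + frame_len
            if events and start <= events[-1][1]:   # overlaps previous event
                events[-1] = (events[-1][0], end)   # merge into it
            else:
                events.append((start, end))
    return events

# Toy example: silence, then a loud burst, then silence again.
sig = [0.0] * 1000 + [0.5] * 1000 + [0.0] * 1000
print(detect_events(sig))  # → [(768, 2176)]
```

The frame/merge structure is what matters here: swapping the energy test for a classifier score turns this into the standard detection-by-classification pipeline.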

Biography: Tuomas Virtanen is an Academy Research Fellow and an adjunct professor in Department of Signal Processing, Tampere University of Technology (TUT), Finland. He received the M.Sc. and Doctor of Science degrees in information technology from TUT in 2001 and 2006, respectively. He has also been working as a research associate at Cambridge University Engineering Department, UK. He is known for his pioneering work on single-channel sound source separation using non-negative matrix factorization based techniques, and their application to noise-robust speech recognition and music content analysis. In addition to the above topics, his research interests include content analysis of generic audio signals and machine learning.


Biography: Jort Florent Gemmeke is a postdoctoral researcher at the KU Leuven, Belgium. He received the M.Sc degree in physics from the Universiteit van Amsterdam (UvA) in 2005. In 2011, he received the Ph.D degree from the University of Nijmegen on the subject of noise robust ASR using missing data techniques. He is known for pioneering the field of exemplar-based noise robust ASR. His research interests are computer audition, automatic speech recognition, source separation, noise robustness and acoustic modelling, in particular exemplar-based methods and methods using sparse representations.




Title: Contribution of MRI to Exploring and Modeling Speech Production
Presenters: Kiyoshi HONDA (Tianjin University, China) and Jianwu DANG (JAIST, Japan)

Abstract: Magnetic resonance imaging (MRI) offers a remarkable window into the human body, not only through static imaging but also through motion imaging. MRI has been a powerful technique for speech research, used to study the finer anatomy of the speech organs and to visualize true vocal tracts in three dimensions. Inherent problems of slow image acquisition for speech tasks and insufficient signal-to-noise ratio for microscopic observation have obliged researchers to search for task-specific imaging techniques. Recent advances in 3-Tesla technology suggest more practical solutions for broader applications of MRI by overcoming previous technical limitations. In this joint tutorial in two parts, we summarize our previous efforts to accumulate scientific knowledge with MRI and to advance speech modeling studies for future development. Part 1, given by Kiyoshi Honda, introduces how to visualize the speech organs and vocal tracts, presenting techniques and data for finer static imaging, synchronized motion imaging, surface marker tracking, real-time imaging, and vocal-tract mechanical modeling. Part 2, presented by Jianwu Dang, focuses on applications of MRI to the phonetics of Mandarin vowels, the acoustics of vocal tracts with side branches, analysis and simulation in search of talker characteristics, physiological modeling of the articulatory system, and motor control paradigms for speech articulation.

Biography: Professor Kiyoshi HONDA, after graduating from Nara Medical University in 1971, spent four years at the University of Tokyo Hospital and then began voice and speech research at the Research Institute of Logopedics and Phoniatrics, University of Tokyo, and at Haskins Laboratories, USA, as a research associate. He moved to Kanazawa Institute of Technology as an associate professor in the Department of Electronics in 1986, then joined Advanced Telecommunications Research Institute International (ATR-I) in 1991 to continue voice and speech research as a supervisor. He also spent three years at the University of Wisconsin as a senior researcher on the X-ray microbeam project. He moved to the Phonetics and Phonology Lab, CNRS-University of Paris III, France, in 2006, and is currently a professor at Tianjin University, School of Computer Science and Technology, under the foreigner 1000-plan program. His research interests include the anatomy and physiology of the speech organs, acoustic/articulatory phonetics, electronic instrumentation techniques, and MRI-based visualization techniques. He has developed stereo-endoscopy, high-speed digital imaging, and vocal-tract imaging methods. His current project for the One-Thousand Talents Plan is “Multisensory Instrumentation for normal and pathological speech”.

Kiyoshi Honda has been working on MRI-based studies on voice and speech mechanisms since 1991. The major work on the tutorial topic includes static and dynamic MRI techniques for visualizing voice and speech organs, fundamental frequency control mechanisms, articulatory muscle functions, and roles of the hypopharyngeal cavities and interdental space.

Biography: Professor Jianwu Dang graduated from Tsinghua University, China, in 1982, and received his M.S. degree from the same university in 1984. He worked at Tianjin University as a lecturer from 1984 to 1988, and was awarded his Ph.D. by Shizuoka University, Japan, in 1992. He then worked at ATR Human Information Processing Labs, Japan, as a senior researcher from 1992 to 2001, and joined the University of Waterloo, Canada, as a visiting scholar for one year from 1998. Since 2001, he has worked at the Japan Advanced Institute of Science and Technology (JAIST) as a professor. He joined the Institut de la Communication Parlée (ICP), Centre National de la Recherche Scientifique (CNRS), France, as a research scientist first class from 2002 to 2003. Since 2009, he has also served at Tianjin University, China, as the dean of the School of Computer Science and Technology, and was a chair professor in the Department of Computer Science and Technology, Tsinghua University, China, from 2010 to 2012. He was awarded the title of scholar of the One-Thousand Talents Plan, and is the principal scientist of a National Basic Research Program (973 Program) project. His research interests span the fields of speech production, speech synthesis, and speaker recognition. He built MRI-based physiological models for speech and swallowing, and endeavours to apply these models to clinical issues.

Jianwu Dang has worked on MRI-based studies for articulatory and acoustic modeling since 1992. The major work for the tutorial topics includes acoustic roles of the nasal/paranasal cavities and the piriform fossa, construction of a 3D physiological articulatory model for speech synthesis and swallowing simulation, motor control of the articulatory model, and individual adaptation of the articulatory model.



Title: Computational Models for Audiovisual Emotion Perception
Presenters: Emily Mower Provost (University of Michigan) and Carlos Busso (The University of Texas at Dallas)

Abstract: In this tutorial we will explore engineering approaches to understanding human emotion perception, focusing on both modeling and application. We will highlight current and historical trends in emotion perception modeling, covering both psychological and engineering-driven theories of perception (statistical analyses, data-driven computational modeling, and implicit sensing). The importance of this topic can be appreciated from an engineering viewpoint: any system that either models human behavior or interacts with human partners must understand emotion perception, as it fundamentally underlies and modulates our communication. It matters equally from a psychological perspective: emotion perception is used in the diagnosis of many mental health conditions and is tracked in therapeutic interventions. Research in emotion perception seeks to identify models that describe the felt sense of ‘typical’ emotion expression, i.e., an observer/evaluator’s attribution of the emotional state of the speaker. This felt sense is a function of the methods through which individuals integrate the presented multimodal emotional information. We will cover psychological theories of emotion, engineering models of emotion, and experimental approaches to measuring emotion, and will demonstrate how these modeling strategies can be used as a component of emotion classification frameworks and to inform the design of emotional behaviors.

Biography: Emily Mower Provost received her B.S. in Electrical Engineering (summa cum laude and with thesis honors) from Tufts University, Boston, MA in 2004 and her M.S. and Ph.D. in Electrical Engineering from the University of Southern California (USC), Los Angeles, CA in 2007 and 2010, respectively. She is a member of Tau Beta Pi, Eta Kappa Nu, IEEE, and ISCA. She has been awarded the National Science Foundation Graduate Research Fellowship (2004-2007), the Herbert Kunzel Engineering Fellowship from USC (2007-2008, 2010-2011), the Intel Research Fellowship (2008-2010), and the Achievement Rewards For College Scientists (ARCS) Award (2009-2010). Her research interests are in human-centered speech and video processing, multimodal interface design, and speech-based assistive technology. The goals of her research are motivated by the complexities of human emotion generation and perception.


Biography: Carlos Busso is an Assistant Professor in the Electrical Engineering Department of The University of Texas at Dallas (UTD). He received his Ph.D. (2008) from USC. Before joining UTD, he was a Postdoctoral Research Associate at the Signal Analysis and Interpretation Laboratory (SAIL), USC. He was selected by the School of Engineering of Chile as the best electrical engineer to graduate in Chile in 2003 across Chilean universities. At USC, he received a Provost Doctoral Fellowship from 2003 to 2005 and a Fellowship in Digital Scholarship from 2007 to 2008. At UTD, he leads the Multimodal Signal Processing (MSP) laboratory. He received the Hewlett Packard Best Paper Award at IEEE ICME 2011 (with J. Jain) and is co-author of the winning paper of the Classifier Sub-Challenge at the INTERSPEECH 2009 Emotion Challenge. He has served on the organizing committees of several international conferences, including Interspeech 2016 (Publication Chair), IEEE FG 2015 (Doctoral Consortium Chair), IEEE ICME 2014 (Publication Chair) and ACM ICMI 2014 (Workshop Chair). His research interests are in digital signal processing, speech and video processing, affective computing and multimodal interfaces. His current research spans the broad areas of audiovisual emotion perception, affective computing, multimodal human-machine interfaces, modeling and synthesis of verbal and nonverbal behaviors, sensing human interaction, and machine learning methods for multimodal processing.



Title: The Art and Science of Speech Feature Engineering
Presenters: Sriram Ganapathy and Samuel Thomas, IBM T.J. Watson Research Center, USA

Abstract: With significant advances in mobile technology and audio sensing devices, there is a fundamental need to describe vast amounts of audio data in terms of representative lower-dimensional descriptors for efficient automatic processing. The extraction of these signal representations, also called features, constitutes the first step in processing a speech signal. The art and science of feature engineering lie in addressing two inherent challenges: extracting sufficient information from the speech signal for the task at hand, and suppressing unwanted redundancies for computational efficiency and robustness. The area of speech feature extraction combines a wide variety of disciplines, including signal processing, machine learning, psychophysics, information theory, linguistics and physiology. It has a rich history spanning more than five decades and has seen tremendous advances in the last few years, propelling the transition of speech technology from controlled environments to millions of end-user applications.

In this tutorial, we review the evolution of speech feature processing methods, summarize the advances of the last two decades and provide insights into the future of feature engineering. This will include discussions of the spectral representation methods developed in the past, human-auditory-motivated techniques for robust speech processing, data-driven unsupervised features like i-vectors, and recent advances in deep neural network based techniques. With experimental results, we will also illustrate the impact of these features on various state-of-the-art speech processing systems. The future of speech signal processing will need to address various robustness issues in complex acoustic environments while deriving useful information from big data.
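As a small illustration of one auditory-motivated idea mentioned above, the sketch below warps the frequency axis to the mel scale, the step that underlies mel filterbank and MFCC features. The scale constants are the common HTK-style convention, and the filter count and band edges are illustrative assumptions rather than details of any specific system in the tutorial.

```python
import math

def hz_to_mel(f):
    """Map frequency in Hz to the (HTK-convention) mel scale."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping, mel back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_centers(n_filters, f_min, f_max):
    """Center frequencies (Hz) of triangular filters spaced evenly in mel."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]

centers = mel_filter_centers(26, 0.0, 8000.0)
# Even spacing in mel means spacing in Hz grows with frequency:
# dense filters at low frequencies, sparse filters at high frequencies,
# mimicking the resolution of the human auditory system.
print(centers[0], centers[-1])
```

Filterbank energies computed on this warped axis, followed by a log and a decorrelating transform, give the classic MFCC front-end that the historical portion of the tutorial covers.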

Biography: Sriram Ganapathy is a research staff member at the IBM T.J. Watson Research Center, Yorktown Heights, USA, where he works on signal analysis methods for radio communication speech in highly degraded environments. He received his Doctor of Philosophy from the Center for Language and Speech Processing, Johns Hopkins University, in 2011 under the supervision of Prof. Hynek Hermansky. Prior to this, he obtained his Bachelor of Technology from the College of Engineering, Trivandrum, India, in 2004 and his Master of Engineering from the Indian Institute of Science, Bangalore, in 2006. He has also worked as a Research Assistant at the Idiap Research Institute, Switzerland, from 2006 to 2008, contributing to various speech and audio projects. His research interests include signal processing, machine learning and robust methodologies for speech and speaker recognition. He has over 50 publications in leading international journals and conferences in the area of speech and audio processing, along with several patents.

Biography: Samuel Thomas completed his B.Tech. degree in Computer Engineering from the Cochin University of Science and Technology, India (2000) and his M.S. degree in Computer Science and Engineering from the Indian Institute of Technology Madras, India (2006) before receiving his Doctor of Philosophy degree from the Johns Hopkins University, Baltimore, in 2012. His doctoral dissertation, supervised by Prof. Hynek Hermansky, was titled “Data-driven Neural Network Based Feature Front-ends for Automatic Speech Recognition”. Since graduation, he has been at the IBM T.J. Watson Research Center, New York, as a post-doctoral researcher with the Advanced LVCSR group. In the past, he has worked on several speech research projects and workshops with the Center for Language and Speech Processing (CLSP) at JHU, the Idiap Research Institute, Switzerland, and the TeNeT group, IIT Madras. His research interests include speech processing and machine learning for speech recognition, speech synthesis and speaker recognition. He has over 40 publications in various international journals and conferences in the area of speech and audio processing.



Title: Recent Advances in Speaker Diarization
Presenter: Hagai Aronowitz, IBM Research, Haifa, Israel

Abstract: The tutorial will start with an introduction to speaker diarization, giving a general overview of the subject. Afterwards, we will cover the basic background, including feature extraction and common modeling techniques such as GMMs and HMMs. Then we will discuss voice activity detection, usually the first processing step in speaker diarization, and describe the classic approaches to speaker diarization that are widely used today. We will then introduce the state-of-the-art techniques in speaker recognition required to understand modern speaker diarization, describe diarization approaches based on advanced representation methods (supervectors, speaker factors, i-vectors), and present the supervised and unsupervised learning techniques used for speaker diarization. We will also discuss issues such as coping with an unknown number of speakers, detecting and dealing with overlapping speech, diarization confidence estimation, and online speaker diarization. Finally, we will discuss two recent works: exploiting a priori acoustic information (for example, processing a meeting when some of the participants are known to the system in advance and training data is available for them), and modeling speaker-turn dynamics. If time permits, we will also discuss concepts such as multi-modal diarization and the use of TDOA (time difference of arrival) for diarization of meetings.
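The clustering stage at the heart of many diarization systems can be sketched in a few lines. Here each segment is represented by a fixed-length vector (standing in for an i-vector or similar embedding) and clusters are merged bottom-up until the closest pair is farther apart than a stopping threshold; the plain Euclidean distance and the toy vectors are illustrative assumptions, where a real system would use PLDA or cosine scoring.

```python
def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def centroid(vectors):
    """Mean vector of a cluster."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cluster_segments(vectors, stop_dist):
    """Agglomerative clustering: merge the closest pair of clusters until
    the closest remaining pair exceeds the stopping distance."""
    clusters = [[v] for v in vectors]
    while len(clusters) > 1:
        cents = [centroid(c) for c in clusters]
        pairs = [(dist(cents[i], cents[j]), i, j)
                 for i in range(len(cents)) for j in range(i + 1, len(cents))]
        d, i, j = min(pairs)
        if d > stop_dist:              # closest clusters too far apart: stop
            break
        clusters[i] += clusters.pop(j)  # merge cluster j into cluster i
    return clusters

# Toy segment vectors from two "speakers".
segs = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 4.9]]
clusters = cluster_segments(segs, stop_dist=1.0)
print(len(clusters))  # → 2, one cluster per speaker
```

The stopping threshold is what determines the number of speakers when it is unknown in advance, which is one of the practical issues the tutorial addresses.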

Biography: Dr. Hagai Aronowitz received the B.Sc. degree in Computer Science, Mathematics and Physics from the Hebrew University, Jerusalem, Israel, in 1994, and the M.Sc. degree (summa cum laude) and Ph.D. degree, both in Computer Science, from Bar-Ilan University, Ramat-Gan, Israel, in 2000 and 2006 respectively. From 1994 to 2000 he was a senior researcher in speech and speaker recognition for the Israeli Government. From 2000 to 2005 he was a researcher in speech and speaker recognition for small-footprint mobile devices at the Intel Speech Research Lab. In 2006-2007 he was a postdoctoral fellow in the Advanced LVCSR group at the IBM T. J. Watson Research Center, Yorktown Heights, NY. Since 2007 Dr. Aronowitz has been a research staff member at the IBM Haifa Research Lab, leading the voice biometrics team. His research interests include speaker identification, speaker diarization, and biometrics. Some of the techniques he co-invented are well known to the speech community, such as GMM supervectors, intersession variability modeling for speaker recognition, and within-speaker variability modeling for speaker diarization. Dr. Aronowitz is the author of more than 40 scientific publications in peer-reviewed conferences and journals and an inventor of 10 patents.




Title: Multimodal Speech Recognition with the AusTalk 3D Audio-Visual Corpus
Presenters: Roberto Togneri, Mohammed Bennamoun and Chao (Luke) Sui, University of Western Australia

Abstract: This tutorial will provide attendees with a brief overview of 3D-based AVSR research. Attendees will learn how to use the newly developed 3D audio-visual data corpus we derived from the AusTalk corpus for audio-visual speech/speaker recognition. We also plan to present results on this corpus showing a significant increase in speech recognition accuracy from integrating both depth-level and grey-level visual features. In the first part of the tutorial, we will review recent work published in the last decade, so that attendees can obtain an overview of the fundamental concepts and challenges in this field. In the second part, we will briefly describe the recording protocol and contents of the 3D data corpus, and show attendees how to use it for their own research. In the third part, we will present our results on the corpus. The experimental results show that, compared with conventional AVSR based on audio and grey-level visual features, the integration of grey and depth visual information boosts AVSR accuracy significantly. Moreover, we will explain experimentally why adding depth information benefits standard AVSR systems. Finally, we hope this tutorial will inspire more researchers in the community to contribute to this exciting area of research.
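The simplest way to combine the audio, grey-level and depth-level streams the abstract mentions is feature-level ("early") fusion: normalise each stream and concatenate per-frame features. The sketch below illustrates this; the frame counts, dimensions and z-score normalisation are illustrative assumptions, not details of the AusTalk setup.

```python
def zscore(stream):
    """Normalise each feature dimension of a stream to zero mean, unit variance."""
    dims = len(stream[0])
    means = [sum(f[d] for f in stream) / len(stream) for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((f[d] - means[d]) ** 2 for f in stream) / len(stream)
        stds.append(var ** 0.5 or 1.0)  # guard against constant dimensions
    return [[(f[d] - means[d]) / stds[d] for d in range(dims)] for f in stream]

def fuse(audio, grey, depth):
    """Concatenate per-frame features from three time-aligned streams."""
    assert len(audio) == len(grey) == len(depth), "streams must be aligned"
    return [a + g + d for a, g, d in
            zip(zscore(audio), zscore(grey), zscore(depth))]

# Two toy frames: 3 audio, 2 grey-level and 2 depth-level coefficients each.
audio = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
grey  = [[10.0, 20.0], [30.0, 40.0]]
depth = [[1.0, 1.5], [2.0, 2.5]]
fused = fuse(audio, grey, depth)
print(len(fused), len(fused[0]))  # → 2 frames of 3+2+2 = 7 dimensions
```

Per-stream normalisation before concatenation keeps any one modality from dominating the fused vector, which is why early-fusion systems typically include it.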

Biography: Professor Roberto Togneri received the B.E. degree in 1985 and the Ph.D. degree in 1989, both from The University of Western Australia. He joined the School of Electrical, Electronic and Computer Engineering at The University of Western Australia in 1988, where he is currently a Professor. Prof. Togneri is a member of the Signals and Systems Engineering Research Group and heads the Signal and Information Processing Lab. His research activities include signal processing and robust feature extraction of speech signals, statistical and neural network models for speech and speaker recognition, audio-visual recognition and biometrics, and related aspects of communications, information retrieval, and pattern recognition. He has published over 100 refereed journal and conference papers in the areas of signals and information systems, was the chief investigator on two Australian Research Council Discovery Project research grants from 2010 to 2013, and is currently an Associate Editor for IEEE Signal Processing Magazine Lecture Notes and an Editor for IEEE Transactions on Speech, Audio and Language Processing.

Biography: Winthrop Professor Mohammed Bennamoun received the M.E. degree from Queen's University in 1988 and the Ph.D. degree from the Queensland University of Technology in 1996. Since 1996 he has served as a full-time academic covering teaching, research and administrative roles, and he is currently a Winthrop Professor (Level E) at The University of Western Australia (UWA), where he served as Head of the School of Computer Science and Software Engineering for over five years. He has contributed significantly to the fields of 2D and 3D computer vision, image processing and machine learning, as evidenced by his publications, including a Springer monograph on object recognition, several invited and keynote papers, book chapters, over 50 journal papers and more than 150 internationally refereed conference papers. His publications have been cited more than 3000 times (Google Scholar) in highly regarded computer vision literature surveys and refereed conference and journal papers. In the last five years, he has managed four Australian Research Council Discovery Projects, two Australian Research Council Linkage Projects, and one LIEF research grant, with strong research outputs and high-quality publications with his collaborators.

Biography: Chao Sui received his B.Eng. degree in measurement and control technology from Jiangxi University of Science and Technology, P. R. China, in 2009. Subsequently, he received an M.Eng. (research) degree from the University of New South Wales, Australia, in 2011, where he worked with Dr Ngai Ming Kwok on appearance-based hand gesture recognition. He is currently pursuing his Ph.D. degree at The University of Western Australia, advised by Professor Roberto Togneri and Winthrop Professor Mohammed Bennamoun, on 3D-based audio-visual speech recognition. His current research interests include audio-visual speech recognition, computer vision and machine learning.



Title: Semantic Web and Linked Big Data Resources for Spoken Language Processing
Presenters: Dilek Hakkani-Tür and Larry Heck, Microsoft Research, US

Abstract: State-of-the-art statistical spoken language processing typically requires significant manual effort to construct domain-specific schemas (ontologies), as well as further manual effort to annotate training data against these schemas. At the same time, a recent surge of activity and progress on semantic-web-related concepts from the large search-engine companies represents a potential alternative to the manually intensive design of spoken language processing systems. Standards such as schema.org have been established for schemas (ontologies) that webmasters can use to semantically and uniformly mark up their web pages. Search engines like Bing, Google, and Yandex have adopted these standards and are leveraging them to create semantic search engines at the scale of the web. As a result, open linked data resources and semantic graphs covering various domains (such as Freebase [3]) have grown massively every year and contain far more information than any single resource anywhere on the Web. Furthermore, these resources contain links to text data (such as Wikipedia pages) related to the knowledge in the graph.

Recently, several studies on spoken language processing have started to exploit these massive linked data resources for language modeling and spoken language understanding. This tutorial will include a brief introduction to the semantic web and the linked data structure, available resources, and query languages. An overview of related work on information extraction and language processing will be presented, with the main focus on methods for learning spoken language understanding models from these resources.
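The linked data structure the tutorial introduces can be pictured as a store of (subject, predicate, object) triples queried by pattern matching, the model behind query languages like SPARQL. The toy store below is a sketch of that idea; the entities and relations are made up for illustration and do not come from any real knowledge graph.

```python
# A toy triple store: each fact is a (subject, predicate, object) triple,
# the same shape as entries in a linked-data graph such as Freebase.
triples = [
    ("film_a", "directed_by", "director_x"),
    ("film_a", "genre", "comedy"),
    ("film_b", "directed_by", "director_x"),
    ("film_b", "genre", "drama"),
]

def query(s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard,
    like a variable in a SPARQL triple pattern."""
    return [(ts, tp, to) for ts, tp, to in triples
            if (s is None or ts == s)
            and (p is None or tp == p)
            and (o is None or to == o)]

# "Which films were directed by director_x?"
films = [t[0] for t in query(p="directed_by", o="director_x")]
print(films)  # → ['film_a', 'film_b']
```

Spoken language understanding work in this vein maps a user utterance onto such triple patterns, so the graph supplies both the schema and weakly labeled training examples.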

Biography: Dilek Hakkani-Tür is a senior researcher at Microsoft Research. Prior to joining Microsoft, she was a senior researcher in the ICSI speech group (2006-2010) and a senior technical staff member in the Voice Enabled Services Research Department at AT&T Labs-Research in Florham Park, NJ (2001-2005). She received her B.Sc. degree from Middle East Technical University in 1994, and her M.Sc. and Ph.D. degrees from the Department of Computer Engineering, Bilkent University, in 1996 and 2000, respectively. Her Ph.D. thesis is on statistical language modeling for agglutinative languages. Her research interests include spoken language processing, spoken dialog systems, and active and unsupervised learning for language processing. She has over 30 granted patents and has co-authored more than 200 papers in these areas. She is the recipient of three best paper awards for her work on active learning, from the IEEE Signal Processing Society, ISCA and EURASIP, and was recognized as an IEEE Fellow in 2014. She is a member of ISCA, IEEE, and the Association for Computational Linguistics. She was an associate editor of IEEE Transactions on Audio, Speech and Language Processing (2005-2008), a member of the HLT advisory board (2009-2010), and an elected member of the IEEE Speech and Language Technical Committee (2009-2013). She currently serves as an area editor of IEEE Signal Processing Letters (2012-2014) and of Elsevier’s Digital Signal Processing Journal (2012-2014).

Biography: Larry Heck is a Microsoft Distinguished Engineer at Microsoft Research. His research area is natural conversational interaction, focusing on open-domain NLP and dialog, machine learning, multimodal NUI, and inference/reasoning under uncertainty. From 2005 to 2009, he was Vice President of Search & Advertising Sciences at Yahoo!, responsible for the creation, development, and deployment of the algorithms powering Yahoo! Search, Yahoo! Sponsored Search, Yahoo! Content Match, and Yahoo! display advertising. From 1998 to 2005, he was with Nuance Communications, where he served as Vice President of R&D, responsible for natural language processing, speech recognition, voice authentication, and text-to-speech synthesis technology. He began his career as a researcher at the Stanford Research Institute (1992-1998), initially in the field of acoustics and later in speech research with the Speech Technology and Research (STAR) Laboratory. Dr. Heck received his Ph.D. in Electrical Engineering from the Georgia Institute of Technology in 1991.



Title: Speech and Audio for Multimedia Semantics
Presenters: Florian Metze (Carnegie Mellon University, USA) and Koichi Shinoda (Tokyo Institute of Technology, Japan)

Abstract: Internet media sharing sites and the one-click upload capability of smartphones are producing a deluge of multimedia content. While visual features are often dominant in such material, acoustic and speech information in particular often complements it. By facilitating access to large amounts of data, the text-based Internet gave a huge boost to the field of natural language processing. The vast amount of consumer-produced video becoming available now will do the same for video processing, eventually enabling semantic understanding of multimedia material, with implications for human computer interaction, robotics, etc.

Large-scale multi-modal analysis of audio-visual material is now central to a number of multi-site research projects around the world. While each of these projects has slightly different targets, they face largely the same challenges: how to robustly and efficiently process large amounts of data, how to represent and then fuse information across modalities, how to train classifiers and segmenters on unlabeled data, how to include human feedback, and so on.

In this tutorial, we will present the state of the art in large-scale video, speech, and non-speech audio processing, and show how these approaches are being applied to tasks such as content based video retrieval (CBVR) and multimedia event detection (MED). We will introduce the most important tools and techniques, and show how the combination of information across modalities can be used to induce semantics on multimedia material through ranking of information and fusion. Finally, we will discuss opportunities for research that the INTERSPEECH community specifically will find interesting and fertile.

Biography: Florian Metze received his PhD from Universität Karlsruhe (TH) for a thesis on “Articulatory Features for Conversational Speech Recognition” in 2005. He worked as a Senior Research Scientist at Deutsche Telekom Laboratories (T-Labs) from 2006 to 2009, and is now on the faculty of Carnegie Mellon University. Dr. Metze has worked on a wide range of topics in the field of speech and audio processing, information retrieval, and user interfaces, and holds several patents. Recent work includes low-resource speech recognition, non-textual aspects of speech (such as the perception of personality in speech and emotional speech synthesis), as well as video retrieval and textual summarization of multimedia material. He regularly participates in international evaluations, most recently in TRECVID 2013 (MED and MER) and OpenKWS 2013. He is also one of the founders of the “Speech Recognition Virtual Kitchen”.

Biography: Koichi Shinoda received his B.S. in 1987 and his M.S. in 1989, both in physics, from the University of Tokyo. He received his D.Eng. in computer science from Tokyo Institute of Technology in 2001. In 1989, he joined NEC Corporation and was involved in research on automatic speech recognition. From 1997 to 1998, he was a visiting scholar with Bell Labs, Lucent Technologies. From 2001, he was an Associate Professor with the University of Tokyo. He is currently a Professor at the Tokyo Institute of Technology. His research interests include speech recognition, video information retrieval, and human interfaces. Dr. Shinoda received the Awaya Prize from the Acoustical Society of Japan in 1997 and the Excellent Paper Award from the Institute of Electronics, Information, and Communication Engineers (IEICE) in 1998. He is an Associate Editor of Computer Speech and Language and a Subject Editor of Speech Communication. He is a member of IEEE, ACM, ASJ, IEICE, IPSJ, and JSAI.

[Back to Top]


Title: Adaptation Techniques for Statistical Speech Recognition
Presenter: Kai Yu (Shanghai Jiao Tong University, Shanghai)

Abstract: Adaptation techniques make better use of existing models when test data come from new acoustic or linguistic conditions, and form an important and challenging research area of statistical speech recognition. This tutorial gives a systematic review of the fundamental theory as well as an introduction to state-of-the-art adaptation techniques, covering both acoustic and language model adaptation. Following a simple example of acoustic model adaptation, basic concepts, procedures and categories of adaptation will be introduced. A number of advanced techniques will then be discussed, such as discriminative adaptation, Deep Neural Network adaptation, adaptive training, and the relationship to noise robustness. After this detailed review of acoustic model adaptation, language model adaptation, such as topic adaptation, will also be introduced. The tutorial concludes with a summary and a discussion of future research directions.
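To give a flavor of transform-based acoustic model adaptation (in the spirit of MLLR-style mean transforms, not code from the tutorial itself), the sketch below estimates a single shared affine transform that maps speaker-independent Gaussian means onto a new speaker's data. All data and dimensions are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Speaker-independent means of 3 toy Gaussian components (2-D features).
si_means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])

# Simulated adaptation data: the new speaker scales and shifts the space.
true_A = np.array([[1.2, 0.0], [0.0, 0.8]])
true_b = np.array([1.0, -0.5])
frames, labels = [], []
for k, mu in enumerate(si_means):
    frames.append(rng.normal(mu @ true_A.T + true_b, 0.3, size=(200, 2)))
    labels += [k] * 200
X, labels = np.vstack(frames), np.array(labels)

# MLLR-style estimation: choose one shared affine transform (A, b) so that
# A @ mu_k + b matches the per-component sample means, via least squares.
targets = np.array([X[labels == k].mean(axis=0) for k in range(3)])
design = np.hstack([si_means, np.ones((3, 1))])   # rows are [mu_k, 1]
W, _, _, _ = np.linalg.lstsq(design, targets, rcond=None)
A_hat, b_hat = W[:2].T, W[2]

# Adapted means for the new speaker.
adapted_means = si_means @ A_hat.T + b_hat
```

Because the transform is shared across components, a few seconds of adaptation data can update every Gaussian in the model at once, which is the key appeal of this family of methods.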

Biography: Kai Yu is a research professor in the Computer Science and Engineering Department of Shanghai Jiao Tong University, China. He obtained his Bachelor and Master degrees from Tsinghua University, Beijing, China and his Ph.D. from Cambridge University. He has published over 50 peer-reviewed journal and conference publications on speech recognition, synthesis and dialogue systems. He was a key member of the Cambridge team that built state-of-the-art LVCSR systems in the DARPA-funded EARS and GALE projects. He has also managed the design and implementation of a large-scale real-world ASR cloud service. He is a senior member of IEEE and a member of ISCA and the IET. He was the area chair for speech recognition and processing for INTERSPEECH 2009 and EUSIPCO 2011, the publication chair for IEEE ASRU 2011, and the area chair of spoken dialogue systems for INTERSPEECH 2014. He was selected into the "1000 Overseas Talent Plan (Young Talent)" by the Chinese central government in 2012. He was also selected into the Programme for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning.

[Back to Top]


Title: Emotion and Mental State Recognition: Features, Models, System Applications and Beyond
Presenters: Chung-Hsien Wu (National Cheng Kung University, Tainan City), Hsin-Min Wang (Academia Sinica, Taipei), Julien Epps (The University of New South Wales, Australia) and Vidhyasaharan Sethu (The University of New South Wales, Australia)

Abstract: Emotion recognition is the ability to identify what a person is feeling from moment to moment and to understand the connection between feelings and expressions. In today’s world, human-computer interaction (HCI) interfaces undoubtedly play an important role in our daily lives. Toward harmonious HCI interfaces, automated analysis and recognition of human emotion has attracted increasing attention from researchers in multidisciplinary research fields. A specific area of current interest, which also has key implications for HCI, is the estimation of cognitive load (mental workload), research into which is still at an early stage. Technologies for processing daily activities, including speech, text and music, have expanded the interaction modalities between humans and computer-supported communicational artifacts.

In this tutorial, we will present theoretical and practical work offering new and broad views of the latest research in emotional awareness from audio and speech. We will cover topics spanning a variety of theoretical backgrounds and applications, ranging from salient emotional features, emotional-cognitive models, and compensation methods for variability due to speaker and linguistic content, to machine learning approaches applicable to emotion recognition. For each topic, we will review the state of the art by introducing current methods and presenting several applications. In particular, the application to cognitive load estimation will be discussed, from its psychophysiological origins to system design considerations. Eventually, technologies developed in different areas will be combined for future applications, so in addition to a survey of future research challenges, we will envision a few scenarios in which affective computing can make a difference.

Biography: Prof. Chung-Hsien Wu received the Ph.D. degree in electrical engineering from National Cheng Kung University, Tainan, Taiwan, R.O.C., in 1991. Since August 1991, he has been with the Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan. He became professor and distinguished professor in August 1997 and August 2004, respectively. From 1999 to 2002, he served as the Chairman of the Department. Currently, he is the deputy dean of the College of Electrical Engineering and Computer Science, National Cheng Kung University. He also worked at the Computer Science and Artificial Intelligence Laboratory of the Massachusetts Institute of Technology (MIT), Cambridge, MA, in summer 2003 as a visiting scientist. He received the Outstanding Research Award of the National Science Council in 2010 and the Distinguished Electrical Engineering Professor Award of the Chinese Institute of Electrical Engineering in 2011, Taiwan. He is currently an associate editor of IEEE Trans. Audio, Speech and Language Processing, IEEE Trans. Affective Computing, ACM Trans. Asian Language Information Processing, and the Subject Editor on Information Engineering of the Journal of the Chinese Institute of Engineers (JCIE). His research interests include affective speech recognition, expressive speech synthesis, and spoken language processing. Dr. Wu is a senior member of IEEE and a member of the International Speech Communication Association (ISCA). He was the President of the Association for Computational Linguistics and Chinese Language Processing (ACLCLP) from 2009 to 2011. He was the Chair of the IEEE Tainan Signal Processing Chapter and has been the Vice Chair of the IEEE Tainan Section since 2009.

Biography: Dr. Hsin-Min Wang received the B.S. and Ph.D. degrees in electrical engineering from National Taiwan University in 1989 and 1995, respectively. In October 1995, he joined the Institute of Information Science, Academia Sinica, where he is now a research fellow and deputy director. He was an adjunct associate professor with National Taipei University of Technology and National Chengchi University. He currently serves as the president of the Association for Computational Linguistics and Chinese Language Processing (ACLCLP), a managing editor of Journal of Information Science and Engineering, and an editorial board member of International Journal of Computational Linguistics and Chinese Language Processing. His major research interests include spoken language processing, natural language processing, multimedia information retrieval, and pattern recognition. Dr. Wang received the Chinese Institute of Engineers (CIE) Technical Paper Award in 1995 and the ACM Multimedia Grand Challenge First Prize in 2012. He is a senior member of IEEE, a member of ISCA and ACM, and a life member of Asia Pacific Signal and Information Processing Association (APSIPA), ACLCLP, and Institute of Information & Computing Machinery (IICM).

Biography: Dr Julien Epps received the BE and PhD degrees in Electrical Engineering from the University of New South Wales, Australia, in 1997 and 2001 respectively. After an appointment as a Postdoctoral Fellow at the University of New South Wales, he worked on speech recognition and speech processing research firstly as a Research Engineer at Motorola Labs and then as a Senior Researcher at National ICT Australia. He was appointed as a Senior Lecturer in the UNSW School of Electrical Engineering and Telecommunications in 2007 and then as an Associate Professor in 2013. Dr Epps has also held visiting academic and research appointments at The University of Sydney and the A*STAR Institute for Infocomm Research (Singapore). He has authored or co-authored around 150 publications, which have been collectively cited more than 1500 times. He has served as a reviewer for most major speech processing journals and conferences and as a Guest Editor for the EURASIP Journal on Advances in Signal Processing Special Issue on Emotion and Mental State Recognition from Speech. He has also co-organised or served on the committees of key workshops related to this tutorial, such as the ACM ICMI Workshop on Inferring Cognitive and Emotional States from Multimodal Measures (2011), ASE/IEEE Int. Conf. on Social Computing Workshop on Wide Spectrum Social Signal Processing (2012), 4th International Workshop on Corpora for Research on Emotion, Sentiment and Social Signals (Satellite of LREC 2012), Audio/Visual Emotion Challenge and Workshop AVEC 2011 (part of the Int. Conf. on Affective Computing and Intelligent Interaction), AVEC 2012 (part of ACM ICMI) and AVEC 2013 (part of ACM Multimedia). His research interests include applications of speech modelling to emotion and mental state classification and speaker verification.

Biography: Dr Vidhyasaharan Sethu received his BE degree from Anna University, India, and his MEngSc (Signal Processing) degree from the University of New South Wales, Australia. He was awarded his PhD in 2010 for his work on Automatic Emotion Recognition, by the University of New South Wales (UNSW). Following this, he worked as a Postdoctoral Research Fellow at the speech research group at UNSW on the joint modelling of linguistic and paralinguistic information in speech with a focus on emotion recognition. He is currently a Lecturer in Signal Processing at the School of Electrical Engineering and Telecommunications, University of New South Wales, Sydney, Australia. He teaches courses on speech processing, signal processing and electrical system design in the school and is a reviewer for a number of journals including Speech Communication and EURASIP Journal on Audio, Speech and Music Processing and IEEE Transactions on Education. His research interests include emotion recognition, speaker recognition, language identification and the application of machine learning in speech processing.

[Back to Top]


Title: Unsupervised Speech and Language Processing via Topic Models
Presenter: Jen-Tzung Chien (National Chiao Tung University, Hsinchu)

Abstract: In this tutorial, we will present state-of-the-art machine learning approaches for speech and language processing, highlighting unsupervised methods for structural learning from unlabeled sequential patterns. In general, speech and language processing involves extensive knowledge of statistical models. In the era of big data, we must design flexible, scalable and robust systems to cope with heterogeneous and nonstationary environments. This tutorial starts with an introduction to unsupervised speech and language processing based on factor analysis and independent component analysis. Unsupervised learning is then generalized to a latent variable model known as the topic model. The evolution of topic models from latent semantic analysis to the hierarchical Dirichlet process, from non-Bayesian parametric models to Bayesian nonparametric models, and from single-layer models to hierarchical tree models will be surveyed in an organized fashion. Inference approaches based on variational Bayes and Gibbs sampling are introduced. We will also present several case studies on topic modeling for speech and language applications, including language models, document models, retrieval models, segmentation models and summarization models. Finally, we will point out new trends in topic models for speech and language processing.
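As a minimal, concrete illustration of the topic-model idea (an invented example, not material from the tutorial), the sketch below fits a two-topic latent Dirichlet allocation model to four toy documents using scikit-learn's variational-Bayes implementation; the corpus and hyperparameters are made up for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Two toy subjects with disjoint vocabularies: speech vs. finance.
docs = [
    "speech recognition acoustic model training",
    "acoustic model speech decoding recognition",
    "stock market price trading finance",
    "finance market trading stock investment",
]

# Bag-of-words counts, then LDA with two latent topics.
vec = CountVectorizer()
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0, max_iter=50)
theta = lda.fit_transform(X)      # per-document topic proportions

# Each row of theta is a distribution over topics; documents about the
# same subject should share the same dominant topic.
dominant = theta.argmax(axis=1)
```

The same document-topic matrix `theta` is what downstream applications (retrieval, segmentation, summarization) typically consume; Gibbs-sampling inference would produce an analogous quantity from topic-assignment counts.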

Biography: Jen-Tzung Chien received his Ph.D. degree in electrical engineering from National Tsing Hua University, Hsinchu, in 1997. During 1997-2012, he was with National Cheng Kung University, Tainan. Since 2012, he has been with the Department of Electrical and Computer Engineering, National Chiao Tung University (NCTU), Hsinchu, where he is currently a Distinguished Professor. He serves as an adjunct professor in the Department of Computer Science, NCTU. He has held visiting researcher positions at Panasonic Technologies Inc., Santa Barbara, CA; the Tokyo Institute of Technology, Tokyo, Japan; the Georgia Institute of Technology, Atlanta, GA; Microsoft Research Asia, Beijing, China; and the IBM T. J. Watson Research Center, Yorktown Heights, NY. His research interests include machine learning, speech recognition, information retrieval and blind source separation. He served as an associate editor of the IEEE Signal Processing Letters in 2008-2011, a guest editor of the IEEE Transactions on Audio, Speech and Language Processing in 2012, an organization committee member of ICASSP 2009, and an area coordinator of Interspeech 2012. He was appointed as an APSIPA Distinguished Lecturer for 2012-2013. He received the Distinguished Research Award from the National Science Council in 2006 and 2010. He was a co-recipient of the Best Paper Award of the IEEE Automatic Speech Recognition and Understanding Workshop in 2011. Dr. Chien has served as a tutorial speaker for ICASSP 2012 in Kyoto, Interspeech 2013 in Lyon, and APSIPA 2013 in Kaohsiung.

[Back to Top]


Title: Deep Learning for Speech Generation and Synthesis
Presenters: Yao Qian and Frank K. Soong (Microsoft Research Asia, Beijing)

Abstract: Deep learning, which can represent high-level abstractions in data with an architecture of multiple non-linear transformations, has made a huge impact on automatic speech recognition (ASR) research, products and services. However, deep learning for speech generation and synthesis (i.e., text-to-speech), the inverse process of speech recognition (i.e., speech-to-text), has not yet generated similar momentum. Recently, motivated by the success of Deep Neural Networks in speech recognition, several neural-network-based approaches have been applied successfully to improve the performance of statistical parametric speech generation and synthesis. In this tutorial, we focus on deep learning approaches to problems in speech generation and synthesis, especially Text-to-Speech (TTS) synthesis and voice conversion.

First, we review the current mainstream of statistical parametric speech generation and synthesis: GMM-HMM based speech synthesis and GMM-based voice conversion, with emphasis on the major factors responsible for the quality problems of GMM-based synthesis/conversion and the intrinsic limitations of decision-tree based contextual state clustering and state-based statistical distribution modeling. We then present the latest deep learning algorithms for feature parameter trajectory generation, in contrast to deep learning for recognition or classification. We cover common technologies in Deep Neural Networks (DNN) and their refinements: Mixture Density Networks (MDN), Recurrent Neural Networks (RNN) with Bidirectional Long Short-Term Memory (BLSTM), and Conditional RBMs (CRBM). Finally, we share our research insights and hands-on experience in building speech generation and synthesis systems based upon deep learning algorithms.
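To make the regression view of DNN-based parametric synthesis concrete, here is a minimal self-contained sketch (with invented synthetic data, not the presenters' system): a small feed-forward network trained by gradient descent to map "linguistic context" vectors to "acoustic parameter" frames, the core mapping such systems learn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for TTS acoustic modeling: 10-D context features in,
# 3-D acoustic parameter frames out (real systems use hundreds of dims).
X = rng.normal(size=(500, 10))
true_W = rng.normal(size=(10, 3))
Y = np.tanh(0.3 * (X @ true_W)) + 0.01 * rng.normal(size=(500, 3))

# One-hidden-layer network trained with full-batch gradient descent on MSE.
H, lr = 32, 0.1
W1 = 0.1 * rng.normal(size=(10, H)); b1 = np.zeros(H)
W2 = 0.1 * rng.normal(size=(H, 3)); b2 = np.zeros(3)

def forward(X):
    h = np.tanh(X @ W1 + b1)          # hidden activations
    return h, h @ W2 + b2             # predicted acoustic frames

_, pred0 = forward(X)
mse0 = float(((pred0 - Y) ** 2).mean())   # error before training

for _ in range(1000):
    h, pred = forward(X)
    err = (pred - Y) / len(X)         # gradient of half the summed error
    gW2, gb2 = h.T @ err, err.sum(axis=0)
    dh = (err @ W2.T) * (1.0 - h ** 2)    # backprop through tanh
    gW1, gb1 = X.T @ dh, dh.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

_, pred = forward(X)
mse = float(((pred - Y) ** 2).mean())     # error after training
```

An MDN would replace the final linear output with mixture parameters, and a BLSTM would replace the per-frame hidden layer with recurrence over the frame sequence, but the trained input-to-output mapping plays the same role.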

Biography: Yao Qian is a Lead Researcher in the Speech Group, Microsoft Research Asia. She received her Ph.D. from the Department of Electronic Engineering, The Chinese University of Hong Kong, in 2005, and joined Microsoft Research Asia in September 2005, right after receiving her degree. Her research interests are in spoken language processing, including TTS speech synthesis and automatic speech recognition. Her recent research projects include speech synthesis, voice transformation, prosody modeling and Computer-Assisted Language Learning (CALL). She has over 50 publications in international journals and conference proceedings, and ten U.S. patent applications, five of them issued. She has been recognized within Microsoft and in the speech research community for her contributions to TTS and many other speech technologies. She is a senior member of IEEE and a member of ISCA.

Biography: Frank K. Soong is a Principal Researcher in the Speech Group, Microsoft Research Asia (MSRA), Beijing, China, where he works on fundamental research on speech and its practical applications. His professional research career spans over 30 years, first with Bell Labs, US, then with ATR, Japan, before joining MSRA in 2004. At Bell Labs, he worked on stochastic modeling of speech signals, optimal decoder algorithms, speech analysis and coding, and speech and speaker recognition. He was responsible for developing the recognition algorithm that was incorporated into voice-activated mobile phone products rated by Mobile Office Magazine (Apr. 1993) as "outstandingly the best". He is a co-recipient of the Bell Labs President Gold Award for developing the Bell Labs Automatic Speech Recognition (BLASR) software package.

He has served as a member of the Speech and Language Technical Committee of the IEEE Signal Processing Society and in other society functions, including Associate Editor of the IEEE Transactions on Speech and Audio Processing and chairing IEEE international workshops. He has published more than 200 papers and co-edited a widely used reference book, Automatic Speech and Speaker Recognition: Advanced Topics (Kluwer, 1996). He is a visiting professor of The Chinese University of Hong Kong (CUHK) and a few other top-rated universities in China, and the co-Director of the MSRA-CUHK Joint Research Lab. He received his BS, MS and PhD degrees from National Taiwan University, the University of Rhode Island, and Stanford University, respectively, all in Electrical Engineering. He is an IEEE Fellow.

[Back to Top]



Copyright © 2013-2015 Chinese and Oriental Languages Information Processing Society
Conference managed by Meeting Matters International