Sunday 14 September, 2014

The INTERSPEECH 2014 Organising Committee is pleased to announce the following eight tutorials, presented by distinguished speakers, which will be offered on Sunday, 14 September 2014. All tutorials will be three (3) hours in duration and require an additional registration fee (separate from the conference registration fee).

Tutorial Fees

The tutorial handouts will be provided electronically ahead of the tutorials. Please download and print them at your convenience, as hard copies will not be provided at the conference.

Morning tutorials

0930 – 1230


Non-speech acoustic event detection and classification
- Tuomas Virtanen and Jort F. Gemmeke


Contribution of MRI to Exploring and Modeling Speech Production
- Kiyoshi Honda and Jianwu Dang


Computational Models for Audiovisual Emotion Perception
- Emily Mower Provost and Carlos Busso


The Art and Science of Speech Feature Engineering
- Samuel Thomas and Sriram Ganapathy

Afternoon tutorials

1400 – 1700


Recent Advances in Speaker Diarization
- Hagai Aronowitz


Multimodal Speech Recognition with the AusTalk 3D Audio-Visual Corpus
- Roberto Togneri, Mohammed Bennamoun and Chao Sui


Semantic Web and Linked Big Data Resources for Spoken Language Processing
- Dilek Hakkani-Tur and Larry Heck


Speech and Audio for Multimedia Semantics
- Florian Metze and Koichi Shinoda


The ISCSLP 2014 Organising Committee welcomes INTERSPEECH 2014 delegates to join the four ISCSLP tutorials, which will be offered on Saturday, 13 September 2014.


1330 – 1530


Adaptation Techniques for Statistical Speech Recognition
- Kai Yu


Emotion and Mental State Recognition: Features, Models, System Applications and Beyond
- Chung-Hsien Wu, Hsin-Min Wang, Julien Epps and Vidhyasaharan Sethu


1600 – 1800


Unsupervised Speech and Language Processing via Topic Models
- Jen-Tzung Chien


Deep Learning for Speech Generation and Synthesis
- Yao Qian and Frank K. Soong


Title: Non-speech acoustic event detection and classification
Presenters: Tuomas Virtanen (Tampere University of Technology, Finland) and Jort F. Gemmeke (KU Leuven, Belgium)

Abstract: Research in audio signal processing has been dominated by speech research, yet most of the sounds in our real-life environments are actually non-speech events such as cars passing by, wind, warning beeps, and animal sounds. These acoustic events carry much information about the environment and the physical events that take place in it, enabling novel application areas such as safety, health monitoring and the investigation of biodiversity. But while recent years have seen widespread adoption of applications such as speech recognition and song recognition, generic computer audition is still in its infancy.

Non-speech acoustic events differ from speech in several fundamental ways, but many of the core algorithms used by speech researchers can be leveraged for generic audio analysis. This tutorial is a comprehensive review of the field of acoustic event detection as it currently stands. Its goal is to foster interest in the community, highlight the challenges and opportunities, and provide a starting point for new researchers. We will discuss what acoustic event detection entails and its commonalities with and differences from speech processing, such as the large variation in sounds and the possible overlap with other sounds. We will then cover basic experimental and algorithm design, including descriptions of available databases and machine learning methods, before turning to more advanced topics such as methods for dealing with temporally overlapping sounds and modelling the relations between sounds. We will finish with a discussion of avenues for future research.
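To make the detection task concrete, here is a minimal sketch of frame-based acoustic event detection. It uses a simple energy threshold as the detector and merges adjacent active frames into events; the frame sizes, threshold and the energy criterion are illustrative assumptions, and a real system would replace the threshold with a trained classifier (e.g. a GMM or neural network) over spectral features.

```python
def frame_signal(signal, frame_len, hop):
    """Split a 1-D signal into overlapping frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def detect_events(signal, frame_len=256, hop=128, threshold=0.01):
    """Return (start, end) sample indices of detected events: frames whose
    mean energy exceeds the threshold, with overlapping frames merged."""
    events = []
    for idx, frame in enumerate(frame_signal(signal, frame_len, hop)):
        energy = sum(x * x for x in frame) / frame_len
        if energy > threshold:
            start = idx * hop
            end = start + frame_len
            if events and start <= events[-1][1]:   # overlaps previous event
                events[-1] = (events[-1][0], end)   # merge into it
            else:
                events.append((start, end))
    return events

# Toy example: silence, then a loud burst, then silence again.
sig = [0.0] * 1000 + [0.5] * 1000 + [0.0] * 1000
print(detect_events(sig))  # → [(768, 2176)]
```

The frame/merge structure is what matters here: swapping the energy test for a classifier score turns this into the standard detection-by-classification pipeline.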

Biography: Tuomas Virtanen is an Academy Research Fellow and an adjunct professor in Department of Signal Processing, Tampere University of Technology (TUT), Finland. He received the M.Sc. and Doctor of Science degrees in information technology from TUT in 2001 and 2006, respectively. He has also been working as a research associate at Cambridge University Engineering Department, UK. He is known for his pioneering work on single-channel sound source separation using non-negative matrix factorization based techniques, and their application to noise-robust speech recognition and music content analysis. In addition to the above topics, his research interests include content analysis of generic audio signals and machine learning.


Biography: Jort Florent Gemmeke is a postdoctoral researcher at the KU Leuven, Belgium. He received the M.Sc degree in physics from the Universiteit van Amsterdam (UvA) in 2005. In 2011, he received the Ph.D degree from the University of Nijmegen on the subject of noise robust ASR using missing data techniques. He is known for pioneering the field of exemplar-based noise robust ASR. His research interests are computer audition, automatic speech recognition, source separation, noise robustness and acoustic modelling, in particular exemplar-based methods and methods using sparse representations.




Title: Contribution of MRI to Exploring and Modeling Speech Production
Presenters: Kiyoshi HONDA (Tianjin University, China) and Jianwu DANG (JAIST, Japan)

Abstract: Magnetic resonance imaging (MRI) offers a remarkable window into the human body, not only through static imaging but also through motion imaging. MRI has been a powerful technique for speech research, used to study the finer anatomy of the speech organs and to visualize true vocal tracts in three dimensions. Inherent problems of slow image acquisition for speech tasks and insufficient signal-to-noise ratio for microscopic observation have obliged researchers to search for task-specific imaging techniques. Recent advances in 3-Tesla technology suggest more practical solutions for broader applications of MRI by overcoming previous technical limitations. In this joint tutorial in two parts, we summarize our previous efforts to accumulate scientific knowledge with MRI and to advance speech modeling studies for future development. Part 1, given by Kiyoshi Honda, introduces how to visualize the speech organs and vocal tracts, presenting techniques and data for finer static imaging, synchronized motion imaging, surface marker tracking, real-time imaging, and vocal-tract mechanical modeling. Part 2, presented by Jianwu Dang, focuses on applications of MRI to the phonetics of Mandarin vowels, the acoustics of vocal tracts with side branches, analysis and simulation in search of talker characteristics, physiological modeling of the articulatory system, and motor control paradigms for speech articulation.

Biography: Professor Kiyoshi HONDA, after graduating from Nara Medical University in 1971, spent four years at the University of Tokyo Hospital and then began voice and speech research at the Research Institute of Logopedics and Phoniatrics, University of Tokyo, and at Haskins Laboratories, USA, as a research associate. He moved to Kanazawa Institute of Technology as an associate professor in the Department of Electronics in 1986, then joined Advanced Telecommunications Research Institute International (ATR-I) in 1991 to continue voice and speech research as a supervisor. He also spent three years at the University of Wisconsin as a senior researcher on the X-ray microbeam project. He moved to the Phonetics and Phonology Lab, CNRS-University of Paris III, France, in 2006, and is currently a professor at Tianjin University, School of Computer Science and Technology, under the foreigner 1000-plan program. His research interests include the anatomy and physiology of the speech organs, acoustic/articulatory phonetics, electronic instrumentation techniques, and MRI-based visualization techniques. He has developed stereo-endoscopy, high-speed digital imaging, and vocal-tract imaging methods. His current project for the One-Thousand Talents Plan is “Multisensory Instrumentation for normal and pathological speech”.

Kiyoshi Honda has been working on MRI-based studies on voice and speech mechanisms since 1991. The major work on the tutorial topic includes static and dynamic MRI techniques for visualizing voice and speech organs, fundamental frequency control mechanisms, articulatory muscle functions, and roles of the hypopharyngeal cavities and interdental space.

Biography: Professor Jianwu Dang graduated from Tsinghua University, China, in 1982, and received his M.S. degree from the same university in 1984. He worked at Tianjin University as a lecturer from 1984 to 1988, and was awarded his Ph.D. by Shizuoka University, Japan, in 1992. He then worked at ATR Human Information Processing Labs, Japan, as a senior researcher from 1992 to 2001, and joined the University of Waterloo, Canada, as a visiting scholar for one year from 1998. Since 2001, he has worked at the Japan Advanced Institute of Science and Technology (JAIST) as a professor. He joined the Institut de la Communication Parlée (ICP), Centre National de la Recherche Scientifique (CNRS), France, as a research scientist first class from 2002 to 2003. Since 2009, he has also served at Tianjin University, China, as the dean of the School of Computer Science and Technology, and was a chair professor in the Department of Computer Science and Technology, Tsinghua University, China, from 2010 to 2012. He was awarded the title of scholar of the One-Thousand Talents Plan, and is the principal scientist of a National Basic Research Program (973 Program) project. His research interests span the fields of speech production, speech synthesis, and speaker recognition. He built MRI-based physiological models for speech and swallowing, and endeavours to apply these models to clinical issues.

Jianwu Dang has worked on MRI-based studies for articulatory and acoustic modeling since 1992. The major work for the tutorial topics includes acoustic roles of the nasal/paranasal cavities and the piriform fossa, construction of a 3D physiological articulatory model for speech synthesis and swallowing simulation, motor control of the articulatory model, and individual adaptation of the articulatory model.



Title: Computational Models for Audiovisual Emotion Perception
Presenters: Emily Mower Provost (University of Michigan) and Carlos Busso (The University of Texas at Dallas)

Abstract: In this tutorial we will explore engineering approaches to understanding human emotion perception, focusing on both modeling and application. We will highlight current and historical trends in emotion perception modeling, covering both psychological and engineering-driven theories of perception (statistical analyses, data-driven computational modeling, and implicit sensing). The importance of this topic can be appreciated from an engineering viewpoint: any system that either models human behavior or interacts with human partners must understand emotion perception, as it fundamentally underlies and modulates our communication. It matters equally from a psychological perspective: emotion perception is used in the diagnosis of many mental health conditions and is tracked in therapeutic interventions. Research in emotion perception seeks to identify models that describe the felt sense of ‘typical’ emotion expression, i.e., an observer/evaluator’s attribution of the emotional state of the speaker. This felt sense is a function of the methods through which individuals integrate the presented multimodal emotional information. We will cover psychological theories of emotion, engineering models of emotion, and experimental approaches to measuring emotion, and will demonstrate how these modeling strategies can be used as a component of emotion classification frameworks and to inform the design of emotional behaviors.

Biography: Emily Mower Provost received her B.S. in Electrical Engineering (summa cum laude and with thesis honors) from Tufts University, Boston, MA in 2004 and her M.S. and Ph.D. in Electrical Engineering from the University of Southern California (USC), Los Angeles, CA in 2007 and 2010, respectively. She is a member of Tau Beta Pi, Eta Kappa Nu, IEEE, and ISCA. She has been awarded the National Science Foundation Graduate Research Fellowship (2004-2007), the Herbert Kunzel Engineering Fellowship from USC (2007-2008, 2010-2011), the Intel Research Fellowship (2008-2010), and the Achievement Rewards For College Scientists (ARCS) Award (2009-2010). Her research interests are in human-centered speech and video processing, multimodal interface design, and speech-based assistive technology. The goals of her research are motivated by the complexities of human emotion generation and perception.


Biography: Carlos Busso is an Assistant Professor in the Electrical Engineering Department of The University of Texas at Dallas (UTD). He received his Ph.D. (2008) from USC. Before joining UTD, he was a Postdoctoral Research Associate at the Signal Analysis and Interpretation Laboratory (SAIL), USC. He was selected by the School of Engineering of Chile as the best electrical engineer to graduate in Chile in 2003 across Chilean universities. At USC, he received a Provost Doctoral Fellowship from 2003 to 2005 and a Fellowship in Digital Scholarship from 2007 to 2008. At UTD, he leads the Multimodal Signal Processing (MSP) laboratory. He received the Hewlett Packard Best Paper Award at IEEE ICME 2011 (with J. Jain) and is co-author of the winning paper of the Classifier Sub-Challenge at the INTERSPEECH 2009 Emotion Challenge. He has served on the organizing committees of several international conferences, including Interspeech 2016 (Publication Chair), IEEE FG 2015 (Doctoral Consortium Chair), IEEE ICME 2014 (Publication Chair) and ACM ICMI 2014 (Workshop Chair). His research interests are in digital signal processing, speech and video processing, affective computing and multimodal interfaces. His current research spans the broad areas of audiovisual emotion perception, affective computing, multimodal human-machine interfaces, modeling and synthesis of verbal and nonverbal behaviors, sensing human interaction, and machine learning methods for multimodal processing.



Title: The Art and Science of Speech Feature Engineering
Presenters: Sriram Ganapathy and Samuel Thomas, IBM T.J. Watson Research Center, USA

Abstract: With significant advances in mobile technology and audio sensing devices, there is a fundamental need to describe vast amounts of audio data in terms of representative lower-dimensional descriptors for efficient automatic processing. The extraction of these signal representations, also called features, constitutes the first step in processing a speech signal. The art and science of feature engineering lie in addressing two inherent challenges: extracting sufficient information from the speech signal for the task at hand, and suppressing unwanted redundancies for computational efficiency and robustness. The area of speech feature extraction combines a wide variety of disciplines, including signal processing, machine learning, psychophysics, information theory, linguistics and physiology. It has a rich history spanning more than five decades and has seen tremendous advances in the last few years, propelling the transition of speech technology from controlled environments to millions of end-user applications.

In this tutorial, we review the evolution of speech feature processing methods, summarize the advances of the last two decades and provide insights into the future of feature engineering. This will include discussions of the spectral representation methods developed in the past, human-auditory-motivated techniques for robust speech processing, data-driven unsupervised features like i-vectors, and recent advances in deep neural network based techniques. With experimental results, we will also illustrate the impact of these features on various state-of-the-art speech processing systems. The future of speech signal processing will need to address various robustness issues in complex acoustic environments while deriving useful information from big data.
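As a small illustration of one auditory-motivated idea mentioned above, the sketch below warps the frequency axis to the mel scale, the step that underlies mel filterbank and MFCC features. The scale constants are the common HTK-style convention, and the filter count and band edges are illustrative assumptions rather than details of any specific system in the tutorial.

```python
import math

def hz_to_mel(f):
    """Map frequency in Hz to the (HTK-convention) mel scale."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping, mel back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_centers(n_filters, f_min, f_max):
    """Center frequencies (Hz) of triangular filters spaced evenly in mel."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]

centers = mel_filter_centers(26, 0.0, 8000.0)
# Even spacing in mel means spacing in Hz grows with frequency:
# dense filters at low frequencies, sparse filters at high frequencies,
# mimicking the resolution of the human auditory system.
print(centers[0], centers[-1])
```

Filterbank energies computed on this warped axis, followed by a log and a decorrelating transform, give the classic MFCC front-end that the historical portion of the tutorial covers.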

Biography: Sriram Ganapathy is a research staff member at the IBM T.J. Watson Research Center, Yorktown Heights, USA, where he works on signal analysis methods for radio communication speech in highly degraded environments. He received his Doctor of Philosophy from the Center for Language and Speech Processing, Johns Hopkins University, in 2011 under the supervision of Prof. Hynek Hermansky. Prior to this, he obtained his Bachelor of Technology from the College of Engineering, Trivandrum, India, in 2004 and his Master of Engineering from the Indian Institute of Science, Bangalore, in 2006. He has also worked as a Research Assistant at the Idiap Research Institute, Switzerland, from 2006 to 2008, contributing to various speech and audio projects. His research interests include signal processing, machine learning and robust methodologies for speech and speaker recognition. He has over 50 publications in leading international journals and conferences in the area of speech and audio processing, along with several patents.

Biography: Samuel Thomas completed his B.Tech. degree in Computer Engineering from the Cochin University of Science and Technology, India (2000) and his M.S. degree in Computer Science and Engineering from the Indian Institute of Technology Madras, India (2006) before receiving his Doctor of Philosophy degree from the Johns Hopkins University, Baltimore, in 2012. His doctoral dissertation, supervised by Prof. Hynek Hermansky, was titled “Data-driven Neural Network Based Feature Front-ends for Automatic Speech Recognition”. Since graduation, he has been at the IBM T.J. Watson Research Center, New York, as a post-doctoral researcher with the Advanced LVCSR group. In the past, he has worked on several speech research projects and workshops with the Center for Language and Speech Processing (CLSP) at JHU, the Idiap Research Institute, Switzerland, and the TeNeT group, IIT Madras. His research interests include speech processing and machine learning for speech recognition, speech synthesis and speaker recognition. He has over 40 publications in various international journals and conferences in the area of speech and audio processing.



Title: Recent Advances in Speaker Diarization
Presenter: Hagai Aronowitz, IBM Research, Haifa, Israel

Abstract: The tutorial will start with an introduction to speaker diarization, giving a general overview of the subject. Afterwards, we will cover the basic background, including feature extraction and common modeling techniques such as GMMs and HMMs. Then we will discuss voice activity detection, usually the first processing step in speaker diarization, and describe the classic approaches to speaker diarization that are widely used today. We will then introduce the state-of-the-art techniques in speaker recognition required to understand modern speaker diarization, describe diarization approaches based on advanced representation methods (supervectors, speaker factors, i-vectors), and present the supervised and unsupervised learning techniques used for speaker diarization. We will also discuss issues such as coping with an unknown number of speakers, detecting and dealing with overlapping speech, diarization confidence estimation, and online speaker diarization. Finally, we will discuss two recent works: exploiting a priori acoustic information (for example, processing a meeting when some of the participants are known to the system in advance and training data is available for them), and modeling speaker-turn dynamics. If time permits, we will also discuss concepts such as multi-modal diarization and the use of TDOA (time difference of arrival) for diarization of meetings.
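The clustering stage at the heart of many diarization systems can be sketched in a few lines. Here each segment is represented by a fixed-length vector (standing in for an i-vector or similar embedding) and clusters are merged bottom-up until the closest pair is farther apart than a stopping threshold; the plain Euclidean distance and the toy vectors are illustrative assumptions, where a real system would use PLDA or cosine scoring.

```python
def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def centroid(vectors):
    """Mean vector of a cluster."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cluster_segments(vectors, stop_dist):
    """Agglomerative clustering: merge the closest pair of clusters until
    the closest remaining pair exceeds the stopping distance."""
    clusters = [[v] for v in vectors]
    while len(clusters) > 1:
        cents = [centroid(c) for c in clusters]
        pairs = [(dist(cents[i], cents[j]), i, j)
                 for i in range(len(cents)) for j in range(i + 1, len(cents))]
        d, i, j = min(pairs)
        if d > stop_dist:              # closest clusters too far apart: stop
            break
        clusters[i] += clusters.pop(j)  # merge cluster j into cluster i
    return clusters

# Toy segment vectors from two "speakers".
segs = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 4.9]]
clusters = cluster_segments(segs, stop_dist=1.0)
print(len(clusters))  # → 2, one cluster per speaker
```

The stopping threshold is what determines the number of speakers when it is unknown in advance, which is one of the practical issues the tutorial addresses.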

Biography: Dr. Hagai Aronowitz received the B.Sc. degree in Computer Science, Mathematics and Physics from the Hebrew University, Jerusalem, Israel, in 1994, and the M.Sc. degree (summa cum laude) and Ph.D. degree, both in Computer Science, from Bar-Ilan University, Ramat-Gan, Israel, in 2000 and 2006 respectively. From 1994 to 2000 he was a senior researcher in speech and speaker recognition for the Israeli Government. From 2000 to 2005 he was a researcher in speech and speaker recognition for small-footprint mobile devices at the Intel Speech Research Lab. In 2006-2007 he was a postdoctoral fellow in the Advanced LVCSR group at the IBM T. J. Watson Research Center, Yorktown Heights, NY. Since 2007 Dr. Aronowitz has been a research staff member at the IBM Haifa Research Lab, leading the voice biometrics team. His research interests include speaker identification, speaker diarization, and biometrics. Some of the techniques he co-invented are well known to the speech community, such as GMM supervectors, intersession variability modeling for speaker recognition, and within-speaker variability modeling for speaker diarization. Dr. Aronowitz is the author of more than 40 scientific publications in peer-reviewed conferences and journals and an inventor of 10 patents.




Title: Multimodal Speech Recognition with the AusTalk 3D Audio-Visual Corpus
Presenters: Roberto Togneri, Mohammed Bennamoun and Chao (Luke) Sui, University of Western Australia

Abstract: This tutorial will provide attendees with a brief overview of 3D-based AVSR research. Attendees will learn how to use the newly developed 3D audio-visual data corpus we derived from the AusTalk corpus for audio-visual speech/speaker recognition. We also plan to present results on this corpus showing a significant increase in speech recognition accuracy from integrating both depth-level and grey-level visual features. In the first part of the tutorial, we will review recent work published in the last decade, so that attendees can obtain an overview of the fundamental concepts and challenges in this field. In the second part, we will briefly describe the recording protocol and contents of the 3D data corpus, and show attendees how to use it for their own research. In the third part, we will present our results on the corpus. The experimental results show that, compared with conventional AVSR based on audio and grey-level visual features, the integration of grey and depth visual information boosts AVSR accuracy significantly. Moreover, we will explain experimentally why adding depth information benefits standard AVSR systems. Finally, we hope this tutorial will inspire more researchers in the community to contribute to this exciting area of research.
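The simplest way to combine the audio, grey-level and depth-level streams the abstract mentions is feature-level ("early") fusion: normalise each stream and concatenate per-frame features. The sketch below illustrates this; the frame counts, dimensions and z-score normalisation are illustrative assumptions, not details of the AusTalk setup.

```python
def zscore(stream):
    """Normalise each feature dimension of a stream to zero mean, unit variance."""
    dims = len(stream[0])
    means = [sum(f[d] for f in stream) / len(stream) for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((f[d] - means[d]) ** 2 for f in stream) / len(stream)
        stds.append(var ** 0.5 or 1.0)  # guard against constant dimensions
    return [[(f[d] - means[d]) / stds[d] for d in range(dims)] for f in stream]

def fuse(audio, grey, depth):
    """Concatenate per-frame features from three time-aligned streams."""
    assert len(audio) == len(grey) == len(depth), "streams must be aligned"
    return [a + g + d for a, g, d in
            zip(zscore(audio), zscore(grey), zscore(depth))]

# Two toy frames: 3 audio, 2 grey-level and 2 depth-level coefficients each.
audio = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
grey  = [[10.0, 20.0], [30.0, 40.0]]
depth = [[1.0, 1.5], [2.0, 2.5]]
fused = fuse(audio, grey, depth)
print(len(fused), len(fused[0]))  # → 2 frames of 3+2+2 = 7 dimensions
```

Per-stream normalisation before concatenation keeps any one modality from dominating the fused vector, which is why early-fusion systems typically include it.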

Biography: Professor Roberto Togneri received the B.E. degree in 1985 and the Ph.D. degree in 1989, both from The University of Western Australia. He joined the School of Electrical, Electronic and Computer Engineering at The University of Western Australia in 1988, where he is currently a Professor. Prof. Togneri is a member of the Signals and Systems Engineering Research Group and heads the Signal and Information Processing Lab. His research activities include signal processing and robust feature extraction of speech signals, statistical and neural network models for speech and speaker recognition, audio-visual recognition and biometrics, and related aspects of communications, information retrieval, and pattern recognition. He has published over 100 refereed journal and conference papers in the areas of signals and information systems, was the chief investigator on two Australian Research Council Discovery Project research grants from 2010 to 2013, and is currently an Associate Editor for IEEE Signal Processing Magazine Lecture Notes and an Editor for IEEE Transactions on Speech, Audio and Language Processing.

Biography: Winthrop Professor Mohammed Bennamoun received the M.E. degree from Queen's University in 1988 and the Ph.D. degree from the Queensland University of Technology in 1996. Since 1996 he has served as a full-time academic covering teaching, research and administrative roles, and he is currently a Winthrop Professor (Level E) at The University of Western Australia (UWA), where he served as Head of the School of Computer Science and Software Engineering for over five years. He has contributed significantly to the fields of 2D and 3D computer vision, image processing and machine learning, as evidenced by his publications, including a Springer monograph on object recognition, several invited and keynote papers, book chapters, over 50 journal papers and more than 150 internationally refereed conference papers. His publications have been cited more than 3000 times (Google Scholar) in highly regarded computer vision literature surveys and refereed conference and journal papers. In the last five years, he has managed four Australian Research Council Discovery Projects, two Australian Research Council Linkage Projects, and one LIEF research grant, with strong research outputs and high-quality publications with his collaborators.

Biography: Chao Sui received his B.Eng. degree in measurement and control technology from Jiangxi University of Science and Technology, P. R. China, in 2009. Subsequently, he received an M.Eng. (research) degree from the University of New South Wales, Australia, in 2011, where he worked with Dr Ngai Ming Kwok on appearance-based hand gesture recognition. He is currently pursuing his Ph.D. degree at The University of Western Australia, advised by Professor Roberto Togneri and Winthrop Professor Mohammed Bennamoun, on 3D-based audio-visual speech recognition. His current research interests include audio-visual speech recognition, computer vision and machine learning.



Title: Semantic Web and Linked Big Data Resources for Spoken Language Processing
Presenters: Dilek Hakkani-Tür and Larry Heck, Microsoft Research, US

Abstract: State-of-the-art statistical spoken language processing typically requires significant manual effort to construct domain-specific schemas (ontologies), as well as further manual effort to annotate training data against these schemas. At the same time, a recent surge of activity and progress on semantic-web-related concepts from the large search-engine companies represents a potential alternative to the manually intensive design of spoken language processing systems. Standards such as schema.org have been established for schemas (ontologies) that webmasters can use to semantically and uniformly mark up their web pages. Search engines like Bing, Google, and Yandex have adopted these standards and are leveraging them to create semantic search engines at the scale of the web. As a result, open linked data resources and semantic graphs covering various domains (such as Freebase [3]) have grown massively every year and contain far more information than any single resource anywhere on the Web. Furthermore, these resources contain links to text data (such as Wikipedia pages) related to the knowledge in the graph.

Recently, several studies on spoken language processing have started to exploit these massive linked data resources for language modeling and spoken language understanding. This tutorial will include a brief introduction to the semantic web and the linked data structure, available resources, and query languages. An overview of related work on information extraction and language processing will be presented, with the main focus on methods for learning spoken language understanding models from these resources.
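The linked data structure the tutorial introduces can be pictured as a store of (subject, predicate, object) triples queried by pattern matching, the model behind query languages like SPARQL. The toy store below is a sketch of that idea; the entities and relations are made up for illustration and do not come from any real knowledge graph.

```python
# A toy triple store: each fact is a (subject, predicate, object) triple,
# the same shape as entries in a linked-data graph such as Freebase.
triples = [
    ("film_a", "directed_by", "director_x"),
    ("film_a", "genre", "comedy"),
    ("film_b", "directed_by", "director_x"),
    ("film_b", "genre", "drama"),
]

def query(s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard,
    like a variable in a SPARQL triple pattern."""
    return [(ts, tp, to) for ts, tp, to in triples
            if (s is None or ts == s)
            and (p is None or tp == p)
            and (o is None or to == o)]

# "Which films were directed by director_x?"
films = [t[0] for t in query(p="directed_by", o="director_x")]
print(films)  # → ['film_a', 'film_b']
```

Spoken language understanding work in this vein maps a user utterance onto such triple patterns, so the graph supplies both the schema and weakly labeled training examples.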

Biography: Dilek Hakkani-Tür is a senior researcher at Microsoft Research. Prior to joining Microsoft, she was a senior researcher in the ICSI speech group (2006-2010) and a senior technical staff member in the Voice Enabled Services Research Department at AT&T Labs-Research in Florham Park, NJ (2001-2005). She received her B.Sc. degree from Middle East Technical University in 1994, and her M.Sc. and Ph.D. degrees from the Department of Computer Engineering, Bilkent University, in 1996 and 2000, respectively. Her Ph.D. thesis is on statistical language modeling for agglutinative languages. Her research interests include spoken language processing, spoken dialog systems, and active and unsupervised learning for language processing. She has over 30 granted patents and has co-authored more than 200 papers in these areas. She is the recipient of three best paper awards for her work on active learning, from the IEEE Signal Processing Society, ISCA and EURASIP, and was recognized as an IEEE Fellow in 2014. She is a member of ISCA, IEEE, and the Association for Computational Linguistics. She was an associate editor of IEEE Transactions on Audio, Speech and Language Processing (2005-2008), a member of the HLT advisory board (2009-2010), and an elected member of the IEEE Speech and Language Technical Committee (2009-2013). She currently serves as an area editor of IEEE Signal Processing Letters (2012-2014) and of Elsevier’s Digital Signal Processing Journal (2012-2014).

Biography: Larry Heck is a Microsoft Distinguished Engineer at Microsoft Research. His research area is natural conversational interaction, focusing on open-domain NLP and dialog, machine learning, multimodal NUI, and inference/reasoning under uncertainty. From 2005 to 2009, he was Vice President of Search & Advertising Sciences at Yahoo!, responsible for the creation, development, and deployment of the algorithms powering Yahoo! Search, Yahoo! Sponsored Search, Yahoo! Content Match, and Yahoo! display advertising. From 1998 to 2005, he was with Nuance Communications, where he served as Vice President of R&D, responsible for natural language processing, speech recognition, voice authentication, and text-to-speech synthesis technology. He began his career as a researcher at the Stanford Research Institute (1992-1998), initially in the field of acoustics and later in speech research with the Speech Technology and Research (STAR) Laboratory. Dr. Heck received his Ph.D. in Electrical Engineering from the Georgia Institute of Technology in 1991.



Title: Speech and Audio for Multimedia Semantics
Presenters: Florian Metze (Carnegie Mellon University, USA) and Koichi Shinoda (Tokyo Institute of Technology, Japan)

Abstract: Internet media sharing sites and the one-click upload capability of smartphones are producing a deluge of multimedia content. While visual features are often dominant in such material, acoustic and speech information in particular often complements it. By facilitating access to large amounts of data, the text-based Internet gave a huge boost to the field of natural language processing. The vast amount of consumer-produced video becoming available now will do the same for video processing, eventually enabling semantic understanding of multimedia material, with implications for human computer interaction, robotics, etc.

Large-scale multi-modal analysis of audio-visual material is now central to a number of multi-site research projects around the world. While each of these projects has slightly different targets, they face largely the same challenges: how to robustly and efficiently process large amounts of data, how to represent and then fuse information across modalities, how to train classifiers and segmenters on unlabeled data, how to include human feedback, and so on.

In this tutorial, we will present the state of the art in large-scale video, speech, and non-speech audio processing, and show how these approaches are being applied to tasks such as content based video retrieval (CBVR) and multimedia event detection (MED). We will introduce the most important tools and techniques, and show how the combination of information across modalities can be used to induce semantics on multimedia material through ranking of information and fusion. Finally, we will discuss opportunities for research that the INTERSPEECH community specifically will find interesting and fertile.

Biography: Florian Metze received his PhD from Universität Karlsruhe (TH) for a thesis on “Articulatory Features for Conversational Speech Recognition” in 2005. He worked as a Senior Research Scientist at Deutsche Telekom Laboratories (T-Labs) from 2006 to 2009, and is now on the faculty of Carnegie Mellon University. Dr. Metze has worked on a wide range of topics in the field of speech and audio processing, information retrieval, and user interfaces, and holds several patents. Recent work includes low-resource speech recognition, non-textual aspects of speech (such as the perception of personality in speech and emotional speech synthesis), as well as video retrieval and textual summarization of multimedia material. He regularly participates in international evaluations, most recently in TRECVID 2013 (MED and MER) and OpenKWS 2013. He is also one of the founders of the “Speech Recognition Virtual Kitchen”.

Biography: Koichi Shinoda received his B.S. in 1987 and his M.S. in 1989, both in physics, from the University of Tokyo. He received his D.Eng. in computer science from Tokyo Institute of Technology in 2001. In 1989, he joined NEC Corporation and was involved in research on automatic speech recognition. From 1997 to 1998, he was a visiting scholar with Bell Labs, Lucent Technologies. From 2001, he was an Associate Professor with the University of Tokyo. He is currently a Professor at the Tokyo Institute of Technology. His research interests include speech recognition, video information retrieval, and human interfaces. Dr. Shinoda received the Awaya Prize from the Acoustical Society of Japan in 1997 and the Excellent Paper Award from the Institute of Electronics, Information, and Communication Engineers (IEICE) in 1998. He is an Associate Editor of Computer Speech and Language and a Subject Editor of Speech Communication. He is a member of IEEE, ACM, ASJ, IEICE, IPSJ, and JSAI.

[Back to Top]


Title: Adaptation Techniques for Statistical Speech Recognition
Presenter: Kai Yu (Shanghai Jiao Tong University, Shanghai)

Abstract: Adaptation techniques make better use of existing models when test data come from new acoustic or linguistic conditions, and form an important and challenging research area of statistical speech recognition. This tutorial gives a systematic review of the fundamental theory as well as an introduction to state-of-the-art adaptation techniques, covering both acoustic and language model adaptation. Following a simple example of acoustic model adaptation, basic concepts, procedures and categories of adaptation will be introduced. A number of advanced techniques will then be discussed, such as discriminative adaptation, Deep Neural Network adaptation, adaptive training, and the relationship to noise robustness. After this detailed review of acoustic model adaptation, language model adaptation, such as topic adaptation, will also be introduced. The tutorial concludes with a summary and a discussion of future research directions.
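To give a flavor of transform-based acoustic model adaptation (in the spirit of MLLR-style mean transforms, not code from the tutorial itself), the sketch below estimates a single shared affine transform that maps speaker-independent Gaussian means onto a new speaker's data. All data and dimensions are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Speaker-independent means of 3 toy Gaussian components (2-D features).
si_means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])

# Simulated adaptation data: the new speaker scales and shifts the space.
true_A = np.array([[1.2, 0.0], [0.0, 0.8]])
true_b = np.array([1.0, -0.5])
frames, labels = [], []
for k, mu in enumerate(si_means):
    frames.append(rng.normal(mu @ true_A.T + true_b, 0.3, size=(200, 2)))
    labels += [k] * 200
X, labels = np.vstack(frames), np.array(labels)

# MLLR-style estimation: choose one shared affine transform (A, b) so that
# A @ mu_k + b matches the per-component sample means, via least squares.
targets = np.array([X[labels == k].mean(axis=0) for k in range(3)])
design = np.hstack([si_means, np.ones((3, 1))])   # rows are [mu_k, 1]
W, _, _, _ = np.linalg.lstsq(design, targets, rcond=None)
A_hat, b_hat = W[:2].T, W[2]

# Adapted means for the new speaker.
adapted_means = si_means @ A_hat.T + b_hat
```

Because the transform is shared across components, a few seconds of adaptation data can update every Gaussian in the model at once, which is the key appeal of this family of methods.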

Biography: Kai Yu is a research professor in the Computer Science and Engineering Department of Shanghai Jiao Tong University, China. He obtained his Bachelor and Master degrees from Tsinghua University, Beijing, China and his Ph.D. from Cambridge University. He has published over 50 peer-reviewed journal and conference publications on speech recognition, synthesis and dialogue systems. He was a key member of the Cambridge team that built state-of-the-art LVCSR systems in the DARPA-funded EARS and GALE projects. He has also managed the design and implementation of a large-scale real-world ASR cloud service. He is a senior member of IEEE and a member of ISCA and the IET. He was the area chair for speech recognition and processing for INTERSPEECH 2009 and EUSIPCO 2011, the publication chair for IEEE ASRU 2011, and the area chair of spoken dialogue systems for INTERSPEECH 2014. He was selected into the "1000 Overseas Talent Plan (Young Talent)" by the Chinese central government in 2012. He was also selected into the Programme for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning.

[Back to Top]


Title: Emotion and Mental State Recognition: Features, Models, System Applications and Beyond
Presenters: Chung-Hsien Wu (National Cheng Kung University, Tainan City), Hsin-Min Wang (Academia Sinica, Taipei), Julien Epps (The University of New South Wales, Australia) and Vidhyasaharan Sethu (The University of New South Wales, Australia)

Abstract: Emotion recognition is the ability to identify what a person is feeling from moment to moment and to understand the connection between feelings and expressions. In today’s world, human-computer interaction (HCI) interfaces undoubtedly play an important role in our daily lives. Toward harmonious HCI interfaces, automated analysis and recognition of human emotion has attracted increasing attention from researchers in multidisciplinary research fields. A specific area of current interest, which also has key implications for HCI, is the estimation of cognitive load (mental workload), research into which is still at an early stage. Technologies for processing daily activities, including speech, text and music, have expanded the interaction modalities between humans and computer-supported communicational artifacts.

In this tutorial, we will present theoretical and practical work offering new and broad views of the latest research in emotional awareness from audio and speech. We will cover topics spanning a variety of theoretical backgrounds and applications, ranging from salient emotional features, emotional-cognitive models, and compensation methods for variability due to speaker and linguistic content, to machine learning approaches applicable to emotion recognition. For each topic, we will review the state of the art by introducing current methods and presenting several applications. In particular, the application to cognitive load estimation will be discussed, from its psychophysiological origins to system design considerations. Eventually, technologies developed in different areas will be combined for future applications, so in addition to a survey of future research challenges, we will envision a few scenarios in which affective computing can make a difference.

Biography: Prof. Chung-Hsien Wu received the Ph.D. degree in electrical engineering from National Cheng Kung University, Tainan, Taiwan, R.O.C., in 1991. Since August 1991, he has been with the Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan. He became professor and distinguished professor in August 1997 and August 2004, respectively. From 1999 to 2002, he served as the Chairman of the Department. Currently, he is the deputy dean of the College of Electrical Engineering and Computer Science, National Cheng Kung University. He also worked at the Computer Science and Artificial Intelligence Laboratory of the Massachusetts Institute of Technology (MIT), Cambridge, MA, in summer 2003 as a visiting scientist. He received the Outstanding Research Award of the National Science Council in 2010 and the Distinguished Electrical Engineering Professor Award of the Chinese Institute of Electrical Engineering in 2011, Taiwan. He is currently an associate editor of IEEE Trans. Audio, Speech and Language Processing, IEEE Trans. Affective Computing, ACM Trans. Asian Language Information Processing, and the Subject Editor on Information Engineering of the Journal of the Chinese Institute of Engineers (JCIE). His research interests include affective speech recognition, expressive speech synthesis, and spoken language processing. Dr. Wu is a senior member of IEEE and a member of the International Speech Communication Association (ISCA). He was the President of the Association for Computational Linguistics and Chinese Language Processing (ACLCLP) from 2009 to 2011. He was the Chair of the IEEE Tainan Signal Processing Chapter and has been the Vice Chair of the IEEE Tainan Section since 2009.

Biography: Dr. Hsin-Min Wang received the B.S. and Ph.D. degrees in electrical engineering from National Taiwan University in 1989 and 1995, respectively. In October 1995, he joined the Institute of Information Science, Academia Sinica, where he is now a research fellow and deputy director. He was an adjunct associate professor with National Taipei University of Technology and National Chengchi University. He currently serves as the president of the Association for Computational Linguistics and Chinese Language Processing (ACLCLP), a managing editor of Journal of Information Science and Engineering, and an editorial board member of International Journal of Computational Linguistics and Chinese Language Processing. His major research interests include spoken language processing, natural language processing, multimedia information retrieval, and pattern recognition. Dr. Wang received the Chinese Institute of Engineers (CIE) Technical Paper Award in 1995 and the ACM Multimedia Grand Challenge First Prize in 2012. He is a senior member of IEEE, a member of ISCA and ACM, and a life member of Asia Pacific Signal and Information Processing Association (APSIPA), ACLCLP, and Institute of Information & Computing Machinery (IICM).

Biography: Dr Julien Epps received the BE and PhD degrees in Electrical Engineering from the University of New South Wales, Australia, in 1997 and 2001 respectively. After an appointment as a Postdoctoral Fellow at the University of New South Wales, he worked on speech recognition and speech processing research firstly as a Research Engineer at Motorola Labs and then as a Senior Researcher at National ICT Australia. He was appointed as a Senior Lecturer in the UNSW School of Electrical Engineering and Telecommunications in 2007 and then as an Associate Professor in 2013. Dr Epps has also held visiting academic and research appointments at The University of Sydney and the A*STAR Institute for Infocomm Research (Singapore). He has authored or co-authored around 150 publications, which have been collectively cited more than 1500 times. He has served as a reviewer for most major speech processing journals and conferences and as a Guest Editor for the EURASIP Journal on Advances in Signal Processing Special Issue on Emotion and Mental State Recognition from Speech. He has also co-organised or served on the committees of key workshops related to this tutorial, such as the ACM ICMI Workshop on Inferring Cognitive and Emotional States from Multimodal Measures (2011), ASE/IEEE Int. Conf. on Social Computing Workshop on Wide Spectrum Social Signal Processing (2012), 4th International Workshop on Corpora for Research on Emotion, Sentiment and Social Signals (Satellite of LREC 2012), Audio/Visual Emotion Challenge and Workshop AVEC 2011 (part of the Int. Conf. on Affective Computing and Intelligent Interaction), AVEC 2012 (part of ACM ICMI) and AVEC 2013 (part of ACM Multimedia). His research interests include applications of speech modelling to emotion and mental state classification and speaker verification.

Biography: Dr Vidhyasaharan Sethu received his BE degree from Anna University, India, and his MEngSc (Signal Processing) degree from the University of New South Wales, Australia. He was awarded his PhD in 2010 for his work on Automatic Emotion Recognition, by the University of New South Wales (UNSW). Following this, he worked as a Postdoctoral Research Fellow at the speech research group at UNSW on the joint modelling of linguistic and paralinguistic information in speech with a focus on emotion recognition. He is currently a Lecturer in Signal Processing at the School of Electrical Engineering and Telecommunications, University of New South Wales, Sydney, Australia. He teaches courses on speech processing, signal processing and electrical system design in the school and is a reviewer for a number of journals including Speech Communication and EURASIP Journal on Audio, Speech and Music Processing and IEEE Transactions on Education. His research interests include emotion recognition, speaker recognition, language identification and the application of machine learning in speech processing.

[Back to Top]


Title: Unsupervised Speech and Language Processing via Topic Models
Presenter: Jen-Tzung Chien (National Chiao Tung University, Hsinchu)

Abstract: In this tutorial, we will present state-of-the-art machine learning approaches for speech and language processing, highlighting unsupervised methods for structural learning from unlabeled sequential patterns. In general, speech and language processing involves extensive knowledge of statistical models. In the era of big data, we must design flexible, scalable and robust systems to cope with heterogeneous and nonstationary environments. This tutorial starts with an introduction to unsupervised speech and language processing based on factor analysis and independent component analysis. Unsupervised learning is then generalized to a latent variable model known as the topic model. The evolution of topic models from latent semantic analysis to the hierarchical Dirichlet process, from non-Bayesian parametric models to Bayesian nonparametric models, and from single-layer models to hierarchical tree models will be surveyed in an organized fashion. Inference approaches based on variational Bayes and Gibbs sampling are introduced. We will also present several case studies on topic modeling for speech and language applications, including language models, document models, retrieval models, segmentation models and summarization models. Finally, we will point out new trends in topic models for speech and language processing.
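As a minimal, concrete illustration of the topic-model idea (an invented example, not material from the tutorial), the sketch below fits a two-topic latent Dirichlet allocation model to four toy documents using scikit-learn's variational-Bayes implementation; the corpus and hyperparameters are made up for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Two toy subjects with disjoint vocabularies: speech vs. finance.
docs = [
    "speech recognition acoustic model training",
    "acoustic model speech decoding recognition",
    "stock market price trading finance",
    "finance market trading stock investment",
]

# Bag-of-words counts, then LDA with two latent topics.
vec = CountVectorizer()
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0, max_iter=50)
theta = lda.fit_transform(X)      # per-document topic proportions

# Each row of theta is a distribution over topics; documents about the
# same subject should share the same dominant topic.
dominant = theta.argmax(axis=1)
```

The same document-topic matrix `theta` is what downstream applications (retrieval, segmentation, summarization) typically consume; Gibbs-sampling inference would produce an analogous quantity from topic-assignment counts.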

Biography: Jen-Tzung Chien received his Ph.D. degree in electrical engineering from National Tsing Hua University, Hsinchu, in 1997. During 1997-2012, he was with National Cheng Kung University, Tainan. Since 2012, he has been with the Department of Electrical and Computer Engineering, National Chiao Tung University (NCTU), Hsinchu, where he is currently a Distinguished Professor. He serves as an adjunct professor in the Department of Computer Science, NCTU. He has held visiting researcher positions at Panasonic Technologies Inc., Santa Barbara, CA; the Tokyo Institute of Technology, Tokyo, Japan; the Georgia Institute of Technology, Atlanta, GA; Microsoft Research Asia, Beijing, China; and the IBM T. J. Watson Research Center, Yorktown Heights, NY. His research interests include machine learning, speech recognition, information retrieval and blind source separation. He served as an associate editor of the IEEE Signal Processing Letters in 2008-2011, a guest editor of the IEEE Transactions on Audio, Speech and Language Processing in 2012, an organization committee member of ICASSP 2009, and an area coordinator of Interspeech 2012. He was appointed as an APSIPA Distinguished Lecturer for 2012-2013. He received the Distinguished Research Award from the National Science Council in 2006 and 2010. He was a co-recipient of the Best Paper Award of the IEEE Automatic Speech Recognition and Understanding Workshop in 2011. Dr. Chien has served as a tutorial speaker for ICASSP 2012 in Kyoto, Interspeech 2013 in Lyon, and APSIPA 2013 in Kaohsiung.

[Back to Top]


Title: Deep Learning for Speech Generation and Synthesis
Presenters: Yao Qian and Frank K. Soong (Microsoft Research Asia, Beijing)

Abstract: Deep learning, which can represent high-level abstractions in data with an architecture of multiple non-linear transformations, has made a huge impact on automatic speech recognition (ASR) research, products and services. However, deep learning for speech generation and synthesis (i.e., text-to-speech), the inverse process of speech recognition (i.e., speech-to-text), has not yet generated similar momentum. Recently, motivated by the success of Deep Neural Networks in speech recognition, several neural-network-based approaches have been applied successfully to improve the performance of statistical parametric speech generation and synthesis. In this tutorial, we focus on deep learning approaches to problems in speech generation and synthesis, especially Text-to-Speech (TTS) synthesis and voice conversion.

First, we review the current mainstream of statistical parametric speech generation and synthesis: GMM-HMM based speech synthesis and GMM-based voice conversion, with emphasis on the major factors responsible for the quality problems of GMM-based synthesis/conversion and the intrinsic limitations of decision-tree based contextual state clustering and state-based statistical distribution modeling. We then present the latest deep learning algorithms for feature parameter trajectory generation, in contrast to deep learning for recognition or classification. We cover common technologies in Deep Neural Networks (DNN) and their refinements: Mixture Density Networks (MDN), Recurrent Neural Networks (RNN) with Bidirectional Long Short-Term Memory (BLSTM), and Conditional RBMs (CRBM). Finally, we share our research insights and hands-on experience in building speech generation and synthesis systems based upon deep learning algorithms.
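To make the regression view of DNN-based parametric synthesis concrete, here is a minimal self-contained sketch (with invented synthetic data, not the presenters' system): a small feed-forward network trained by gradient descent to map "linguistic context" vectors to "acoustic parameter" frames, the core mapping such systems learn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for TTS acoustic modeling: 10-D context features in,
# 3-D acoustic parameter frames out (real systems use hundreds of dims).
X = rng.normal(size=(500, 10))
true_W = rng.normal(size=(10, 3))
Y = np.tanh(0.3 * (X @ true_W)) + 0.01 * rng.normal(size=(500, 3))

# One-hidden-layer network trained with full-batch gradient descent on MSE.
H, lr = 32, 0.1
W1 = 0.1 * rng.normal(size=(10, H)); b1 = np.zeros(H)
W2 = 0.1 * rng.normal(size=(H, 3)); b2 = np.zeros(3)

def forward(X):
    h = np.tanh(X @ W1 + b1)          # hidden activations
    return h, h @ W2 + b2             # predicted acoustic frames

_, pred0 = forward(X)
mse0 = float(((pred0 - Y) ** 2).mean())   # error before training

for _ in range(1000):
    h, pred = forward(X)
    err = (pred - Y) / len(X)         # gradient of half the summed error
    gW2, gb2 = h.T @ err, err.sum(axis=0)
    dh = (err @ W2.T) * (1.0 - h ** 2)    # backprop through tanh
    gW1, gb1 = X.T @ dh, dh.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

_, pred = forward(X)
mse = float(((pred - Y) ** 2).mean())     # error after training
```

An MDN would replace the final linear output with mixture parameters, and a BLSTM would replace the per-frame hidden layer with recurrence over the frame sequence, but the trained input-to-output mapping plays the same role.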

Biography: Yao Qian is a Lead Researcher in the Speech Group, Microsoft Research Asia. She received her Ph.D. from the Department of Electronic Engineering, The Chinese University of Hong Kong, in 2005, and joined Microsoft Research Asia in September 2005, right after receiving her degree. Her research interests are in spoken language processing, including TTS speech synthesis and automatic speech recognition. Her recent research projects include speech synthesis, voice transformation, prosody modeling and Computer-Assisted Language Learning (CALL). She has over 50 publications in international journals and conference proceedings, and ten U.S. patent applications, five of them issued. She has been recognized within Microsoft and in the speech research community for her contributions to TTS and many other speech technologies. She is a senior member of IEEE and a member of ISCA.

Biography: Frank K. Soong is a Principal Researcher in the Speech Group, Microsoft Research Asia (MSRA), Beijing, China, where he works on fundamental research on speech and its practical applications. His professional research career spans over 30 years, first with Bell Labs, US, then with ATR, Japan, before joining MSRA in 2004. At Bell Labs, he worked on stochastic modeling of speech signals, optimal decoder algorithms, speech analysis and coding, and speech and speaker recognition. He was responsible for developing the recognition algorithm that was incorporated into voice-activated mobile phone products rated by Mobile Office Magazine (Apr. 1993) as "outstandingly the best". He is a co-recipient of the Bell Labs President Gold Award for developing the Bell Labs Automatic Speech Recognition (BLASR) software package.

He has served as a member of the Speech and Language Technical Committee of the IEEE Signal Processing Society and in other society functions, including Associate Editor of the IEEE Transactions on Speech and Audio Processing and chairing IEEE international workshops. He has published more than 200 papers and co-edited a widely used reference book, Automatic Speech and Speaker Recognition: Advanced Topics (Kluwer, 1996). He is a visiting professor of The Chinese University of Hong Kong (CUHK) and a few other top-rated universities in China, and the co-Director of the MSRA-CUHK Joint Research Lab. He received his BS, MS and PhD degrees from National Taiwan University, the University of Rhode Island, and Stanford University, respectively, all in Electrical Engineering. He is an IEEE Fellow.

[Back to Top]



Copyright © 2013-2015 Chinese and Oriental Languages Information Processing Society
Conference managed by Meeting Matters International