Plenaries and Tutorials
- Plenary 1 (16:30-17:30 Dec 13)
Interactive Computer Aids for Acquiring Proficiency in Mandarin
Dr Stephanie Seneff
Principal Research Scientist
Computer Science and Artificial Intelligence Laboratory (CSAIL), MIT
It is widely recognized that one of the best ways to learn a foreign language is through spoken dialogue with a native speaker. However, this is not a practical method in the classroom, due to the one-to-one student/teacher ratio it implies. A potential solution is to rely on computer spoken dialogue systems to role-play a conversational partner. This keynote describes our research in adapting multilingual spoken dialogue systems to address this need, for the scenario of a native English speaker learning Chinese. Students can engage in dialogue with the computer either over the telephone or through audio/typed input at a Web page. Several different domains are being developed, in which a student's conversational interaction is assisted by a robotic "tutor" that can provide translation assistance at any time. Thus, two recognizers run in parallel, one for English and one for Chinese. In the talk, I will describe the current status of our research, specifically addressing three technology topics: (1) spoken language translation in narrow domains, (2) the design of a symmetrical dialogue interaction paradigm appropriate for language learning scenarios, and (3) language assessment, particularly tone production quality. Several audio and video clips will be presented.
- Plenary 2 (8:30-9:30 Dec 14)
The Affective and Pragmatic Coding of Prosody
Prof Klaus R. Scherer, Ph.D
Director, Swiss Center for Affective Sciences
University of Geneva, Switzerland
Human vocal expression includes both expressions of emotion, such as anger or happiness, and pragmatic intonation, such as interrogative or affirmative forms, embedded within the language. These two types of prosody are differently affected by the so-called "push" and "pull" effects. Push effects, driven by psychophysiological activity, strongly affect emotional prosody, whereas pull effects, driven by cultural rules of expression, predominantly affect intonation or pragmatic prosody, even though both processes influence all prosodic production. Several empirical studies are described that exemplify the possibilities of dissociating emotional and linguistic prosody decoding at the behavioral and neurological levels. The results highlight the importance of considering not only the distinction between different types of prosody, but also the relevance of the task performed by the participants, in order to better understand the information processing involved in human vocal expression at the suprasegmental level.
- Plenary 3 (8:30-9:30 Dec 15)
Challenges in Machine Translation
Dr Franz Josef Och
Senior Staff Research Scientist
Google Research
In recent years there has been an enormous boom in MT research.
Not only have the number of research groups in the field and the amount of funding increased, but there is now also optimism about the future of the field and about achieving still better quality. The major reason for this change has been a paradigm shift away from linguistic/rule-based methods towards empirical/data-driven methods in MT, made possible by the availability of large amounts of training data and large computational resources.
This paradigm shift towards empirical methods has fundamentally changed the way MT research is done. The field faces new challenges.
To achieve optimal MT quality, we want to train models on as much data as possible, ideally language models trained on hundreds of billions of words and translation models trained on hundreds of millions of words. Doing that requires very large computational resources, a corresponding software infrastructure, and a focus on systems building and engineering.
In addition to discussing those challenges in MT research, the talk will also give specific examples on how some of the data challenges are being dealt with at Google Research.
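The empirical, data-driven paradigm described above can be illustrated in miniature. The sketch below trains IBM Model 1, one of the classic data-driven word-translation models, with expectation-maximization on a toy parallel corpus; it is a deliberate simplification (no NULL word, no alignment model beyond Model 1) and is not the system discussed in the talk.

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=20):
    """Estimate word translation probabilities t(f | e) with EM
    (IBM Model 1, simplified: no NULL word)."""
    f_vocab = {f for fs, _ in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts c(f, e)
        total = defaultdict(float)  # expected counts c(e)
        for fs, es in pairs:
            for f in fs:
                # Posterior probability of f aligning to each e in this pair.
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    p = t[(f, e)] / z
                    count[(f, e)] += p
                    total[e] += p
        # M-step: renormalize expected counts into probabilities.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

# Toy French-English parallel corpus.
corpus = [
    (["la", "maison"], ["the", "house"]),
    (["la", "fleur"], ["the", "flower"]),
]
t = train_ibm_model1(corpus)
```

On this two-sentence corpus, EM gradually concentrates probability mass on the correct pairings, so that t("maison" | "house") approaches 1; the point of the example is that the model is learned entirely from parallel data, with no hand-written rules.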
- Plenary 4 (8:30-9:30 Dec 16)
Automatic Indexing and Retrieval of Large Broadcast News Video Collections – the TRECVID Experience
Prof Tat-Seng CHUA
School of Computing, National University of Singapore
Most existing operational systems rely purely on automatic speech recognition (ASR) text as the basis for news video indexing and retrieval. While current research shows that ASR text has been the most influential component, results of large-scale news video processing experiments indicate that the use of other modality features and external information sources such as the Web is essential in various situations. This talk reviews the frameworks and machine learning techniques used to fuse the ASR text with multi-modal and multi-source information to tackle the challenging problems of story segmentation, concept detection, and retrieval in broadcast news video. The talk also points the way towards the development of scalable technology to process large news video archives.
- Tutorial 1 (10:00-12:00 Dec 13)
An HMM-Based Approach to Flexible Speech Synthesis
Prof Keiichi Tokuda
Department of Computer Science and Engineering, Nagoya Institute of Technology
The increasing availability of large speech databases makes it possible to construct speech synthesis systems by applying statistical learning algorithms; such approaches are referred to as corpus-based, data-driven, speaker-driven, or trainable. These systems, which can be trained automatically, not only generate natural, high-quality synthetic speech but can also reproduce the voice characteristics of the original speaker. This talk presents one of these approaches, HMM-based speech synthesis. The basic idea of the approach is very simple: just train HMMs (hidden Markov models) and generate speech directly from them. Realizing such a speech synthesis system, however, requires several techniques: algorithms for speech parameter generation from HMMs and a mel-cepstrum-based vocoding technique are reviewed, and an approach to simultaneous modeling of phonetic and prosodic parameters (spectrum, F0, and duration) is also presented. The main feature of the system is the use of dynamic features: by including dynamic coefficients in the feature vector, the speech parameter sequence generated in synthesis is constrained to be realistic, as defined by the parameters of the HMMs. The attraction of this approach is that the voice characteristics of synthesized speech can easily be changed by transforming HMM parameters; indeed, it has been shown that voice characteristics can be changed by applying a speaker adaptation technique originally developed for speech recognition. The relationship between the HMM-based approach and concatenative speech synthesis approaches is also discussed. The talk will include not only the technical description but also recent results and demos.
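The dynamic-feature constraint mentioned in the abstract can be sketched in a toy setting. Assuming a one-dimensional parameter stream with only first-order (delta) coefficients and known per-frame Gaussian means and variances, maximum-likelihood parameter generation reduces to solving the normal equations (W' S^-1 W) c = W' S^-1 mu for the static trajectory c. All names and the toy data below are illustrative; the actual algorithm operates on state-aligned, multi-dimensional streams with delta and delta-delta windows.

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            r = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= r * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def generate_trajectory(means, variances, d_means, d_variances):
    """ML parameter generation for a 1-D stream with first-order deltas.

    Minimizes  sum_t (c_t - mu_t)^2 / var_t
             + sum_{t>0} ((c_t - c_{t-1}) - dmu_t)^2 / dvar_t,
    i.e. solves the normal equations (W' S^-1 W) c = W' S^-1 mu.
    """
    T = len(means)
    A = [[0.0] * T for _ in range(T)]
    b = [0.0] * T
    for t in range(T):                 # static (identity) rows of W
        A[t][t] += 1.0 / variances[t]
        b[t] += means[t] / variances[t]
    for t in range(1, T):              # delta (first-difference) rows of W
        w = 1.0 / d_variances[t]
        A[t][t] += w
        A[t - 1][t - 1] += w
        A[t][t - 1] -= w
        A[t - 1][t] -= w
        b[t] += d_means[t] / d_variances[t]
        b[t - 1] -= d_means[t] / d_variances[t]
    return solve(A, b)

# A step in the static means (e.g. crossing an HMM state boundary); the
# delta terms (delta mean 0) pull the trajectory into a smooth transition.
means = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
variances = [1.0] * 6
d_means = [0.0] * 6        # index 0 unused
d_variances = [1.0] * 6
c = generate_trajectory(means, variances, d_means, d_variances)
```

Without the delta terms the generated sequence would simply jump from 0 to 1 at the state boundary; with them, the trajectory rises smoothly, which is exactly the "constrained to be realistic" behavior the abstract describes.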
- Tutorial 2 (13:30-15:30 Dec 13)
Text Information Extraction and Retrieval
Dr Hang Li
Researcher and Project Leader
Microsoft Research Asia
Every day, people spend much time creating, processing, and accessing information. In fact, most information exists in the form of "text", contained in books, emails, web pages, newspaper articles, blogs, and reports. Helping people quickly find information in text data, and helping them discover new knowledge from it, has become an enormously important issue. Many research efforts have been devoted to text information extraction, retrieval, and mining, and significant progress has been made in recent years. A large number of new methods have been proposed, and many systems have been developed and put into practical use. This tutorial aims to give an overview of two central topics in the area: Information Extraction (IE) and Information Retrieval (IR). Key technologies for each will be introduced. Specifically, models for IE such as the Maximum Entropy Markov Model and the Conditional Random Field will be explained, and models for IR such as the Language Model and Learning to Rank will be described. A brief survey of recent work on both IE and IR will be given. Finally, some recent work on the combined use of IE and IR technologies will also be introduced.
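As a small illustration of the language-modeling approach to IR that the tutorial covers, the sketch below ranks documents by query likelihood under a Dirichlet-smoothed unigram model. This is one standard instantiation of the idea; the smoothing parameter mu and the toy corpus are purely illustrative.

```python
import math
from collections import Counter

def lm_score(query, doc, collection_tf, collection_len, mu=10.0):
    """Log query likelihood log P(q | d) under a Dirichlet-smoothed
    unigram document language model."""
    doc_tf = Counter(doc)
    score = 0.0
    for w in query:
        p_collection = collection_tf[w] / collection_len
        # Smoothed document model: mix document counts with collection stats.
        p = (doc_tf[w] + mu * p_collection) / (len(doc) + mu)
        score += math.log(p)
    return score

# Toy collection of two "documents".
docs = {
    "d1": "information retrieval with language models".split(),
    "d2": "maximum entropy markov models for extraction".split(),
}
collection = [w for d in docs.values() for w in d]
collection_tf = Counter(collection)

query = "retrieval models".split()
ranked = sorted(docs, key=lambda d: lm_score(query, docs[d], collection_tf,
                                             len(collection)), reverse=True)
```

Here d1 contains both query terms while d2 contains only one, so d1 ranks first; the collection statistics keep unseen terms from zeroing out a document's score, which is the role smoothing plays in language-model IR.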