Incremental Speech Understanding in a Multi-Party Virtual Human Dialogue System

Title: Incremental Speech Understanding in a Multi-Party Virtual Human Dialogue System
Publication Type: Conference Paper
Year of Publication: 2012
Authors: DeVault, D., and D. R. Traum
Conference Name: NAACL HLT 2012
Date Published: June 3-8, 2012
Conference Location: Montreal, Canada

1 Extended Abstract
This demonstration highlights some emerging ca-
pabilities for incremental speech understanding and
processing in virtual human dialogue systems. This
work is part of an ongoing effort that aims to en-
able realistic spoken dialogue with virtual humans in
multi-party negotiation scenarios (Plüss et al., 2011;
Traum et al., 2008b). These scenarios are designed
to allow trainees to practice their negotiation skills
by engaging in face-to-face spoken negotiation with
one or more virtual humans.
An important component in achieving naturalistic
behavior in these negotiation scenarios, in which the
virtual humans should ideally demonstrate fluid
turn-taking, complex reasoning, and responsiveness
to factors like trust and emotions, is for the virtual
humans to begin to understand, and in some cases
respond to, users' speech in real time, as the users
are speaking (DeVault et al., 2011b). These
responses could range from relatively straightforward
turn management behaviors, like having a virtual hu-
man recognize when it is being addressed by a user
utterance, and possibly turn to look at the user who
has started speaking, to more complex responses
such as emotional reactions to the content of what
users are saying.
The current demonstration extends our previous
demonstration of incremental processing (Sagae et
al., 2010) in several important respects. First, it
includes additional indicators, as described in (De-
Vault et al., 2011a). Second, it is applied to a new
domain, an extension of that presented in (Plüss et
al., 2011). Finally, it is integrated with the dialogue
models (Traum et al., 2008a), such that each partial
interpretation is given a full pragmatic interpretation
by each virtual character, which can be used to
generate real-time incremental non-verbal feedback
(Wang et al., 2011).

Figure 1: SASO negotiation in the saloon: Utah (left)
looking at Harmony (right).
Our demonstration is set in an implemented multi-
party negotiation domain (Plüss et al., 2011) in
which two virtual humans, Utah and Harmony (pic-
tured in Figure 1), talk with two human negotiation
trainees, who play the roles of Ranger and Deputy.
The dialogue takes place inside a saloon in an Amer-
ican town in the Old West. In this negotiation sce-
nario, the goal of the two human role players is to
convince Utah and Harmony that Utah, who is cur-
rently employed as the local bartender, should take
on the job of town sheriff.
One of the research aims for this work is to
support natural dialogue interaction, an example of
which is the excerpt of human role play dialogue
shown in Figure 2. One of the key features of immer-
sive role plays is that people often react in multiple
ways to the utterances of others as they are speaking.
For example, in this excerpt, the beginning of the
Ranger We can't leave this place and have it overrun by outlaws.
Uh there's no way that's gonna happen so we're gonna
make sure we've got a properly deputized and equipped
sheriff ready to maintain order in this area.
00:03:56.660 - 00:04:08.830
Deputy Yeah and you know and and we're willing to
00:04:06.370 - 00:04:09.850
Utah And I don't have to leave the bar completely. I can still
uh be here part time and I can um we can hire someone to
do the like day to day work and I'll do the I'll supervise
them and I'll teach them.
00:04:09.090 - 00:04:22.880
Figure 2: Dialogue excerpt from one of the role plays.
Timestamps indicate the start and end of each utterance.
Deputy's utterance overlaps the end of the Ranger's,
and then Utah interrupts the Deputy and takes the
floor a few seconds later.
Our prediction approach to incremental speech
understanding utilizes a corpus of in-domain spo-
ken utterances, including both paraphrases selected
and spoken by system developers, as well as spo-
ken utterances from user testing sessions (DeVault
et al., 2011b). An example of a corpus element is
shown in Figure 3. In previous negotiation domains,
we have found a fairly high word error rate in au-
tomatic speech recognition results for such sponta-
neous multi-party dialogue data; for example, our
average word error rate was 0.39 in the SASO-EN
negotiation domain (Traum et al., 2008b) with many
(15%) out of domain utterances. Our speech un-
derstanding framework is robust to these kinds of
problems (DeVault et al., 2011b), partly through
approximating the meaning of utterances. Utter-
ance meanings are represented using an attribute-
value matrix (AVM), where the attributes and val-
ues represent semantic information that is linked to
a domain-specific ontology and task model (Traum,
2003; Hartholt et al., 2008; Plüss et al., 2011). The
AVMs are linearized, using a path-value notation, as
seen in Figure 3. In our framework, we use this data
to train two data-driven models, one for incremen-
tal natural language understanding, and a second for
incremental confidence modeling.
The first step is to train a predictive incremental
understanding model. This model is based on maxi-
mum entropy classification, and treats entire individ-
ual frames as output classes, with input features ex-
tracted from partial ASR results, calculated in incre-
ments of 200 milliseconds (DeVault et al., 2011b).
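A minimal sketch of this setup, using scikit-learn's multinomial logistic regression as a stand-in for the maximum entropy classifier; the training pairs, features, and frame labels below are invented for illustration, not drawn from the actual corpus.

```python
# Sketch: treat each complete frame as one output class and classify
# partial ASR results. Data and feature choices here are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each partial ASR result is paired with the full frame annotated for
# the *complete* utterance it belongs to (hypothetical labels).
partials = [
    "have come",
    "have come here today to talk",
    "have come here today to talk to you about would the like to become the sheriff",
    "can you protect",
    "can you protect the town",
]
frames = [
    "q(want(utah, sheriff-job))",
    "q(want(utah, sheriff-job))",
    "q(want(utah, sheriff-job))",
    "q(can(utah, protect(town)))",
    "q(can(utah, protect(town)))",
]

model = make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
model.fit(partials, frames)

# At runtime, each new partial result (arriving every 200 ms) is
# classified, yielding a predicted complete frame mid-utterance.
print(model.predict(["have come here today"])[0])
```

Because whole frames are classes, the model can emit a complete frame prediction from only a prefix of the utterance, which is what makes the NLU predictive.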
· Utterance (speech): i've come here today to talk to you
about whether you'd like to become the sheriff of this town
· ASR (NLU input): have come here today to talk to you
about would the like to become the sheriff of this town
· Frame (NLU output):
.mood interrogative
.sem.modal.desire want
.sem.prop.agent utah
.sem.prop.event providePublicServices
.sem.prop.location town
.sem.prop.theme sheriff-job
.sem.prop.type event
.sem.q-slot polarity
.sem.speechact.type info-req
.sem.type question
Figure 3: Example of a corpus training example.
Each partial ASR result then serves as an incremen-
tal input to NLU, which is specially trained for par-
tial input as discussed in (Sagae et al., 2009). NLU
is predictive in the sense that, for each partial ASR
result, the NLU module produces as output the com-
plete frame that has been associated by a human an-
notator with the user's complete utterance, even if
that utterance has not yet been fully processed by
the ASR. For a detailed analysis of the performance
of the predictive NLU, see (DeVault et al., 2011b).
The second step in our framework is to train a set
of incremental confidence models (DeVault et al.,
2011a), which allow the agents to assess in real time,
while a user is speaking, how well the understand-
ing process is proceeding. The incremental confi-
dence models build on the notion of NLU F-score,
which we use to quantify the quality of a predicted
NLU frame in relation to the hand-annotated correct
frame. The NLU F-score is the harmonic mean of
the precision and recall of the attribute-value pairs
(or frame elements) that compose the predicted and
correct frames for each partial ASR result. By using
precision and recall of frame elements, rather than
simply looking at frame accuracy, we take into ac-
count that certain frames are more similar than oth-
ers, and allow for cases when the correct frame is
not in the training set.
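The NLU F-score can be computed directly from the two frames' sets of attribute-value pairs; the example frames below are hypothetical.

```python
# Sketch of the NLU F-score: harmonic mean of precision and recall over
# the attribute-value pairs (frame elements) of the predicted frame
# versus the hand-annotated correct frame.

def nlu_f_score(predicted, correct):
    """F-score over frame elements, each a (path, value) pair."""
    pred, gold = set(predicted), set(correct)
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = {(".mood", "interrogative"), (".sem.type", "question"),
        (".sem.prop.agent", "utah"), (".sem.prop.theme", "sheriff-job")}
pred = {(".mood", "interrogative"), (".sem.type", "question"),
        (".sem.prop.agent", "utah")}

print(round(nlu_f_score(pred, gold), 3))  # precision 1.0, recall 0.75 -> 0.857
```

A predicted frame that misses one element of four still scores well here, whereas exact frame accuracy would count it as simply wrong; this is the partial credit the text describes.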
Figure 4: Visualization of Incremental Speech Processing.

Each of our incremental confidence models makes a
binary prediction for each partial NLU result as an
utterance proceeds. At each time t during an utterance,
we consider the current NLU F-score F_t as well as the
final NLU F-score F_final that will be achieved at the
conclusion of the utterance. In (DeVault et al., 2009)
and (DeVault et al., 2011a), we explored the use of
data-driven decision tree classifiers to make predictions
about these values, for example whether F_t exceeds a
threshold (current level of understanding is "high"),
whether F_t >= F_final (current level of understanding
will not improve), or whether F_final will exceed the
threshold (final level of understanding will be "high").
In this demonstration, we focus on the
first and third of these incremental confidence met-
rics, which we summarize as "Now Understanding"
and "Will Understand", respectively. In an evalua-
tion over all partial ASR results for 990 utterances
in this new scenario, we found the Now Under-
standing model to have precision/recall/F-Score of
.92/.75/.82, and the Will Understand model to have
precision/recall/F-Score of .93/.85/.89. These incre-
mental confidence models therefore provide poten-
tially useful real-time information to Utah and Har-
mony about whether they are currently understand-
ing a user utterance, and whether they will ever un-
derstand a user utterance.
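The shape of such a confidence model can be sketched as follows. This is a hedged illustration of a "Will Understand"-style classifier: the features (words so far, predicted-frame size, classifier probability) and the training data are invented, not the features or corpus of the actual system.

```python
# Sketch: a decision tree predicting, from features of a partial NLU
# result, whether the final NLU F-score will reach the "high
# understanding" threshold. Features and data are hypothetical.
from sklearn.tree import DecisionTreeClassifier

# One row per partial result: (ASR words so far,
#                              predicted-frame size,
#                              max classifier probability)
X = [
    [2, 3, 0.31],
    [6, 8, 0.55],
    [11, 10, 0.74],
    [3, 2, 0.22],
    [7, 4, 0.28],
]
# Label: 1 if the final F-score for that utterance turned out "high".
y = [0, 1, 1, 0, 0]

clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

# Query the model for a new partial result arriving mid-utterance:
print(clf.predict([[8, 9, 0.6]])[0])
```

At runtime the agents would query such a model on every partial result, giving them a running binary signal of whether the utterance is, or will be, understood.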
The incremental ASR, NLU, and confidence
models are passed to the dialogue managers for each
of the agents, Harmony and Utah. These agents then
relate these inputs to their own models of dialogue
context, plans, and emotions, to calculate pragmatic
interpretations, including speech acts, reference res-
olution, participant status, and how they feel about
what is being discussed. A subset of this informa-
tion is passed to the non-verbal behavior generation
module to produce incremental non-verbal listening
behaviors (Wang et al., 2011).
In support of this demonstration, we have ex-
tended the implementation to include a real-time vi-
sualization of incremental speech processing results,
which will allow attendees to track the virtual hu-
mans' understanding as an utterance progresses. An
example of this visualization is shown in Figure 4.