Incremental Speech Understanding in a Multi-Party Virtual Human Dialogue System

Title	Incremental Speech Understanding in a Multi-Party Virtual Human Dialogue System
Publication Type	Conference Paper
Year of Publication	2012
Authors	DeVault, D., and D. R. Traum
Conference Name	NAACL HLT 2012
Date Published	June 3-8, 2012
Conference Location	Montreal, Canada
Abstract	This demonstration highlights some emerging capabilities for incremental speech understanding and processing in virtual human dialogue systems. This work is part of an ongoing effort that aims to enable realistic spoken dialogue with virtual humans in multi-party negotiation scenarios (Plüss et al., 2011; Traum et al., 2008b). These scenarios are designed to allow trainees to practice their negotiation skills by engaging in face-to-face spoken negotiation with one or more virtual humans.

An important component in achieving naturalistic behavior in these negotiation scenarios, which ideally should have the virtual humans demonstrating fluid turn-taking, complex reasoning, and responding to factors like trust and emotions, is for the virtual humans to begin to understand and in some cases respond in real time to users' speech, as the users are speaking (DeVault et al., 2011b). These responses could range from relatively straightforward turn management behaviors, like having a virtual human recognize when it is being addressed by a user utterance, and possibly turn to look at the user who has started speaking, to more complex responses such as emotional reactions to the content of what users are saying.

The current demonstration extends our previous demonstration of incremental processing (Sagae et al., 2010) in several important respects. First, it includes additional indicators, as described in (DeVault et al., 2011a). Second, it is applied to a new domain, an extension of that presented in (Plüss et al., 2011). Finally, it is integrated with the dialogue models (Traum et al., 2008a), such that each partial interpretation is given a full pragmatic interpretation by each virtual character, which can be used to generate real-time incremental non-verbal feedback (Wang et al., 2011).

Figure 1: SASO negotiation in the saloon: Utah (left) looking at Harmony (right).

Our demonstration is set in an implemented multi-party negotiation domain (Plüss et al., 2011) in which two virtual humans, Utah and Harmony (pictured in Figure 1), talk with two human negotiation trainees, who play the roles of Ranger and Deputy. The dialogue takes place inside a saloon in an American town in the Old West. In this negotiation scenario, the goal of the two human role players is to convince Utah and Harmony that Utah, who is currently employed as the local bartender, should take on the job of town sheriff.

One of the research aims for this work is to support natural dialogue interaction, an example of which is the excerpt of human role play dialogue shown in Figure 2. One of the key features of immersive role plays is that people often react in multiple ways to the utterances of others as they are speaking. For example, in this excerpt, the beginning of the Deputy's utterance overlaps the end of the Ranger's, and then Utah interrupts the Deputy and takes the floor a few seconds later.

Ranger (00:03:56.660 - 00:04:08.830): We can't leave this place and have it overrun by outlaws. Uh there's no way that's gonna happen so we're gonna make sure we've got a properly deputized and equipped sheriff ready to maintain order in this area.
Deputy (00:04:06.370 - 00:04:09.850): Yeah and you know and and we're willing to
Utah (00:04:09.090 - 00:04:22.880): And I don't have to leave the bar completely. I can still uh be here part time and I can um we can hire someone to do the like day to day work and I'll do the I'll supervise them and I'll teach them.
Figure 2: Dialogue excerpt from one of the role plays. Timestamps indicate the start and end of each utterance.

Our prediction approach to incremental speech understanding utilizes a corpus of in-domain spoken utterances, including both paraphrases selected and spoken by system developers, as well as spoken utterances from user testing sessions (DeVault et al., 2011b).
An example of a corpus element is shown in Figure 3. In previous negotiation domains, we have found a fairly high word error rate in automatic speech recognition results for such spontaneous multi-party dialogue data; for example, our average word error rate was 0.39 in the SASO-EN negotiation domain (Traum et al., 2008b) with many (15%) out-of-domain utterances. Our speech understanding framework is robust to these kinds of problems (DeVault et al., 2011b), partly through approximating the meaning of utterances. Utterance meanings are represented using an attribute-value matrix (AVM), where the attributes and values represent semantic information that is linked to a domain-specific ontology and task model (Traum, 2003; Hartholt et al., 2008; Plüss et al., 2011). The AVMs are linearized, using a path-value notation, as seen in Figure 3. In our framework, we use this data to train two data-driven models, one for incremental natural language understanding, and a second for incremental confidence modeling.

The first step is to train a predictive incremental understanding model. This model is based on maximum entropy classification, and treats entire individual frames as output classes, with input features extracted from partial ASR results, calculated in increments of 200 milliseconds (DeVault et al., 2011b).

· Utterance (speech): i've come here today to talk to you about whether you'd like to become the sheriff of this town
· ASR (NLU input): have come here today to talk to you about would the like to become the sheriff of this town
· Frame (NLU output):
    .mood interrogative
    .sem.modal.desire want
    .sem.prop.agent utah
    .sem.prop.event providePublicServices
    .sem.prop.location town
    .sem.prop.theme sheriff-job
    .sem.prop.type event
    .sem.q-slot polarity
    .sem.speechact.type info-req
    .sem.type question
Figure 3: Example of a corpus training example.

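The path-value linearization of an AVM can be illustrated with a short sketch. The nested dictionary below is only an assumed in-memory encoding of the Figure 3 frame, and the function name `linearize` is hypothetical; the text does not describe how the system itself stores or serializes frames.

```python
def linearize(avm, prefix=""):
    """Flatten a nested attribute-value matrix into path-value lines.

    Assumption: the AVM is a nested dict whose leaves are strings.
    Attributes are visited in sorted order so the output is deterministic.
    """
    lines = []
    for attribute in sorted(avm):
        value = avm[attribute]
        path = f"{prefix}.{attribute}"
        if isinstance(value, dict):
            # Descend into the sub-matrix, extending the path.
            lines.extend(linearize(value, path))
        else:
            lines.append(f"{path} {value}")
    return lines


# Hypothetical nested encoding of the frame shown in Figure 3.
FIGURE3_AVM = {
    "mood": "interrogative",
    "sem": {
        "modal": {"desire": "want"},
        "prop": {
            "agent": "utah",
            "event": "providePublicServices",
            "location": "town",
            "theme": "sheriff-job",
            "type": "event",
        },
        "q-slot": "polarity",
        "speechact": {"type": "info-req"},
        "type": "question",
    },
}

if __name__ == "__main__":
    for line in linearize(FIGURE3_AVM):
        print(line)
```

Each resulting line is one attribute-value pair (frame element), which is the unit the frame-level comparison in the following text operates on.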
Each partial ASR result then serves as an incremental input to NLU, which is specially trained for partial input as discussed in (Sagae et al., 2009). NLU is predictive in the sense that, for each partial ASR result, the NLU module produces as output the complete frame that has been associated by a human annotator with the user's complete utterance, even if that utterance has not yet been fully processed by the ASR. For a detailed analysis of the performance of the predictive NLU, see (DeVault et al., 2011b).

The second step in our framework is to train a set of incremental confidence models (DeVault et al., 2011a), which allow the agents to assess in real time, while a user is speaking, how well the understanding process is proceeding. The incremental confidence models build on the notion of NLU F-score, which we use to quantify the quality of a predicted NLU frame in relation to the hand-annotated correct frame. The NLU F-score is the harmonic mean of the precision and recall of the attribute-value pairs (or frame elements) that compose the predicted and correct frames for each partial ASR result. By using precision and recall of frame elements, rather than simply looking at frame accuracy, we take into account that certain frames are more similar than others, and allow for cases when the correct frame is not in the training set.

Figure 4: Visualization of Incremental Speech Processing.

Each of our incremental confidence models makes a binary prediction for each partial NLU result as an utterance proceeds. At each time $t$ during an utterance, we consider the current NLU F-score $F_t$ as well as the final NLU F-score $F_{\text{final}}$ that will be achieved at the conclusion of the utterance.
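The NLU F-score defined above can be sketched in a few lines. This is a minimal illustration, assuming frames are given as collections of path-value strings; the function name `nlu_f_score` and the convention of returning 0.0 when no frame elements match are assumptions, not details taken from the system.

```python
def nlu_f_score(predicted, correct):
    """Harmonic mean of precision and recall over frame elements.

    Assumption: each frame is an iterable of attribute-value strings
    such as ".sem.prop.agent utah".
    """
    predicted, correct = set(predicted), set(correct)
    matched = len(predicted & correct)
    if matched == 0:
        # Assumed convention: no shared frame elements scores zero.
        return 0.0
    precision = matched / len(predicted)
    recall = matched / len(correct)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    correct = [".mood interrogative", ".sem.prop.agent utah",
               ".sem.prop.theme sheriff-job", ".sem.type question"]
    predicted = [".mood interrogative", ".sem.prop.agent utah",
                 ".sem.type statement"]
    # Two of three predicted elements are right; two of four correct
    # elements are recovered.
    print(nlu_f_score(predicted, correct))
```

A predicted frame that shares most of its elements with the correct frame therefore receives partial credit, which plain frame accuracy would not give.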
In (DeVault et al., 2009) and (DeVault et al., 2011a), we explored the use of data-driven decision tree classifiers to make predictions about these values, for example whether $F_t \geq \frac{1}{2}$ (current level of understanding is "high"), $F_t \geq F_{\text{final}}$ (current level of understanding will not improve), or $F_{\text{final}} \geq \frac{1}{2}$ (final level of understanding will be "high"). In this demonstration, we focus on the first and third of these incremental confidence metrics, which we summarize as "Now Understanding" and "Will Understand", respectively. In an evaluation over all partial ASR results for 990 utterances in this new scenario, we found the Now Understanding model to have precision/recall/F-score of .92/.75/.82, and the Will Understand model to have precision/recall/F-score of .93/.85/.89. These incremental confidence models therefore provide potentially useful real-time information to Utah and Harmony about whether they are currently understanding a user utterance, and whether they will ever understand a user utterance.

The incremental ASR, NLU, and confidence models are passed to the dialogue managers for each of the agents, Harmony and Utah. These agents then relate these inputs to their own models of dialogue context, plans, and emotions, to calculate pragmatic interpretations, including speech acts, reference resolution, participant status, and how they feel about what is being discussed. A subset of this information is passed to the non-verbal behavior generation module to produce incremental non-verbal listening behaviors (Wang et al., 2011).

In support of this demonstration, we have extended the implementation to include a real-time visualization of incremental speech processing results, which will allow attendees to track the virtual humans' understanding as an utterance progresses. An example of this visualization is shown in Figure 4.
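The binary quantities that the confidence models predict can be illustrated as labels computed from a sequence of per-increment F-scores. This sketch only derives the target labels; it is not the decision tree classifiers themselves, whose input features are not described here. It assumes that "high" means an F-score of at least one half and that "will not improve" means the current score is at least the final one. The function and key names are hypothetical.

```python
def confidence_targets(f_scores, threshold=0.5):
    """Label each partial result of one utterance.

    Assumptions: f_scores holds the NLU F-score after each partial ASR
    result, in time order, and the last entry is the final score.
    """
    if not f_scores:
        return []
    f_final = f_scores[-1]
    return [
        {
            # "Now Understanding": current understanding is already high.
            "now_understanding": f_t >= threshold,
            # Current understanding will not improve by the end.
            "will_not_improve": f_t >= f_final,
            # "Will Understand": final understanding will be high.
            "will_understand": f_final >= threshold,
        }
        for f_t in f_scores
    ]


if __name__ == "__main__":
    # Hypothetical F-scores for four successive partial ASR results.
    for labels in confidence_targets([0.0, 0.4, 0.8, 0.8]):
        print(labels)
```

Labels of this kind are known only after the utterance ends, which is why a trained model is needed to predict them while the user is still speaking.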
URL	http://www.pdfdownload.org/pdf2html/pdf2html.php?url=http%3A%2F%2Fpeople.ict.usc.edu%2F~traum%2FPapers%2Fnaaclhlt2012.pdf&images=yes

Natural Language Dialogue group