Dialogue Corpus Annotation

The Natural Language Dialogue group maintains a large set of corpora for use in our research. Most of these corpora were collected in connection with a specific virtual human effort; they include dialogues between humans and virtual humans as well as dialogues between human participants (such as role plays). Many of the corpora contain speech, which is transcribed; where automatic speech recognition (ASR) was used, we retain its output as well. Further annotations depend on the project for which the corpus was collected, and are used to train the machine-learning components that drive the systems. Typical annotations include dialogue acts; semantic representations of utterances, written in system-specific semantic languages; appropriate responses to questions, for direct question-answering systems; segmentation of utterances into individual meaningful units; and syntactic annotation for understanding and generation (typically produced with external parsers, chunkers and taggers). We also annotate some corpora for evaluation purposes; these annotations typically rate the correctness or appropriateness of various system outputs.
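
As a rough illustration of how the annotation layers described above might sit together on a single utterance, here is a minimal Python sketch. It is not the group's actual storage format: the class name, the field names, the example values, and the helper annotated_layers are all hypothetical, chosen only to mirror the layers listed in the paragraph.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AnnotatedUtterance:
    """Hypothetical record for one utterance; every field name is an assumption."""
    speaker: str                            # e.g. "human" or "virtual_human"
    transcript: str                         # manual transcription of the speech
    asr_output: Optional[str] = None        # retained ASR hypothesis, if ASR was used
    dialogue_act: Optional[str] = None      # dialogue act label
    semantics: Optional[Dict[str, str]] = None  # system-specific semantic representation
    segments: List[str] = field(default_factory=list)  # individual meaningful units
    linked_response: Optional[str] = None   # appropriate response, for question answering
    rating: Optional[int] = None            # evaluation score for a system output


def annotated_layers(utt: AnnotatedUtterance) -> List[str]:
    """Return the names of the optional annotation layers that are filled in."""
    optional = ["asr_output", "dialogue_act", "semantics",
                "segments", "linked_response", "rating"]
    present = []
    for name in optional:
        value = getattr(utt, name)
        # A layer counts as present unless it is None or an empty list.
        if value is not None and value != []:
            present.append(name)
    return present


# Invented example data, for illustration only.
example = AnnotatedUtterance(
    speaker="human",
    transcript="where is the clinic",
    asr_output="where is the clinic",
    dialogue_act="question",
    segments=["where is the clinic"],
)

print(annotated_layers(example))
```

Keeping every layer optional reflects the point that annotations vary by project: a corpus collected for a question-answering system might fill in only linked_response, while one collected for evaluation might carry only rating.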

NLD Group Leaders