We use a variety of methods to evaluate the effectiveness of our dialogue systems. These include component evaluation, such as scoring speech recognition and NLU output against a gold standard, and standard external measures such as user questionnaires and task success rate. We are also devoting an increasing part of our effort to developing new methods for evaluating virtual humans in complex dialogue situations. Since most of our agents' functions are conversational rather than task-oriented, we focus on methods that uncover what contributes to the perceived quality of the overall dialogue interaction between the virtual human and the user, based on analysis of dialogue transcripts. Through this process, we have developed several annotation schemes for evaluating virtual human performance.
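For illustration, the sketch below shows one way such a gold-standard component comparison can be computed: word error rate for speech recognition and exact-match frame accuracy for NLU. The utterance and frame representations are hypothetical placeholders, not the actual formats used by our system.

```python
# Minimal sketch of gold-standard component scoring (illustrative only;
# the data formats here are hypothetical, not the system's actual ones).

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard word-level edit-distance dynamic program.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def frame_accuracy(gold_frames: list, predicted_frames: list) -> float:
    """Fraction of utterances whose predicted NLU frame exactly matches the gold frame."""
    matches = sum(1 for g, p in zip(gold_frames, predicted_frames) if g == p)
    return matches / max(len(gold_frames), 1)

if __name__ == "__main__":
    # Hypothetical gold transcript vs. ASR hypothesis.
    print(word_error_rate("where is the nearest clinic", "where is the nearest clinic please"))
    # Hypothetical gold vs. predicted dialogue-act frames.
    gold = [{"act": "question", "topic": "clinic-location"}]
    pred = [{"act": "question", "topic": "clinic-location"}]
    print(frame_accuracy(gold, pred))
```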