Practical Evaluation of Human and Synthesized Speech for Virtual Human Dialogue Systems

Publication Type: Conference Paper
Year of Publication: 2012
Authors: Georgila, K., A. W. Black, K. Sagae, and D. R. Traum
Conference Name: International Conference on Language Resources and Evaluation (LREC)
Date Published: May 2012
Conference Location: Istanbul, Turkey

The current practice in virtual human dialogue systems is to use professional human recordings or limited-domain speech synthesis. Both approaches lead to good performance but at a high cost. To determine the best trade-off between performance and cost, we perform a systematic evaluation of human and synthesized voices with regard to naturalness, conversational quality, and likability. We also vary the type (in-domain vs. out-of-domain), length, and content of utterances, and take into account the age and native language of raters as well as their familiarity with speech synthesis. We present detailed results from two studies: a pilot study and one run on Amazon's Mechanical Turk. Our results suggest that a professional human voice can outperform both an amateur human voice and synthesized voices. Also, a high-quality general-purpose voice or a good limited-domain voice can perform better than amateur human recordings. We do not find any significant differences between the performance of a high-quality general-purpose voice and a limited-domain voice, both trained with speech recorded by actors. As expected, in most cases the high-quality general-purpose voice is rated higher than the limited-domain voice for out-of-domain sentences and lower for in-domain sentences. There is also a trend, though not statistically significant, for long or negative-content utterances to receive lower ratings.