Cross-modal evaluation of high-quality emotional speech synthesis with the Virtual Human Toolkit

Abstract

Emotional expression is a key requirement for intelligent virtual agents. In order for an agent to produce dynamic spoken content, speech synthesis is required. However, despite substantial work with pre-recorded prompts, very little work has explored the combined effect of high-quality emotional speech synthesis and facial expression. In this paper we offer a baseline evaluation of the naturalness and emotional range available by combining the freely available SmartBody component of the Virtual Human Toolkit (VHTK) with the CereVoice text-to-speech (TTS) system. Results echo previous work using pre-recorded prompts: the visual modality is dominant and the modalities do not interact. This allows the speech synthesis to add gradual changes to the perceived emotion, both in terms of valence and activation. The reported naturalness is good: 3.54 on a 5-point MOS scale.
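As a rough illustration of how a naturalness score like the one reported could be derived, the following is a minimal sketch of computing a mean opinion score (MOS) from 5-point ratings. The listener identifiers and rating values are hypothetical, not from the paper; only the aggregation (an arithmetic mean over all individual ratings) reflects standard MOS practice.

from statistics import mean

# Hypothetical 5-point naturalness ratings per listener.
# The paper reports an overall naturalness MOS of 3.54.
ratings = {
    "listener_01": [4, 3, 4, 4],
    "listener_02": [3, 4, 3, 4],
    "listener_03": [4, 3, 3, 4],
}

# MOS is the arithmetic mean over all individual ratings.
all_scores = [s for scores in ratings.values() for s in scores]
mos = mean(all_scores)
print(f"Naturalness MOS: {mos:.2f} on a 5-point scale")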

Citation (APA)

Potard, B., Aylett, M. P., & Baude, D. A. (2016). Cross-modal evaluation of high quality emotional speech synthesis with the Virtual Human Toolkit. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10011 LNAI, pp. 190–197). Springer Verlag. https://doi.org/10.1007/978-3-319-47665-0_17
