Cross-modal evaluation of high-quality emotional speech synthesis with the Virtual Human Toolkit

Abstract

Emotional expression is a key requirement for intelligent virtual agents. In order for an agent to produce dynamic spoken content, speech synthesis is required. However, despite substantial work with pre-recorded prompts, very little work has explored the combined effect of high-quality emotional speech synthesis and facial expression. In this paper we offer a baseline evaluation of the naturalness and emotional range available by combining the freely available SmartBody component of the Virtual Human Toolkit (VHTK) with the CereVoice text-to-speech (TTS) system. Results echo previous work using pre-recorded prompts: the visual modality is dominant and the modalities do not interact. This allows the speech synthesis to add gradual changes to the perceived emotion, both in terms of valence and activation. The reported naturalness is good: 3.54 on a 5-point MOS scale.
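As a rough illustration of how a naturalness score like the one reported could be derived, the following is a minimal sketch of computing a mean opinion score (MOS) from 5-point ratings. The listener identifiers and rating values are hypothetical, not from the paper; only the aggregation (an arithmetic mean over all individual ratings) reflects standard MOS practice.

from statistics import mean

# Hypothetical 5-point naturalness ratings per listener.
# The paper reports an overall naturalness MOS of 3.54.
ratings = {
    "listener_01": [4, 3, 4, 4],
    "listener_02": [3, 4, 3, 4],
    "listener_03": [4, 3, 3, 4],
}

# MOS is the arithmetic mean over all individual ratings.
all_scores = [s for scores in ratings.values() for s in scores]
mos = mean(all_scores)
print(f"Naturalness MOS: {mos:.2f} on a 5-point scale")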

Citation (APA)

Potard, B., Aylett, M. P., & Baude, D. A. (2016). Cross-modal evaluation of high quality emotional speech synthesis with the Virtual Human Toolkit. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10011 LNAI, pp. 190–197). Springer Verlag. https://doi.org/10.1007/978-3-319-47665-0_17
