Generating coherent spontaneous speech and gesture from text

Simon Alexanderson; Éva Székely; Gustav Eje Henter; Taras Kucherenko; Jonas Beskow

Conference Proceedings, Open Access

Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents, IVA 2020 (2020)

DOI: 10.1145/3383652.3423874

Abstract

Embodied human communication encompasses both verbal (speech) and non-verbal (e.g., gesture and head movements) information. Recent advances in machine learning have substantially improved the technologies for generating synthetic versions of both of these types of data: on the speech side, text-to-speech systems are now able to generate highly convincing, spontaneous-sounding speech using unscripted speech audio as the source material. On the motion side, probabilistic motion-generation methods can now synthesise vivid and lifelike speech-driven 3D gesticulation. In this paper, we put these two state-of-the-art technologies together in a coherent fashion for the first time. Concretely, we demonstrate a proof-of-concept system, trained on a single-speaker audio and motion-capture dataset, that is able to generate both speech and full-body gestures together from text input. In contrast to previous approaches for joint speech-and-gesture generation, we generate full-body gestures from speech synthesis trained on recordings of spontaneous speech from the same person as the motion-capture data. We illustrate our results by visualising gesture spaces and text-speech-gesture alignments, and through a demonstration video.

Cite

Alexanderson, S., Székely, É., Henter, G. E., Kucherenko, T., & Beskow, J. (2020). Generating coherent spontaneous speech and gesture from text. In Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents, IVA 2020. Association for Computing Machinery, Inc. https://doi.org/10.1145/3383652.3423874
