Audiovisual text-to-speech systems convert written text into an audiovisual speech signal. Recently, much interest has been directed toward data-driven 2D photorealistic synthesis, where the system uses a database of pre-recorded auditory and visual speech data to construct the target output signal. In this paper we propose a synthesis technique that creates both the target auditory and the target visual speech from the same audiovisual database. To achieve this, the well-known unit selection synthesis technique is extended to work with multimodal segments containing original combinations of audio and video. This strategy results in a multimodal output signal that displays a high level of audiovisual correlation, which is crucial for a natural perception of the synthetic speech signal. © 2008 Springer-Verlag Berlin Heidelberg.
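The core mechanism described in the abstract is unit selection: a dynamic-programming (Viterbi) search over candidate database segments that minimizes the sum of target costs (how well a candidate matches the desired segment) and join costs (how smoothly consecutive candidates concatenate). The sketch below illustrates this search in generic form; it is a minimal illustration, not the authors' implementation, and the cost functions, unit representation, and function names are all assumptions for the example. In the multimodal setting of the paper, each unit would carry both its audio and its video features, so the join cost can penalize discontinuities in either modality.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Viterbi-style unit selection: pick one candidate unit per target
    segment, minimizing summed target costs plus join costs between
    consecutive selected units. `candidates[i]` lists database units
    available for target position i."""
    n = len(targets)
    # best[i][j]: minimal total cost of any path ending at candidates[i][j]
    best = [[0.0] * len(candidates[i]) for i in range(n)]
    back = [[0] * len(candidates[i]) for i in range(n)]

    # Initialize with target costs only (no join cost before the first unit).
    for j, unit in enumerate(candidates[0]):
        best[0][j] = target_cost(targets[0], unit)

    # Forward pass: extend each path by the cheapest predecessor.
    for i in range(1, n):
        for j, unit in enumerate(candidates[i]):
            tc = target_cost(targets[i], unit)
            costs = [best[i - 1][k] + join_cost(prev, unit)
                     for k, prev in enumerate(candidates[i - 1])]
            k = min(range(len(costs)), key=costs.__getitem__)
            best[i][j] = costs[k] + tc
            back[i][j] = k

    # Backtrack from the cheapest final unit.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return [candidates[i][path[i]] for i in range(n)]
```

As a toy usage, with units reduced to scalar features, `select_units([0, 1], [[0.0, 0.9], [1.1, 0.2]], lambda t, u: abs(t - u), lambda a, b: 0.5 * abs(a - b))` trades a worse target match at the second position for a cheaper join, returning `[0.0, 1.1]`. The paper's contribution lies in keeping audio and video together inside each unit, so that this single search preserves the natural audiovisual correlation of the recordings.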
CITATION STYLE
Mattheyses, W., Latacz, L., Verhelst, W., & Sahli, H. (2008). Multimodal unit selection for 2D audiovisual text-to-speech synthesis. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 5237 LNCS, pp. 125–136). Springer Verlag. https://doi.org/10.1007/978-3-540-85853-9_12