Incremental text-to-speech synthesis with prefix-to-prefix framework

Abstract

Text-to-speech synthesis (TTS) has witnessed rapid progress in recent years, with neural methods now capable of producing highly natural audio. However, these efforts still suffer from two types of latency: (a) the computational latency (synthesis time), which grows linearly with the sentence length, and (b) the input latency in scenarios where the input text becomes available incrementally (such as in simultaneous translation, dialog generation, and assistive technologies). To reduce these latencies, we propose a neural incremental TTS approach using the prefix-to-prefix framework from simultaneous translation. We synthesize speech in an online fashion, playing a segment of audio while generating the next, resulting in an O(1) rather than O(n) latency. Experiments on English and Chinese TTS show that our approach achieves speech naturalness similar to that of full-sentence TTS, but with only a constant latency of 1–2 words.
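
The scheduling idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `incremental_tts` generator, the `lookahead` parameter, and the `fake_synthesize` placeholder are all hypothetical names introduced here, and a real system would call a neural TTS model and play audio concurrently. The sketch only shows how each word's audio can be emitted after a fixed number of future words has arrived, so the wait before speaking does not grow with sentence length.

```python
from typing import Callable, Iterable, Iterator, List


def incremental_tts(
    words: Iterable[str],
    synthesize: Callable[[List[str], int], str],
    lookahead: int = 1,
) -> Iterator[str]:
    """Yield one audio segment per word, waiting for only `lookahead` future words.

    `synthesize(prefix, i)` is a hypothetical stand-in for a real TTS model:
    it returns the audio for word i, conditioned on the text prefix seen so far.
    """
    seen: List[str] = []
    next_to_speak = 0
    for word in words:
        seen.append(word)
        # Word i can be spoken once word i + lookahead has arrived.
        while next_to_speak + lookahead < len(seen):
            yield synthesize(seen, next_to_speak)
            next_to_speak += 1
    # Input ended: flush the remaining words using the full text as context.
    while next_to_speak < len(seen):
        yield synthesize(seen, next_to_speak)
        next_to_speak += 1


def fake_synthesize(prefix: List[str], i: int) -> str:
    # Placeholder "audio": records the spoken word and how much text was visible.
    return f"{prefix[i]}|ctx={len(prefix)}"


if __name__ == "__main__":
    for segment in incremental_tts("the cat sat".split(), fake_synthesize, lookahead=1):
        print(segment)
```

With `lookahead=1`, the first segment is produced as soon as the second word arrives, regardless of how long the sentence eventually becomes.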

Ma, M., Zheng, B., Liu, K., Zheng, R., Liu, H., Peng, K., … Huang, L. (2020). Incremental text-to-speech synthesis with prefix-to-prefix framework. In Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 3886–3896). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2020.findings-emnlp.346
