Text-to-speech (TTS) synthesis has witnessed rapid progress in recent years, with neural methods now capable of producing audio with high naturalness. However, these methods still suffer from two types of latency: (a) computational latency (synthesis time), which grows linearly with sentence length, and (b) input latency in scenarios where the input text becomes available incrementally (such as simultaneous translation, dialog generation, and assistive technologies). To reduce both latencies, we propose a neural incremental TTS approach based on the prefix-to-prefix framework from simultaneous translation. We synthesize speech in an online fashion, playing one audio segment while generating the next, yielding O(1) rather than O(n) latency. Experiments on English and Chinese TTS show that our approach achieves naturalness comparable to full-sentence TTS while incurring only a constant (1–2 word) latency.
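The prefix-to-prefix idea described above can be sketched as a simple scheduling policy: synthesize and play the audio for word i as soon as a fixed lookahead of k future words is available, rather than waiting for the full sentence. The sketch below is a minimal illustration of that policy only; `synthesize_word` is a hypothetical stand-in for a real acoustic model and vocoder, and the lookahead-k rule is an assumption about the policy, not the paper's exact implementation.

```python
from typing import Iterator, List


def synthesize_word(context: List[str], index: int) -> str:
    # Placeholder: a real system would run an acoustic model and
    # vocoder on the prefix `context` and return a waveform segment
    # for word `index`. Here we just return a tag for illustration.
    return f"audio({context[index]})"


def incremental_tts(words: Iterator[str], lookahead: int = 1) -> List[str]:
    """Prefix-to-prefix policy sketch: emit audio for word i once
    `lookahead` words of future context are available, so playback
    can begin after a constant (not sentence-length) delay."""
    buffer: List[str] = []          # words received so far
    audio_segments: List[str] = []  # audio emitted so far
    for w in words:
        buffer.append(w)
        # Synthesize every word that now has `lookahead` words of
        # future context; each segment could be played immediately.
        while len(buffer) > lookahead + len(audio_segments):
            i = len(audio_segments)
            audio_segments.append(synthesize_word(buffer, i))
    # End of sentence: flush the remaining words without lookahead.
    for i in range(len(audio_segments), len(buffer)):
        audio_segments.append(synthesize_word(buffer, i))
    return audio_segments
```

With `lookahead=1`, the first segment is produced as soon as the second word arrives, which is the constant 1–2 word latency the abstract refers to.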
Ma, M., Zheng, B., Liu, K., Zheng, R., Liu, H., Peng, K., … Huang, L. (2020). Incremental text-to-speech synthesis with prefix-to-prefix framework. In Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 3886–3896). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.346