Incremental text-to-speech synthesis with prefix-to-prefix framework

Mingbo Ma; Baigong Zheng; Kaibo Liu; Renjie Zheng; Hairong Liu; Kainan Peng; Kenneth Church; Liang Huang

Conference Proceedings, Open Access

Findings of the Association for Computational Linguistics: EMNLP 2020 (2020), pp. 3886–3896

DOI: 10.18653/v1/2020.findings-emnlp.346

Abstract

Text-to-speech synthesis (TTS) has progressed rapidly in recent years, and neural methods can now produce highly natural audio. However, these systems still suffer from two types of latency: (a) computational latency (synthesis time), which grows linearly with sentence length, and (b) input latency in scenarios where the input text becomes available incrementally (such as simultaneous translation, dialog generation, and assistive technologies). To reduce both, we propose a neural incremental TTS approach using the prefix-to-prefix framework from simultaneous translation. We synthesize speech in an online fashion, playing one segment of audio while generating the next, which yields O(1) rather than O(n) latency. Experiments on English and Chinese TTS show that our approach achieves speech naturalness similar to full-sentence TTS with only a constant latency of 1–2 words.
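The prefix-to-prefix idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `incremental_tts`, the `lookahead` parameter, and the stand-in `fake_synthesize` callback are hypothetical names introduced here, and a real system would call a neural acoustic model and vocoder where the stub returns a string. The sketch only shows the scheduling: the segment for word i is produced as soon as `lookahead` further words have arrived, so the wait before the first audio does not depend on sentence length.

```python
from typing import Callable, Iterable, Iterator, List


def incremental_tts(
    words: Iterable[str],
    synthesize: Callable[[List[str], int], str],
    lookahead: int = 1,
) -> Iterator[str]:
    """Yield one audio segment per word using a fixed lookahead (hypothetical sketch).

    `synthesize(prefix, i)` is an assumed callback that renders word i
    conditioned only on the text prefix received so far.
    """
    seen: List[str] = []
    next_to_emit = 0
    for word in words:
        seen.append(word)
        # Emit segment i once words up to i + lookahead are available.
        while next_to_emit + lookahead < len(seen):
            yield synthesize(seen, next_to_emit)
            next_to_emit += 1
    # The input has ended: flush the remaining words without further lookahead.
    while next_to_emit < len(seen):
        yield synthesize(seen, next_to_emit)
        next_to_emit += 1


def fake_synthesize(prefix: List[str], i: int) -> str:
    """Stand-in for a real TTS model: records the word and the prefix length seen."""
    return f"{prefix[i]}|{len(prefix)}"


segments = list(incremental_tts(["a", "b", "c", "d"], fake_synthesize, lookahead=1))
```

Because the function is a generator, a caller can play each yielded segment while the next one is being produced. Under these assumptions, each segment is generated after seeing at most one future word, regardless of how long the sentence turns out to be.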

Citation (APA)

Ma, M., Zheng, B., Liu, K., Zheng, R., Liu, H., Peng, K., … Huang, L. (2020). Incremental text-to-speech synthesis with prefix-to-prefix framework. In Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 3886–3896). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2020.findings-emnlp.346
