Recently, the field of text-to-speech (TTS) has been dominated by one-stage text-to-waveform models, which significantly improve speech quality over two-stage models. In this work, we propose EfficientTTS 2 (EFTS2), a one-stage, high-quality, end-to-end TTS framework that is fully differentiable and highly efficient. Our method adopts an adversarial training process, with a differentiable aligner and a hierarchical-VAE-based waveform generator. These design choices free the model from the external aligners, invertible structures, and complex training procedures that most previous TTS works rely on. Moreover, we extend EFTS2 to the voice conversion (VC) task and propose EFTS2-VC, an end-to-end VC model that enables high-quality speech-to-speech conversion. Experimental results suggest that the two proposed models achieve speech quality better than or at least comparable to that of the baseline models, while offering faster inference and smaller model sizes.
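To make the two named components concrete, the sketch below illustrates (i) a differentiable aligner, here approximated with Gaussian-upsampling-style soft alignment from predicted durations, and (ii) a two-level hierarchical VAE generator. This is a minimal illustration under stated assumptions, not the authors' implementation: the module names (SoftAligner, HierVAEGenerator), layer choices, and the specific alignment mechanism are all hypothetical stand-ins for the architecture the abstract describes.

```python
# Minimal sketch (not the authors' code): the two components the abstract
# names, a differentiable aligner and a hierarchical-VAE waveform generator.
# All module names, layers, and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class SoftAligner(nn.Module):
    """Differentiable text-to-frame alignment via Gaussian-style upsampling.

    Predicted per-token durations define soft centers; each output frame
    attends over all text tokens, so gradients flow end to end and no
    external aligner is required.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.duration = nn.Sequential(nn.Linear(d_model, 1), nn.Softplus())

    def forward(self, text_h: torch.Tensor) -> torch.Tensor:
        # text_h: (batch, n_tokens, d_model)
        dur = self.duration(text_h).squeeze(-1)            # (B, T_text)
        centers = torch.cumsum(dur, dim=-1) - 0.5 * dur    # token midpoints
        n_frames = int(dur.sum(dim=-1).max().round().item())
        t = torch.arange(n_frames, device=text_h.device).float()
        # (B, T_frames, T_text): soft alignment weights over text tokens
        logits = -((t[None, :, None] - centers[:, None, :]) ** 2)
        align = torch.softmax(logits, dim=-1)
        return align @ text_h                              # (B, T_frames, d)


class HierVAEGenerator(nn.Module):
    """Two-level hierarchical VAE: a coarse latent conditions a fine latent,
    and the fine latent is decoded toward waveform-rate features."""

    def __init__(self, d_model: int, d_z: int = 16):
        super().__init__()
        self.enc_coarse = nn.Linear(d_model, 2 * d_z)        # -> (mu, logvar)
        self.enc_fine = nn.Linear(d_model + d_z, 2 * d_z)
        self.dec = nn.Linear(d_z, d_model)  # stand-in for the real upsampler

    @staticmethod
    def reparam(stats: torch.Tensor) -> torch.Tensor:
        mu, logvar = stats.chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def forward(self, frame_h: torch.Tensor) -> torch.Tensor:
        z1 = self.reparam(self.enc_coarse(frame_h))
        z2 = self.reparam(self.enc_fine(torch.cat([frame_h, z1], dim=-1)))
        # In a full model these features would feed a GAN-trained vocoder head,
        # matching the adversarial training process the abstract mentions.
        return self.dec(z2)


if __name__ == "__main__":
    B, T_text, D = 2, 11, 64
    text_h = torch.randn(B, T_text, D)
    frames = SoftAligner(D)(text_h)
    out = HierVAEGenerator(D)(frames)
    print(frames.shape, out.shape)
```

Because both the alignment and the latent sampling (via the reparameterization trick) are differentiable, a model built this way can be trained end to end with adversarial and KL losses, which is the property the abstract highlights.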
Miao, C., Zhu, Q., Chen, M., Ma, J., Wang, S., & Xiao, J. (2024). EfficientTTS 2: Variational End-to-End Text-to-Speech Synthesis and Voice Conversion. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32, 1650–1661. https://doi.org/10.1109/TASLP.2024.3369528