Recently, the field of text-to-speech (TTS) has been dominated by one-stage text-to-waveform models, which significantly improve speech quality over two-stage models. In this work, we propose EfficientTTS 2 (EFTS2), a one-stage, high-quality, end-to-end TTS framework that is fully differentiable and highly efficient. Our method adopts an adversarial training process, with a differentiable aligner and a hierarchical-VAE-based waveform generator. These design choices free the model from the external aligners, invertible structures, and complex training procedures that most previous TTS works rely on. Moreover, we extend EFTS2 to the voice conversion (VC) task and propose EFTS2-VC, an end-to-end VC model that enables high-quality speech-to-speech conversion. Experimental results suggest that the two proposed models achieve speech quality better than or at least comparable to that of the baseline models, while offering faster inference and smaller model sizes.
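To make the two named components concrete, the sketch below illustrates (i) a differentiable aligner, here approximated with Gaussian-upsampling-style soft alignment from predicted durations, and (ii) a two-level hierarchical VAE generator. This is a minimal illustration under stated assumptions, not the authors' implementation: the module names (SoftAligner, HierVAEGenerator), layer choices, and the specific alignment mechanism are all hypothetical stand-ins for the architecture the abstract describes.

```python
# Minimal sketch (not the authors' code): the two components the abstract
# names, a differentiable aligner and a hierarchical-VAE waveform generator.
# All module names, layers, and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class SoftAligner(nn.Module):
    """Differentiable text-to-frame alignment via Gaussian-style upsampling.

    Predicted per-token durations define soft centers; each output frame
    attends over all text tokens, so gradients flow end to end and no
    external aligner is required.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.duration = nn.Sequential(nn.Linear(d_model, 1), nn.Softplus())

    def forward(self, text_h: torch.Tensor) -> torch.Tensor:
        # text_h: (batch, n_tokens, d_model)
        dur = self.duration(text_h).squeeze(-1)            # (B, T_text)
        centers = torch.cumsum(dur, dim=-1) - 0.5 * dur    # token midpoints
        n_frames = int(dur.sum(dim=-1).max().round().item())
        t = torch.arange(n_frames, device=text_h.device).float()
        # (B, T_frames, T_text): soft alignment weights over text tokens
        logits = -((t[None, :, None] - centers[:, None, :]) ** 2)
        align = torch.softmax(logits, dim=-1)
        return align @ text_h                              # (B, T_frames, d)


class HierVAEGenerator(nn.Module):
    """Two-level hierarchical VAE: a coarse latent conditions a fine latent,
    and the fine latent is decoded toward waveform-rate features."""

    def __init__(self, d_model: int, d_z: int = 16):
        super().__init__()
        self.enc_coarse = nn.Linear(d_model, 2 * d_z)        # -> (mu, logvar)
        self.enc_fine = nn.Linear(d_model + d_z, 2 * d_z)
        self.dec = nn.Linear(d_z, d_model)  # stand-in for the real upsampler

    @staticmethod
    def reparam(stats: torch.Tensor) -> torch.Tensor:
        mu, logvar = stats.chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def forward(self, frame_h: torch.Tensor) -> torch.Tensor:
        z1 = self.reparam(self.enc_coarse(frame_h))
        z2 = self.reparam(self.enc_fine(torch.cat([frame_h, z1], dim=-1)))
        # In a full model these features would feed a GAN-trained vocoder head,
        # matching the adversarial training process the abstract mentions.
        return self.dec(z2)


if __name__ == "__main__":
    B, T_text, D = 2, 11, 64
    text_h = torch.randn(B, T_text, D)
    frames = SoftAligner(D)(text_h)
    out = HierVAEGenerator(D)(frames)
    print(frames.shape, out.shape)
```

Because both the alignment and the latent sampling (via the reparameterization trick) are differentiable, a model built this way can be trained end to end with adversarial and KL losses, which is the property the abstract highlights.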
Miao, C., Zhu, Q., Chen, M., Ma, J., Wang, S., & Xiao, J. (2024). EfficientTTS 2: Variational End-to-End Text-to-Speech Synthesis and Voice Conversion. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32, 1650–1661. https://doi.org/10.1109/TASLP.2024.3369528