EfficientTTS 2: Variational End-to-End Text-to-Speech Synthesis and Voice Conversion

Abstract

Recently, the field of Text-to-Speech (TTS) has been dominated by one-stage text-to-waveform models, which have significantly improved speech quality compared to two-stage models. In this work, we propose EfficientTTS 2 (EFTS2), a one-stage, high-quality, end-to-end TTS framework that is fully differentiable and highly efficient. Our method adopts an adversarial training process with a differentiable aligner and a hierarchical-VAE-based waveform generator. These design choices free the model from the external aligners, invertible structures, and complex training procedures that most previous TTS works rely on. Moreover, we extend EFTS2 to the voice conversion (VC) task and propose EFTS2-VC, an end-to-end VC model that enables high-quality speech-to-speech conversion. Experimental results suggest that the two proposed models achieve speech quality better than or comparable to that of baseline models, while also offering faster inference and smaller model sizes.
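To make the aligner component concrete, below is a minimal sketch (in PyTorch) of one way a differentiable monotonic aligner can be expressed, loosely in the spirit of the index-mapping-vector idea from the original EfficientTTS. Everything here is an illustrative assumption: the function names, tensor shapes, and the soft monotonicity penalty are not the authors' EFTS2 implementation.

import torch
import torch.nn.functional as F

def soft_alignment(energies: torch.Tensor) -> torch.Tensor:
    # energies: (batch, text_len, mel_len) unnormalized alignment scores
    # produced by any text/mel cross-attention module (hypothetical here).
    # Normalizing over the text axis gives soft attention weights.
    return F.softmax(energies, dim=1)

def index_mapping(alpha: torch.Tensor) -> torch.Tensor:
    # Expected text index for each output frame: pi_j = sum_i i * alpha[i, j].
    # Because this is a smooth function of the attention weights, gradients
    # flow through the alignment and no external forced aligner is needed.
    text_len = alpha.size(1)
    idx = torch.arange(text_len, dtype=alpha.dtype, device=alpha.device)
    return torch.einsum("btm,t->bm", alpha, idx)  # (batch, mel_len)

def monotonicity_penalty(pi: torch.Tensor) -> torch.Tensor:
    # Penalize any decrease in the index mapping so the alignment is
    # encouraged to move strictly forward through the text.
    deltas = pi[:, 1:] - pi[:, :-1]
    return F.relu(-deltas).mean()

# Toy usage, with random scores standing in for a real attention module.
energies = torch.randn(2, 40, 200, requires_grad=True)
pi = index_mapping(soft_alignment(energies))
loss = monotonicity_penalty(pi)
loss.backward()

The point of a formulation like this is that alignment becomes an ordinary differentiable layer trained jointly with the rest of the network, which is what allows an end-to-end pipeline with no external aligner, as the abstract claims.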

Citation (APA)

Miao, C., Zhu, Q., Chen, M., Ma, J., Wang, S., & Xiao, J. (2024). EfficientTTS 2: Variational End-to-End Text-to-Speech Synthesis and Voice Conversion. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32, 1650–1661. https://doi.org/10.1109/TASLP.2024.3369528
