Abstract
State-of-the-art Variational Auto-Encoders (VAEs) for learning disentangled latent representations give impressive results in discovering features like pitch, pause duration, and accent in speech data, leading to highly controllable text-to-speech (TTS) synthesis. However, these LSTM-based VAEs fail to learn latent clusters of speaker attributes when trained on limited or noisy datasets. Further, different latent variables are found to encode the same features, limiting the control and expressiveness during speech synthesis. To resolve these issues, we propose REMMI (Reordered transformer Encoder with Minimal Mutual Information) where we minimize the mutual information between different latent variables and devise a modified Transformer architecture with layer reordering to learn controllable latent representations in speech data. We show that REMMI reduces the cluster overlap of speaker attributes by at least 30% over LSTM-VAE.
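The abstract does not specify how REMMI estimates the mutual information it minimizes; as a minimal sketch, one common proxy assumes the latent variables are jointly Gaussian, in which case the MI between two latent groups can be computed in closed form from their covariance matrices. The function and variable names below are illustrative, not from the paper:

```python
import numpy as np

def gaussian_mi(z1, z2):
    """Estimate I(z1; z2) from samples under a joint-Gaussian assumption:
    I = 0.5 * (log det(S1) + log det(S2) - log det(S_joint)).
    This is a simplifying proxy, not necessarily the estimator used in REMMI."""
    z = np.concatenate([z1, z2], axis=1)
    s_joint = np.cov(z, rowvar=False)
    d1 = z1.shape[1]
    s1 = s_joint[:d1, :d1]   # marginal covariance of the first latent group
    s2 = s_joint[d1:, d1:]   # marginal covariance of the second latent group
    _, logdet_joint = np.linalg.slogdet(s_joint)
    _, logdet1 = np.linalg.slogdet(s1)
    _, logdet2 = np.linalg.slogdet(s2)
    return 0.5 * (logdet1 + logdet2 - logdet_joint)

rng = np.random.default_rng(0)
# Independent latent groups: MI estimate should be near zero.
mi_indep = gaussian_mi(rng.normal(size=(5000, 2)), rng.normal(size=(5000, 2)))
# Entangled groups (second is a noisy copy of the first): MI should be large.
base = rng.normal(size=(5000, 2))
mi_entangled = gaussian_mi(base, base + 0.1 * rng.normal(size=(5000, 2)))
print(mi_indep, mi_entangled)
```

A term like `gaussian_mi` can be added to the VAE training loss so that gradient descent drives the two latent groups toward statistical independence, which is the stated goal of preventing different latent variables from encoding the same speech feature.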
Citation
Kumar, S., Pradeep, J., & Zaidi, H. (2021). Learning Robust Latent Representations for Controllable Speech Synthesis. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 (pp. 3562–3575). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.findings-acl.312