Learning Robust Latent Representations for Controllable Speech Synthesis


Abstract

State-of-the-art Variational Auto-Encoders (VAEs) for learning disentangled latent representations give impressive results in discovering features like pitch, pause duration, and accent in speech data, leading to highly controllable text-to-speech (TTS) synthesis. However, these LSTM-based VAEs fail to learn latent clusters of speaker attributes when trained on limited or noisy datasets. Further, different latent variables are found to encode the same features, limiting the control and expressiveness during speech synthesis. To resolve these issues, we propose REMMI (Reordered transformer Encoder with Minimal Mutual Information) where we minimize the mutual information between different latent variables and devise a modified Transformer architecture with layer reordering to learn controllable latent representations in speech data. We show that REMMI reduces the cluster overlap of speaker attributes by at least 30% over LSTM-VAE.
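The core idea of penalizing redundancy between latent variables can be illustrated with a toy sketch. The snippet below is not the paper's actual REMMI objective; it uses a simple off-diagonal covariance penalty on a batch of latent codes as a crude, assumed stand-in for mutual-information minimization (zero covariance is a necessary, though not sufficient, condition for independence of Gaussian latents):

```python
import numpy as np

def decorrelation_penalty(z):
    """Crude stand-in for an MI-minimization term: penalize the
    off-diagonal entries of the batch covariance of the latent
    codes z (shape: [batch, latent_dim]), discouraging different
    latent variables from encoding the same feature."""
    z_centered = z - z.mean(axis=0, keepdims=True)
    cov = z_centered.T @ z_centered / (len(z) - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return np.sum(off_diag ** 2)

rng = np.random.default_rng(0)
# Independent latents: off-diagonal covariance is near zero.
independent = rng.normal(size=(512, 4))
# Redundant latents (all four dimensions copy one variable):
# large off-diagonal covariance, large penalty.
duplicated = np.repeat(rng.normal(size=(512, 1)), 4, axis=1)
print(decorrelation_penalty(independent) < decorrelation_penalty(duplicated))
```

In a real training loop such a term would be added, suitably weighted, to the VAE's reconstruction and KL losses; the paper's actual estimator and Transformer layer-reordering details are in the full text.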

Citation (APA)

Kumar, S., Pradeep, J., & Zaidi, H. (2021). Learning Robust Latent Representations for Controllable Speech Synthesis. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 (pp. 3562–3575). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.findings-acl.312
