Abstract
State-of-the-art Variational Auto-Encoders (VAEs) for learning disentangled latent representations give impressive results in discovering features like pitch, pause duration, and accent in speech data, leading to highly controllable text-to-speech (TTS) synthesis. However, these LSTM-based VAEs fail to learn latent clusters of speaker attributes when trained on limited or noisy datasets. Further, different latent variables are found to encode the same features, limiting the control and expressiveness during speech synthesis. To resolve these issues, we propose REMMI (Reordered transformer Encoder with Minimal Mutual Information) where we minimize the mutual information between different latent variables and devise a modified Transformer architecture with layer reordering to learn controllable latent representations in speech data. We show that REMMI reduces the cluster overlap of speaker attributes by at least 30% over LSTM-VAE.
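The abstract does not specify how REMMI estimates the mutual information it minimizes; as a minimal sketch, one common proxy assumes the latent variables are jointly Gaussian, in which case the MI between two latent groups can be computed in closed form from their covariance matrices. The function and variable names below are illustrative, not from the paper:

```python
import numpy as np

def gaussian_mi(z1, z2):
    """Estimate I(z1; z2) from samples under a joint-Gaussian assumption:
    I = 0.5 * (log det(S1) + log det(S2) - log det(S_joint)).
    This is a simplifying proxy, not necessarily the estimator used in REMMI."""
    z = np.concatenate([z1, z2], axis=1)
    s_joint = np.cov(z, rowvar=False)
    d1 = z1.shape[1]
    s1 = s_joint[:d1, :d1]   # marginal covariance of the first latent group
    s2 = s_joint[d1:, d1:]   # marginal covariance of the second latent group
    _, logdet_joint = np.linalg.slogdet(s_joint)
    _, logdet1 = np.linalg.slogdet(s1)
    _, logdet2 = np.linalg.slogdet(s2)
    return 0.5 * (logdet1 + logdet2 - logdet_joint)

rng = np.random.default_rng(0)
# Independent latent groups: MI estimate should be near zero.
mi_indep = gaussian_mi(rng.normal(size=(5000, 2)), rng.normal(size=(5000, 2)))
# Entangled groups (second is a noisy copy of the first): MI should be large.
base = rng.normal(size=(5000, 2))
mi_entangled = gaussian_mi(base, base + 0.1 * rng.normal(size=(5000, 2)))
print(mi_indep, mi_entangled)
```

A term like `gaussian_mi` can be added to the VAE training loss so that gradient descent drives the two latent groups toward statistical independence, which is the stated goal of preventing different latent variables from encoding the same speech feature.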
Citation
Kumar, S., Pradeep, J., & Zaidi, H. (2021). Learning Robust Latent Representations for Controllable Speech Synthesis. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 (pp. 3562–3575). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.findings-acl.312