Modeling Prosodic Phrasing with Multi-Task Learning in Tacotron-Based TTS

Abstract

Tacotron-based end-to-end speech synthesis has shown remarkable voice quality. However, the rendering of prosody in the synthesized speech remains to be improved, especially for long sentences, where prosodic phrasing errors can occur frequently. In this letter, we extend the Tacotron-based speech synthesis framework to explicitly model prosodic phrase breaks. We propose a multi-task learning scheme for Tacotron training that optimizes the system to predict both the Mel spectrum and phrase breaks. To the best of our knowledge, this is the first implementation of multi-task learning for Tacotron-based TTS with a prosodic phrasing model. Experiments show that our proposed training scheme consistently improves the voice quality for both Chinese and Mongolian systems.
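As a rough illustration of what a multi-task objective of this kind might look like, the sketch below adds a phrase-break classification term to a Mel-spectrum reconstruction term. The abstract does not specify the loss functions or their weighting, so the choice of mean squared error, cross-entropy, the `weight` parameter, and all function names here are assumptions made for illustration only, not the authors' implementation.

```python
import math

def mel_l2_loss(pred, target):
    """Mean squared error over all frames and Mel bins (assumed loss)."""
    total, count = 0.0, 0
    for p_frame, t_frame in zip(pred, target):
        for p, t in zip(p_frame, t_frame):
            total += (p - t) ** 2
            count += 1
    return total / count

def break_cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true break class per token (assumed loss)."""
    eps = 1e-12  # guards against log(0)
    nll = sum(-math.log(p[y] + eps) for p, y in zip(probs, labels))
    return nll / len(labels)

def multi_task_loss(mel_pred, mel_target, break_probs, break_labels, weight=0.5):
    """Weighted sum of the two task losses; `weight` is a hypothetical hyperparameter."""
    return (mel_l2_loss(mel_pred, mel_target)
            + weight * break_cross_entropy(break_probs, break_labels))

# Toy inputs: two Mel frames of two bins each, and two tokens with
# break / no-break probabilities (class 0 = no break, class 1 = break).
mel_pred = [[0.0, 1.0], [2.0, 3.0]]
mel_target = [[0.0, 1.0], [2.0, 5.0]]
break_probs = [[0.5, 0.5], [1.0, 0.0]]
break_labels = [0, 0]
loss = multi_task_loss(mel_pred, mel_target, break_probs, break_labels)
```

In a real system both terms would be computed on network outputs and minimized jointly by gradient descent, so that the shared encoder receives a training signal from the phrase-break task as well as from spectrum prediction.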

Cite

Liu, R., Sisman, B., Bao, F., Gao, G., & Li, H. (2020). Modeling Prosodic Phrasing with Multi-Task Learning in Tacotron-Based TTS. IEEE Signal Processing Letters, 27, 1470–1474. https://doi.org/10.1109/LSP.2020.3016564