Hierarchical transfer learning for multilingual, multi-speaker, and style transfer DNN-based TTS on low-resource languages

Kurniawati Azizah; Mirna Adriani; Wisnu Jatmiko

Journal Article (Open Access)

IEEE Access (2020) 8 179798-179812

DOI: 10.1109/ACCESS.2020.3027619

Abstract

This work applies hierarchical transfer learning to implement deep neural network (DNN)-based multilingual text-to-speech (TTS) for low-resource languages. DNN-based systems typically require a large amount of training data. In recent years, DNN-based TTS has achieved remarkable results for high-resource languages, but it still suffers from data scarcity for low-resource languages. In this article, we propose a multi-stage transfer learning strategy to train our TTS model for low-resource languages. We make use of a high-resource language and a joint multilingual dataset of low-resource languages. A monolingual TTS pre-trained on the high-resource language is fine-tuned on the low-resource language using the same model architecture. Then, we apply partial network-based transfer learning from the pre-trained monolingual TTS to a multilingual TTS, and finally from the pre-trained multilingual TTS to a multilingual TTS with style transfer. Our experiments on the Indonesian, Javanese, and Sundanese languages show adequate quality of synthesized speech. Our multilingual TTS reaches a mean opinion score (MOS) of 4.35 for Indonesian (ground truth = 4.36), 4.20 for Javanese (ground truth = 4.38), and 4.28 for Sundanese (ground truth = 4.20). For parallel style transfer evaluation, our TTS model reaches an F0 frame error (FFE) of 9.08%, 10.13%, and 8.43% for Indonesian, Javanese, and Sundanese, respectively. The results indicate that the proposed strategy can be effectively applied to the low-resource target domain. With a small amount of training data, our models are able to learn step by step from a smaller TTS network to larger networks, produce intelligible speech approaching the real human voice, and successfully transfer speaking style from a reference audio.
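For readers unfamiliar with the F0 frame error (FFE) reported above, the following is a minimal sketch of how the metric is commonly defined: the fraction of frames that contain either a voicing decision error or a gross pitch error. The function name `f0_frame_error`, the convention that an F0 of 0 marks an unvoiced frame, and the 20% deviation threshold are assumptions made for illustration. The abstract does not state the authors' pitch extractor, frame alignment, or threshold, so this is not a reproduction of their evaluation code.

```python
from typing import Sequence


def f0_frame_error(
    ref_f0: Sequence[float],
    syn_f0: Sequence[float],
    tolerance: float = 0.2,  # assumed gross-pitch-error threshold (20%)
) -> float:
    """Return the fraction of frames with a voicing or gross pitch error.

    Both inputs are frame-aligned F0 tracks in Hz. A value of 0 is taken
    to mean "unvoiced" (an assumption of this sketch).
    """
    if len(ref_f0) != len(syn_f0):
        raise ValueError("F0 tracks must be frame-aligned and equal in length")
    if len(ref_f0) == 0:
        raise ValueError("F0 tracks must not be empty")

    errors = 0
    for ref, syn in zip(ref_f0, syn_f0):
        ref_voiced = ref > 0
        syn_voiced = syn > 0
        if ref_voiced != syn_voiced:
            # Voicing decision error: one track is voiced, the other is not.
            errors += 1
        elif ref_voiced and abs(syn - ref) > tolerance * ref:
            # Gross pitch error: both voiced, but F0 deviates by more than
            # the tolerance relative to the reference.
            errors += 1
    return errors / len(ref_f0)


# Toy, made-up F0 tracks (not data from the article).
reference = [0.0, 100.0, 110.0, 120.0, 0.0]
synthesized = [0.0, 102.0, 140.0, 0.0, 0.0]
ffe = f0_frame_error(reference, synthesized)
print(f"FFE = {ffe:.2%}")
```

In the toy example, the third frame is a gross pitch error and the fourth is a voicing error, so two of the five frames count as errors. Lower values indicate that the synthesized pitch contour follows the reference audio more closely.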

Cite

Azizah, K., Adriani, M., & Jatmiko, W. (2020). Hierarchical transfer learning for multilingual, multi-speaker, and style transfer DNN-based TTS on low-resource languages. IEEE Access, 8, 179798–179812. https://doi.org/10.1109/ACCESS.2020.3027619
