Investigation of Japanese PnG BERT Language Model in Text-to-Speech Synthesis for Pitch Accent Language

Abstract

End-to-end text-to-speech synthesis (TTS) can generate highly natural synthetic speech from raw text. However, rendering correct pitch accents remains a challenging problem for end-to-end TTS. To tackle this challenge in Japanese end-to-end TTS, we adopt PnG BERT, a self-supervised model pretrained in the character and phoneme domain, for TTS. We investigate the effects of the features captured by PnG BERT on Japanese TTS by modifying the fine-tuning condition to determine which conditions are helpful for inferring pitch accents. We shift the content of the PnG BERT features from text-oriented to speech-oriented by changing the number of layers fine-tuned during TTS training. In addition, we teach PnG BERT pitch accent information by fine-tuning it with tone prediction as an additional downstream task. Our experimental results show that the features PnG BERT captures during pretraining contain information helpful for inferring pitch accents, and that PnG BERT outperforms a baseline Tacotron on accent correctness in a listening test.
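To make the two fine-tuning manipulations described above concrete, here is a minimal PyTorch-style sketch. It is not the authors' implementation; the module structure, parameter names, and dimensions are hypothetical. It illustrates (a) freezing the lower encoder layers so that only the top layers are fine-tuned during TTS training, and (b) attaching a tone-prediction head as an additional downstream task.

```python
import torch
import torch.nn as nn

class PnGBertForTTS(nn.Module):
    """Sketch of a PnG-BERT-style encoder reused as a TTS front end.

    Hypothetical names and sizes; the real PnG BERT consumes a
    concatenated phoneme + grapheme token sequence.
    """

    def __init__(self, vocab_size=512, hidden=768, num_layers=12,
                 num_tones=3, num_frozen_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        block = nn.TransformerEncoderLayer(d_model=hidden, nhead=12,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=num_layers)
        # Auxiliary head for the tone-prediction downstream task.
        self.tone_head = nn.Linear(hidden, num_tones)

        # Freeze the lower layers. Freezing more layers keeps the
        # pretrained features text-oriented; fine-tuning more layers
        # lets them become speech-oriented.
        for blk in self.encoder.layers[:num_frozen_layers]:
            for p in blk.parameters():
                p.requires_grad = False

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))
        tone_logits = self.tone_head(h)  # per-token tone predictions
        return h, tone_logits            # h conditions the TTS decoder
```

In the paper's terms, `num_frozen_layers` controls how text- or speech-oriented the fine-tuned features become, and the tone logits would be trained with a cross-entropy loss against tone labels jointly with the TTS objective.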

Citation (APA)

Yasuda, Y., & Toda, T. (2022). Investigation of Japanese PnG BERT Language Model in Text-to-Speech Synthesis for Pitch Accent Language. IEEE Journal on Selected Topics in Signal Processing, 16(6), 1319–1328. https://doi.org/10.1109/JSTSP.2022.3190672
