Pre-training Techniques for Improving Text-to-Speech Synthesis by Automatic Speech Recognition Based Data Enhancement

Yazhu Liu; Shaofei Xue; Jian Tang

Conference Proceedings

Communications in Computer and Information Science (2023) 1765 CCIS 162-172

DOI: 10.1007/978-981-99-2401-1_15

Abstract

With the development of deep learning, neural network (NN) based text-to-speech (TTS), which adopts deep neural networks as the model backbone for speech synthesis, has become the mainstream technology for TTS. Compared to previous TTS systems based on concatenative synthesis and statistical parametric synthesis, NN based speech synthesis shows conspicuous advantages. It requires less human pre-processing and feature development, and it produces high-quality voice in terms of both intelligibility and naturalness. However, a robust NN based speech synthesis model typically requires a sizable set of high-quality data for training, which is expensive to collect, especially in low-resource scenarios. It is worth investigating how to take advantage of low-quality material such as automatic speech recognition (ASR) data, which can be obtained more easily than high-quality TTS material. In this paper, we propose a pre-training framework to improve the performance of low-resource speech synthesis. The idea is to extend the training material of the TTS model by using an ASR based data augmentation method. Specifically, we first build a frame-wise phoneme classification network on the ASR dataset and extract semi-supervised paired data from large-scale speech corpora. We then pre-train the NN based TTS acoustic model using the semi-supervised pairs. Finally, we fine-tune the model with a small amount of available paired data. Experimental results show that our proposed framework enables the TTS model to generate more intelligible and natural speech with the same amount of paired training data.
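
The abstract does not say how frame-wise phoneme predictions are turned into paired data, so the following is only a minimal sketch of one plausible step, under stated assumptions: the classifier emits one posterior vector per frame, decoding is a plain per-frame argmax, and consecutive identical labels are merged into phoneme segments with durations. The function name `frames_to_phonemes`, the toy phoneme inventory, and the posterior values are all hypothetical and are not taken from the paper.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def frames_to_phonemes(
    posteriors: Sequence[Sequence[float]],
    phoneme_set: Sequence[str],
) -> List[Tuple[str, int]]:
    """Collapse frame-wise posteriors into (phoneme, n_frames) segments.

    Assumption: argmax decoding per frame; the paper may use a
    different decoding or filtering scheme.
    """
    # Pick the most likely phoneme index for every frame.
    frame_labels = [
        phoneme_set[max(range(len(frame)), key=frame.__getitem__)]
        for frame in posteriors
    ]
    # Merge runs of identical labels and count the frames in each run.
    return [(label, len(list(run))) for label, run in groupby(frame_labels)]


# Toy example: 5 frames over a hypothetical 3-phoneme inventory.
toy_phonemes = ["sil", "a", "b"]
toy_posteriors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
    [0.0, 0.2, 0.8],
]
pseudo_labels = frames_to_phonemes(toy_posteriors, toy_phonemes)
print(pseudo_labels)
```

In a pipeline of the kind the abstract outlines, each audio clip paired with pseudo-labels like these would serve as a semi-supervised training pair for pre-training the acoustic model, before fine-tuning on the small set of genuine paired data.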

Cite

Liu, Y., Xue, S., & Tang, J. (2023). Pre-training Techniques for Improving Text-to-Speech Synthesis by Automatic Speech Recognition Based Data Enhancement. In Communications in Computer and Information Science (Vol. 1765 CCIS, pp. 162–172). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-981-99-2401-1_15
