Naturalness Improvement of Vietnamese Text-to-Speech System Using Diffusion Probabilistic Modelling and Unsupervised Data Enrichment

Tung Tran; Tuan Nguyen; Hung Bui; Khuong Nguyen; Nghia Gia Vo; Tran Vu Pham; Tho Quan

Book Chapter

Springer Science and Business Media Deutschland GmbH, (2022), 376-387

DOI: 10.1007/978-3-031-15063-0_36

Abstract

Speech synthesis, which aims to generate natural and comprehensible speech from input text, is a popular research topic with a wide range of industrial applications. However, it remains a difficult problem because of its strong dependency on data, particularly for accent-sensitive and multi-dialect languages such as Vietnamese. Perhaps the most common model applied in this area is Tacotron 2, which uses Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) architectures. Still, Tacotron 2 has not yet achieved the expected naturalness, possibly because it is not sophisticated enough to capture the natural expression of the human voice. Moreover, for a low-resource language like Vietnamese, collecting a sufficient training dataset for this task is also a non-trivial problem. Hence, in this paper we propose an end-to-end framework with Grad-TTS, a denoising diffusion probabilistic model, as the acoustic model in the text-to-speech (TTS) system instead of the traditional approach employed by Tacotron 2. The proposed approach helps us achieve more natural synthesized speech, as shown in the experiments. Furthermore, we introduce an unsupervised approach to collect Vietnamese data from Internet resources and to pre-process the data before training. This helps address the lack of Vietnamese data and enhances our outcome. We released the dataset for further development of TTS systems for Vietnamese at: https://bit.ly/3rnNsFi.

Citation (APA)

Tran, T., Nguyen, T., Bui, H., Nguyen, K., Vo, N. G., Pham, T. V., & Quan, T. (2022). Naturalness Improvement of Vietnamese Text-to-Speech System Using Diffusion Probabilistic Modelling and Unsupervised Data Enrichment. In Lecture Notes on Data Engineering and Communications Technologies (Vol. 148, pp. 376–387). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-15063-0_36
