Speech synthesis, which aims to generate natural and comprehensible speech from input text, is a popular research topic with a wide range of industrial applications. However, it remains a difficult problem because of its strong dependency on data, particularly for accent-sensitive and multi-dialect languages such as Vietnamese. Perhaps the most common model applied in this area is Tacotron 2, which uses Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) architectures. Still, Tacotron 2 has not yet achieved the expected naturalness, possibly because it is not sophisticated enough to capture the natural expression of the human voice. Moreover, for a low-resource language like Vietnamese, collecting a sufficient training dataset for this task is also a non-trivial problem. Hence, in this paper we propose an end-to-end framework that uses Grad-TTS, a denoising diffusion probabilistic model, as the acoustic model in the text-to-speech (TTS) system instead of the traditional approach employed by Tacotron 2. The proposed approach helps us achieve more natural synthesized speech, as shown in the experiments. Furthermore, we introduce an unsupervised approach to collect Vietnamese data from Internet resources and to pre-process the data before training. This helps address the shortage of Vietnamese data and improves our results. We have released the dataset for further development of Vietnamese TTS systems at: https://bit.ly/3rnNsFi.
CITATION STYLE
Tran, T., Nguyen, T., Bui, H., Nguyen, K., Vo, N. G., Pham, T. V., & Quan, T. (2022). Naturalness Improvement of Vietnamese Text-to-Speech System Using Diffusion Probabilistic Modelling and Unsupervised Data Enrichment. In Lecture Notes on Data Engineering and Communications Technologies (Vol. 148, pp. 376–387). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-15063-0_36