In recent years, Text-To-Speech (TTS) technology has developed rapidly, and increasing attention has been paid to narrowing the gap between synthetic and real speech, with the goal of giving synthesized speech natural rhythm. This paper proposes a rhythmic feature embedding method for TTS based on the Tacotron2 model, which has emerged in the TTS field in recent years. First, rhythmic features are extracted with the WORLD vocoder, which reduces redundant information in the rhythmic features. Then, the rhythmic features are fused through a Variational Auto-Encoder (VAE) network, which enhances the rhythmic information. Experiments are carried out on the LJSpeech-1.0 data set, and the synthesized speech is evaluated both subjectively and objectively. Compared with the reference method from the literature, the score on the subjective blind listening test (ABX) increased by 25%, and the objective Mel Cepstral Distortion (MCD) value declined to 12.77.
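For readers unfamiliar with the objective metric, the following is a minimal sketch of how Mel Cepstral Distortion is commonly computed, not the authors' evaluation code. It assumes the widely used definition (per-frame distortion in dB, averaged over frames), that the two sequences are already time-aligned, and that the 0th (energy) coefficient has been dropped; the abstract does not state which coefficient count or alignment procedure was used. The function name `mel_cepstral_distortion` is hypothetical.

```python
import numpy as np


def mel_cepstral_distortion(ref, syn):
    """Mean MCD in dB between two time-aligned mel-cepstral sequences.

    ref, syn: array-likes of shape (frames, coeffs).
    Assumption: frames are already aligned and the 0th (energy)
    coefficient has been removed by the caller.
    """
    ref = np.asarray(ref, dtype=float)
    syn = np.asarray(syn, dtype=float)
    if ref.ndim != 2 or ref.shape != syn.shape:
        raise ValueError("ref and syn must be 2-D arrays of identical shape")

    # Per-frame distortion: (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2)
    diff = ref - syn
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))

    # Average over all frames to get a single score for the utterance.
    return float(np.mean(per_frame))
```

Lower values indicate that the synthesized cepstra are closer to the reference; identical sequences give a distortion of 0.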
CITATION
Wu, T., Zhao, L., & Zhang, Q. (2020). Research on Speech Synthesis Technology Based on Rhythm Embedding. In Journal of Physics: Conference Series (Vol. 1693). IOP Publishing Ltd. https://doi.org/10.1088/1742-6596/1693/1/012127