Building Mongolian TTS Front-End with Encoder-Decoder Model by Using Bridge Method and Multi-view Features

Abstract

In text-to-speech (TTS) systems, the front-end is a critical step that extracts linguistic features from the input text. In this paper, we propose a Mongolian TTS front-end that jointly trains grapheme-to-phoneme conversion (G2P) and phrase break prediction (PB). We use a bidirectional long short-term memory (LSTM) network as the encoder and build two decoders, one for G2P and one for PB, that share the same encoder. Meanwhile, we feed the source input features together with the encoder hidden states into the decoders, aiming to shorten the distance between the source and target sequences and to learn the alignment information better. More importantly, to obtain a robust representation for Mongolian words, which are agglutinative in nature and lack a sufficient training corpus, we design specific multi-view input features for them. Our subjective and objective experiments demonstrate the effectiveness of this proposal.
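The "bridge" described above can be sketched in a few lines: each decoder step receives the source input feature concatenated with the encoder hidden state at the same position, rather than the hidden state alone. The sketch below is purely illustrative — the stand-in `encode` function (a neighbour average in place of a real bidirectional LSTM) and all names and dimensions are hypothetical, not taken from the paper.

```python
def encode(source_embeddings):
    """Stand-in for a bidirectional LSTM encoder: averages each
    embedding with its neighbours to mimic contextual hidden states."""
    n = len(source_embeddings)
    hidden = []
    for i, vec in enumerate(source_embeddings):
        left = source_embeddings[i - 1] if i > 0 else vec
        right = source_embeddings[i + 1] if i < n - 1 else vec
        hidden.append([(l + v + r) / 3.0 for l, v, r in zip(left, vec, right)])
    return hidden

def bridge(source_embeddings, hidden_states):
    """Concatenate each source feature with its encoder hidden state,
    so both decoders (G2P and PB) see the raw input and the context."""
    return [e + h for e, h in zip(source_embeddings, hidden_states)]

# Three source positions with 2-dimensional toy features.
source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_inputs = bridge(source, encode(source))
print(len(decoder_inputs), len(decoder_inputs[0]))  # 3 positions, 2+2 dims each
```

In a full model the concatenated vectors would feed two separate attention-based decoders over the shared encoder; the point of the bridge is simply that the source features are not lost behind the encoder's hidden representation.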

Citation (APA)

Liu, R., Bao, F., & Gao, G. (2019). Building Mongolian TTS Front-End with Encoder-Decoder Model by Using Bridge Method and Multi-view Features. In Communications in Computer and Information Science (Vol. 1143 CCIS, pp. 642–651). Springer. https://doi.org/10.1007/978-3-030-36802-9_68
