LSTM-Based Speech Segmentation for TTS Synthesis


Abstract

This paper describes experiments on speech segmentation for the purposes of text-to-speech synthesis. We used a bidirectional LSTM neural network for framewise phone classification and another bidirectional LSTM network for predicting the duration of particular phones. The proposed segmentation procedure combines both outputs and finds the optimal speech-phoneme alignment by using the dynamic programming approach. We introduced two modifications to increase the robustness of phoneme classification. Experiments were performed on 2 professional voices and 2 amateur voices. A comparison with a reference HMM-based segmentation with additional manual corrections was performed. Preference listening tests showed that the reference and experimental segmentation are equivalent when used in a unit selection TTS system.
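The alignment step described above — combining framewise phone posteriors with predicted phone durations and searching for the optimal speech-phoneme alignment by dynamic programming — can be sketched as follows. This is a simplified illustration, not the authors' implementation: the function and parameter names (`align`, `dur_weight`, the absolute-difference duration penalty) are assumptions, and the real system operates on outputs of the two bidirectional LSTM networks.

```python
import numpy as np

def align(log_post, phones, pred_dur, dur_weight=1.0):
    """Forced alignment by dynamic programming (simplified sketch).

    log_post : (T, P) array of framewise log posteriors from the classifier
    phones   : sequence of N phone indices to align, in order
    pred_dur : predicted duration (in frames) for each phone
    Returns the end-frame index of each phone.
    """
    T = log_post.shape[0]
    N = len(phones)
    NEG = -1e9
    # score[n, t] = best score aligning phones[:n+1] to frames[:t+1]
    score = np.full((N, T), NEG)
    back = np.zeros((N, T), dtype=int)
    # cumulative log posteriors give O(1) segment scores
    cum = np.concatenate([np.zeros((1, log_post.shape[1])),
                          np.cumsum(log_post, axis=0)])

    def seg(n, s, e):
        # frames s..e-1 assigned to phone n: acoustic score
        # plus a penalty for deviating from the predicted duration
        acoustic = cum[e, phones[n]] - cum[s, phones[n]]
        dur_pen = -dur_weight * abs((e - s) - pred_dur[n])
        return acoustic + dur_pen

    for t in range(T):
        score[0, t] = seg(0, 0, t + 1)
    for n in range(1, N):
        for t in range(n, T):
            best, arg = NEG, n - 1
            for s in range(n - 1, t):
                v = score[n - 1, s] + seg(n, s + 1, t + 1)
                if v > best:
                    best, arg = v, s
            score[n, t] = best
            back[n, t] = arg

    # backtrack the phone boundaries
    ends = [T - 1]
    for n in range(N - 1, 0, -1):
        ends.append(back[n, ends[-1]])
    return list(reversed(ends))
```

For instance, with six frames whose posteriors favor phone 0 in the first half and phone 1 in the second, and predicted durations of three frames each, the sketch places the boundary after frame 2. A real implementation would work in log probabilities from the softmax layer and typically restrict the search to a window around the expected boundary for efficiency.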

Citation (APA)

Hanzlíček, Z., Vít, J., & Tihelka, D. (2019). LSTM-Based Speech Segmentation for TTS Synthesis. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11697 LNAI, pp. 361–372). Springer Verlag. https://doi.org/10.1007/978-3-030-27947-9_31
