LSTM-Based Speech Segmentation for TTS Synthesis

Zdeněk Hanzlíček; Jakub Vít; Daniel Tihelka

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2019) 11697 LNAI 361-372

DOI: 10.1007/978-3-030-27947-9_31

Abstract

This paper describes experiments on speech segmentation for text-to-speech (TTS) synthesis. We used one bidirectional LSTM neural network for framewise phone classification and a second bidirectional LSTM network for predicting the duration of individual phones. The proposed segmentation procedure combines both outputs and uses dynamic programming to find the optimal alignment between the speech signal and the phoneme sequence. We introduced two modifications to increase the robustness of phoneme classification. Experiments were performed on two professional and two amateur voices, and the results were compared with a reference HMM-based segmentation that included additional manual corrections. Preference listening tests showed that the reference and experimental segmentations are equivalent when used in a unit selection TTS system.
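The idea of combining framewise phone scores with predicted durations through dynamic programming can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the cost function, so the absolute-difference duration penalty, the `dur_weight` parameter, and the function name `align` are all assumptions made for illustration. The inputs stand in for the outputs of the two networks: `log_post[t][k]` is a framewise log score for phone class `k`, and `pred_dur[i]` is the predicted length of the i-th phone in frames.

```python
from typing import List, Sequence


def align(log_post: Sequence[Sequence[float]],
          phones: Sequence[int],
          pred_dur: Sequence[float],
          dur_weight: float = 1.0) -> List[int]:
    """Return boundaries b[0..N]; phone i covers frames b[i] .. b[i+1]-1.

    Hypothetical objective (an assumption, not taken from the paper):
    sum of framewise log scores inside each segment, minus
    dur_weight * |segment length - predicted duration|.
    """
    n_frames, n_phones = len(log_post), len(phones)
    if n_phones == 0 or n_frames < n_phones:
        raise ValueError("need at least one frame per phone")

    # prefix[i][t] = sum of log_post[0..t-1][phones[i]], for O(1) segment sums
    prefix = []
    for p in phones:
        acc = [0.0]
        for t in range(n_frames):
            acc.append(acc[-1] + log_post[t][p])
        prefix.append(acc)

    neg_inf = float("-inf")
    # best[i][t]: best score when the first i phones cover frames 0..t-1
    best = [[neg_inf] * (n_frames + 1) for _ in range(n_phones + 1)]
    back = [[0] * (n_frames + 1) for _ in range(n_phones + 1)]
    best[0][0] = 0.0

    for i in range(1, n_phones + 1):
        # leave at least one frame for every remaining phone
        for end in range(i, n_frames - (n_phones - i) + 1):
            for start in range(i - 1, end):
                if best[i - 1][start] == neg_inf:
                    continue
                acoustic = prefix[i - 1][end] - prefix[i - 1][start]
                penalty = dur_weight * abs((end - start) - pred_dur[i - 1])
                score = best[i - 1][start] + acoustic - penalty
                if score > best[i][end]:
                    best[i][end] = score
                    back[i][end] = start

    # trace the boundaries back from the final frame
    bounds = [n_frames]
    for i in range(n_phones, 0, -1):
        bounds.append(back[i][bounds[-1]])
    bounds.reverse()
    return bounds
```

Calling `align(log_post, phones, pred_dur)` returns one boundary index per phone edge, starting at 0 and ending at the number of frames. This exhaustive search costs on the order of N·T² operations for N phones and T frames, which is acceptable for a sketch; a practical system would likely restrict the range of candidate segment lengths.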

Citation (APA)

Hanzlíček, Z., Vít, J., & Tihelka, D. (2019). LSTM-Based Speech Segmentation for TTS Synthesis. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11697 LNAI, pp. 361–372). Springer Verlag. https://doi.org/10.1007/978-3-030-27947-9_31
