This paper describes experiments on speech segmentation for the purposes of text-to-speech synthesis. We used a bidirectional LSTM neural network for framewise phone classification and another bidirectional LSTM network for predicting the duration of particular phones. The proposed segmentation procedure combines both outputs and finds the optimal speech-phoneme alignment by using the dynamic programming approach. We introduced two modifications to increase the robustness of phoneme classification. Experiments were performed on 2 professional voices and 2 amateur voices. A comparison with a reference HMM-based segmentation with additional manual corrections was performed. Preference listening tests showed that the reference and experimental segmentation are equivalent when used in a unit selection TTS system.
CITATION STYLE
Hanzlíček, Z., Vít, J., & Tihelka, D. (2019). LSTM-Based Speech Segmentation for TTS Synthesis. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11697 LNAI, pp. 361–372). Springer Verlag. https://doi.org/10.1007/978-3-030-27947-9_31
Mendeley helps you to discover research relevant for your work.