DNN-based duration modeling for synthesizing short sentences


Abstract

Statistical parametric speech synthesis conventionally uses decision-tree-clustered, context-dependent hidden Markov models (HMMs) to model speech parameters. However, decision trees cannot capture complex context dependencies and fail to model the interaction between linguistic features. Recently, deep neural networks (DNNs) have been applied to speech synthesis, and they can address some of these limitations. This paper focuses on predicting phone durations in text-to-speech (TTS) systems using feedforward DNNs for short sentences (sentences containing only one, two, or three syllables). To improve prediction accuracy, hyperparameter optimization was carried out with a manual grid search. Recordings from a male and a female speaker were used to train the systems, and the outputs of the various configurations were compared against conventional HMM-based solutions and natural speech. Objective evaluations show that DNNs can outperform previous state-of-the-art solutions in duration modeling.
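The grid-search idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the data is synthetic (random vectors standing in for linguistic context features, with a made-up duration-like target), the network is a single-hidden-layer feedforward regressor written in NumPy, and the grid values and the names `train_mlp`, `rmse`, and `grid_search` are assumptions chosen only for illustration.

```python
import itertools
import numpy as np

def train_mlp(X, y, hidden, lr, epochs=200, seed=0):
    """Train a one-hidden-layer feedforward regressor with full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.1, (X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, (hidden, 1))
    b2 = np.zeros(1)
    n = X.shape[0]
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)               # hidden activations
        err = (h @ W2 + b2).ravel() - y        # prediction error
        g_out = (2.0 / n) * err[:, None]       # gradient of MSE w.r.t. the output
        g_h = (g_out @ W2.T) * (1.0 - h ** 2)  # backpropagate through tanh
        W2 -= lr * (h.T @ g_out)
        b2 -= lr * g_out.sum(axis=0)
        W1 -= lr * (X.T @ g_h)
        b1 -= lr * g_h.sum(axis=0)
    return W1, b1, W2, b2

def rmse(params, X, y):
    """Root mean squared error of the network on (X, y)."""
    W1, b1, W2, b2 = params
    pred = (np.tanh(X @ W1 + b1) @ W2 + b2).ravel()
    return float(np.sqrt(np.mean((pred - y) ** 2)))

def grid_search(X_tr, y_tr, X_val, y_val, grid):
    """Train one model per (hidden, lr) pair and rank them by validation RMSE."""
    results = []
    for hidden, lr in itertools.product(grid["hidden"], grid["lr"]):
        params = train_mlp(X_tr, y_tr, hidden, lr)
        results.append(((hidden, lr), rmse(params, X_val, y_val)))
    best = min(results, key=lambda r: r[1])
    return best, results

# Synthetic stand-in data (assumption): 5 input features, duration-like target.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
y = 0.5 * np.tanh(X[:, 0]) + 0.2 * X[:, 1] * X[:, 2] + 0.05 * rng.normal(size=300)
X_tr, y_tr, X_val, y_val = X[:200], y[:200], X[200:], y[200:]

grid = {"hidden": [4, 16], "lr": [0.01, 0.1]}  # hypothetical grid values
best, results = grid_search(X_tr, y_tr, X_val, y_val, grid)
print("best (hidden, lr):", best[0], "validation RMSE:", best[1])
```

Selecting by error on a held-out validation split, rather than on the training data, keeps the chosen configuration from simply being the one that overfits most. A real duration model would replace the synthetic arrays with encoded linguistic features and measured phone durations.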

Cite

Nagy, P., & Németh, G. (2016). DNN-based duration modeling for synthesizing short sentences. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9811 LNCS, pp. 254–261). Springer Verlag. https://doi.org/10.1007/978-3-319-43958-7_30
