Ensemble deep neural network based waveform-driven stress model for speech synthesis

Abstract

Stress annotations in the training corpus of a speech synthesis system are usually obtained by applying language rules to the transcripts. However, the actual stress patterns observed in the waveform are not guaranteed to be canonical: they can deviate from the locations defined by language rules, mostly due to speaker-dependent factors. Stress models trained on such corpora can therefore be far from perfect. This paper proposes a waveform-based stress annotation technique. One feedforward deep neural network (DNN) was trained per stress class, four in total, to model the fundamental frequency (F0) of speech. During synthesis, stress labels are generated from the textual input, and an ensemble of the four DNNs predicts the F0 trajectories. Objective and subjective evaluations were carried out, and the results show that the proposed method surpasses the quality of vanilla DNN-based F0 models.
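To make the described architecture concrete, the following is a minimal sketch (not the authors' code) of a class-conditioned F0 ensemble: one small feedforward regressor per stress class, trained on frames carrying that class, with synthesis-time frames routed to the DNN of their predicted stress label. The feature dimension, layer sizes, the use of PyTorch, and the routing-by-class combination rule are all assumptions; the abstract does not specify how the four DNN outputs are combined.

# Hypothetical sketch, not the paper's implementation.
import torch
import torch.nn as nn

N_STRESS_CLASSES = 4      # per the paper: four stress classes
FEAT_DIM = 40             # assumed size of the frame-level linguistic feature vector

def make_f0_dnn() -> nn.Module:
    # Simple feedforward regressor; layer sizes are illustrative only.
    return nn.Sequential(
        nn.Linear(FEAT_DIM, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, 1),    # predicted (log-)F0 for the frame
    )

# Ensemble: one DNN per stress class.
ensemble = [make_f0_dnn() for _ in range(N_STRESS_CLASSES)]

def train_class_dnn(model, feats, f0, epochs=10, lr=1e-3):
    """Fit one class-specific DNN on frames annotated with that stress class."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(feats), f0)
        loss.backward()
        opt.step()

def predict_f0(feats, stress_labels):
    """At synthesis time, route each frame to the DNN of its stress class."""
    with torch.no_grad():
        out = torch.zeros(feats.shape[0])
        for c, model in enumerate(ensemble):
            mask = stress_labels == c
            if mask.any():
                out[mask] = model(feats[mask]).squeeze(-1)
    return out

# Toy usage with random data, just to show the shapes involved.
feats = torch.randn(100, FEAT_DIM)
f0 = torch.randn(100, 1)
labels = torch.randint(0, N_STRESS_CLASSES, (100,))
for c, model in enumerate(ensemble):
    m = labels == c
    if m.any():
        train_class_dnn(model, feats[m], f0[m])
print(predict_f0(feats, labels).shape)

An alternative reading of "ensemble" is a weighted average of all four DNN outputs per frame; the routing rule above is only one plausible interpretation.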

Citation (APA)
Tóth, B. P., Kis, K. I., Szaszák, G., & Németh, G. (2016). Ensemble deep neural network based waveform-driven stress model for speech synthesis. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9811 LNCS, pp. 271–278). Springer Verlag. https://doi.org/10.1007/978-3-319-43958-7_32
