Prosodic features control by symbols as input of sequence-to-sequence acoustic modeling for neural TTS

Abstract

This paper describes a method for controlling prosodic features by using phonetic and prosodic symbols as the input to attention-based sequence-to-sequence (seq2seq) acoustic modeling (AM) for neural text-to-speech (TTS). The method inserts a sequence of prosodic symbols between phonetic symbols; the combined sequence is then used to reproduce prosodic acoustic features, i.e., accents, pauses, accent breaks, and sentence endings, in several seq2seq AM methods. The proposed phonetic and prosodic labels have simple descriptions and a low production cost. By contrast, the labels of conventional statistical parametric speech synthesis methods are complicated, and time alignment, such as aligning the boundaries of phonemes, is costly. The proposed method does not need the boundary positions of phonemes. We propose a method for automatically converting conventional labels and show how to automatically reproduce pitch accents and phonemes. The results of objective and subjective evaluations show the effectiveness of our method.
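The idea of inserting prosodic symbols between phonetic symbols can be illustrated with a minimal sketch. The symbol inventory below ("]", "#", "_", ".") and the function name build_input_sequence are hypothetical and chosen only for illustration; the abstract does not give the paper's actual symbols or conversion rules. The sketch only shows that such an input is a flat symbol sequence with no phoneme boundary times.

```python
from typing import List, Optional, Tuple

# Hypothetical prosodic symbols, assumed for illustration only;
# the abstract does not specify the paper's actual inventory.
ACCENT_FALL = "]"    # accent: pitch falls after the preceding phoneme
PHRASE_BREAK = "#"   # accent break between two accent phrases
PAUSE = "_"          # pause between two accent phrases
SENTENCE_END = "."   # sentence ending

# One accent phrase: its phonemes, plus the index of the phoneme after
# which the pitch falls (None if the phrase has no fall).
Phrase = Tuple[List[str], Optional[int]]


def build_input_sequence(
    phrases: List[Phrase],
    pause_after: frozenset = frozenset(),
) -> List[str]:
    """Interleave prosodic symbols with phonetic symbols.

    pause_after holds the indices of phrases that are followed by a pause
    instead of a plain accent break. No time alignment is involved.
    """
    seq: List[str] = []
    last = len(phrases) - 1
    for i, (phonemes, fall_after) in enumerate(phrases):
        for j, phoneme in enumerate(phonemes):
            seq.append(phoneme)
            if fall_after is not None and j == fall_after:
                seq.append(ACCENT_FALL)
        if i == last:
            seq.append(SENTENCE_END)
        elif i in pause_after:
            seq.append(PAUSE)
        else:
            seq.append(PHRASE_BREAK)
    return seq


if __name__ == "__main__":
    example = [
        (["k", "o", "N", "n", "i", "ch", "i", "w", "a"], None),
        (["a", "m", "e"], 0),
    ]
    print(" ".join(build_input_sequence(example)))
```

A sequence built this way would be passed to the seq2seq acoustic model's text encoder in place of a plain phoneme sequence. How the symbols are embedded and which accent conventions apply are left open here.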

Cite

Kurihara, K., Seiyama, N., & Kumano, T. (2021). Prosodic features control by symbols as input of sequence-to-sequence acoustic modeling for neural TTS. IEICE Transactions on Information and Systems, E104D(2), 302–311. https://doi.org/10.1587/transinf.2020EDP7104
