This paper aims to improve the naturalness of speech synthesized by a text-to-speech (TTS) system within a spoken dialogue system, specifically with respect to how naturally the system’s intention is perceived through the synthesized speech. We call this measure “illocutionary act naturalness”. To this end, we propose using dialogue-act (DA) information as an auxiliary feature for a deep neural network (DNN)-based speech synthesis system. First, we construct a speech database annotated with DA tags. Second, we build the proposed DNN-based speech synthesis system from this database. We then evaluate the proposed method against two conventional hidden Markov model (HMM)-based speech synthesis systems: the style-mixed modeling method and the style adaptation method. The objective evaluation results show that the proposed method outperforms the style-mixed modeling method in reproducing the global prosodic characteristics of dialogue acts, and outperforms the style adaptation method in reproducing their sentence-final tone characteristics. The subjective evaluation results show that the proposed method also improves illocutionary act naturalness compared with both conventional methods.
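As a minimal sketch of what "DA information as an auxiliary feature" could look like at the input of a DNN acoustic model, the example below appends a one-hot DA code to each frame-level linguistic feature vector. The tag inventory, the one-hot encoding, the feature values, and the function names are all assumptions made for illustration; the abstract does not specify how the authors encode DA tags or which tags they use.

```python
from typing import List, Sequence

# Hypothetical dialogue-act inventory (assumption: the paper's tag set is not given here).
DA_TAGS = ["greeting", "question", "agreement", "apology"]


def one_hot(tag: str, tags: Sequence[str] = DA_TAGS) -> List[float]:
    """Return a one-hot vector for a dialogue-act tag."""
    if tag not in tags:
        raise ValueError(f"unknown dialogue-act tag: {tag}")
    return [1.0 if t == tag else 0.0 for t in tags]


def add_da_feature(frames: Sequence[Sequence[float]], tag: str) -> List[List[float]]:
    """Append the utterance-level DA code to every frame-level feature vector."""
    code = one_hot(tag)
    return [list(frame) + code for frame in frames]


# Two toy frames with three linguistic features each (made-up values).
frames = [[0.1, 0.0, 1.0], [0.3, 1.0, 0.0]]
augmented = add_da_feature(frames, "question")
print(augmented)
```

Each augmented vector would then be fed to the network in place of the plain linguistic features, so that one model can condition its prosody on the dialogue act of the utterance.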
Hojo, N., Ijima, Y., Sugiyama, H., Miyazaki, N., Kawanishi, T., & Kashino, K. (2020). DNN-based speech synthesis using dialogue-act information and its evaluation with respect to illocutionary act naturalness. Transactions of the Japanese Society for Artificial Intelligence, 35(2), 1–17. https://doi.org/10.1527/tjsai.A-J81