MIST-Tacotron: End-to-End Emotional Speech Synthesis Using Mel-Spectrogram Image Style Transfer

Abstract

With advances in deep-learning-based speech synthesis, research on synthesizing speech that expresses a speaker's characteristics and emotions is being actively pursued. Current technology, however, does not satisfactorily express varied emotions and characteristics for speakers with very low or very high vocal ranges, or for speakers with dialects. In this paper, we propose mel-spectrogram image style transfer (MIST)-Tacotron, a Tacotron 2-based speech synthesis model that adds a reference encoder with an image style transfer module. The proposed method adds image style transfer to the existing Tacotron 2 model and extracts the speaker's features from a reference mel-spectrogram using a pre-trained deep learning model. From the extracted features, style attributes such as the speaker's pitch, tone, and duration are learned so that the speaker's style and emotion are expressed more clearly. To extract the speaker's style independently from the speaker's timbre and emotion, a speaker ID and an emotional-state ID are used as inputs. Performance is evaluated by F0 voiced error (FVE), F0 gross pitch error (F0 GPE), mel-cepstral distortion (MCD), band aperiodicity distortion (BAPD), voiced/unvoiced error (VUVE), false positive rate (FPR), and false negative rate (FNR). The proposed model showed lower error values than the existing models GST (Global Style Token) Tacotron and VAE (Variational Autoencoder) Tacotron. In mean opinion score (MOS) evaluation, the speech produced by the proposed model received the highest scores for emotional expression and reflection of speaker style.
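As a rough illustration of how some of the objective metrics named in the abstract are commonly computed, the following is a minimal sketch and not the authors' evaluation code. It assumes the reference and synthesized features are already time-aligned frame by frame, that unvoiced frames are marked with an F0 of 0, and that a gross pitch error is a deviation of more than 20% from the reference F0. The function names, the `tol` parameter, and these conventions are assumptions for illustration; the definitions used in the paper may differ.

```python
import numpy as np

def mel_cepstral_distortion(ref_mcep, syn_mcep):
    """Mean MCD in dB over aligned frames (rows), skipping the 0th (energy) coefficient."""
    diff = ref_mcep[:, 1:] - syn_mcep[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

def pitch_and_voicing_errors(ref_f0, syn_f0, tol=0.2):
    """GPE, voicing decision error, FPR, and FNR from aligned F0 tracks (0 = unvoiced)."""
    ref_v = ref_f0 > 0  # reference voicing decision per frame
    syn_v = syn_f0 > 0  # synthesized voicing decision per frame
    both = ref_v & syn_v
    # Gross pitch error: frames voiced in both tracks whose F0 deviates by more than tol.
    gross = np.abs(syn_f0[both] - ref_f0[both]) > tol * ref_f0[both]
    return {
        "gpe": float(np.mean(gross)) if both.any() else 0.0,
        # Fraction of all frames where the voiced/unvoiced decision differs.
        "vuve": float(np.mean(ref_v != syn_v)),
        # Unvoiced reference frames synthesized as voiced.
        "fpr": float(np.sum(~ref_v & syn_v) / max(np.sum(~ref_v), 1)),
        # Voiced reference frames synthesized as unvoiced.
        "fnr": float(np.sum(ref_v & ~syn_v) / max(np.sum(ref_v), 1)),
    }

# Toy usage with made-up values (not data from the paper).
ref_f0 = np.array([0.0, 100.0, 200.0, 0.0])
syn_f0 = np.array([0.0, 130.0, 0.0, 150.0])
errors = pitch_and_voicing_errors(ref_f0, syn_f0)
```

In practice the two utterances rarely have the same length, so an alignment step such as dynamic time warping would normally precede these calculations; that step is omitted here.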

Moon, S., Kim, S., & Choi, Y. H. (2022). MIST-Tacotron: End-to-End Emotional Speech Synthesis Using Mel-Spectrogram Image Style Transfer. IEEE Access, 10, 25455–25463. https://doi.org/10.1109/ACCESS.2022.3156093
