Acoustic word embeddings for end-to-end speech synthesis

Feiyu Shen; Chenpeng Du; Kai Yu

Journal article, open access

Applied Sciences (Switzerland) (2021) 11(19)

DOI: 10.3390/app11199010

3 Citations

6 Readers

Abstract

Most recent end-to-end speech synthesis systems use phonemes as acoustic input tokens and ignore which word each phoneme comes from. However, many words have their own prosody type, which may significantly affect naturalness. Prior work has used pre-trained linguistic word embeddings as input to text-to-speech (TTS) systems. However, since linguistic information is not directly relevant to how words are pronounced, the resulting improvement in TTS quality is mild. In this paper, we propose a novel and effective way of jointly training acoustic phone and word embeddings for end-to-end TTS systems. Experiments on the LJSpeech dataset show that the acoustic word embeddings dramatically decrease both the training and validation loss in phone-level prosody prediction. Subjective evaluations of naturalness show that a system incorporating acoustic word embeddings significantly outperforms both the pure phone-based system and the TTS system with pre-trained linguistic word embeddings.
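The idea of telling the synthesizer which word each phone belongs to can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how phone and word embeddings are combined, so element-wise addition is an assumption, and the lexicon, table sizes, and function names (make_table, encode) are hypothetical. The tables here are fixed random vectors; in a jointly trained system both tables would be learnable parameters updated by the TTS loss.

```python
import random


def make_table(vocab, dim, seed):
    """Create a toy embedding table: symbol -> list of `dim` floats."""
    rng = random.Random(seed)
    return {sym: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for sym in vocab}


def encode(words, lexicon, phone_table, word_table):
    """Return one combined input vector per phone.

    Each phone embedding is added to the embedding of the word it
    belongs to, so the same phone gets a different input vector in
    different words. Addition is an assumption made for this sketch;
    concatenation would be an equally plausible choice.
    """
    inputs = []
    for word in words:
        w_vec = word_table[word]
        for phone in lexicon[word]:
            p_vec = phone_table[phone]
            inputs.append([p + w for p, w in zip(p_vec, w_vec)])
    return inputs


# Hypothetical toy lexicon (word -> phone sequence).
LEXICON = {"no": ["N", "OW"], "know": ["N", "OW"], "go": ["G", "OW"]}
DIM = 4
PHONES = sorted({p for phones in LEXICON.values() for p in phones})

phone_table = make_table(PHONES, DIM, seed=0)
word_table = make_table(sorted(LEXICON), DIM, seed=1)

inputs = encode(["no", "go"], LEXICON, phone_table, word_table)
```

In this toy lexicon "no" and "know" share the same phone sequence, so a phone-only system would receive identical inputs for both. With the word embedding added, the two inputs differ, which is the extra information a prosody predictor could exploit.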

Cite (APA)

Shen, F., Du, C., & Yu, K. (2021). Acoustic word embeddings for end-to-end speech synthesis. Applied Sciences (Switzerland), 11(19). https://doi.org/10.3390/app11199010
