Enhancing local dependencies for transformer-based text-to-speech via hybrid lightweight convolution

Citations: 3
Readers (Mendeley): 16

This article is free to access.

Abstract

Owing to its powerful self-attention mechanism, the Transformer network has achieved considerable success across many sequence modeling tasks and has become one of the most popular architectures in text-to-speech (TTS). Vanilla self-attention excels at capturing long-range dependencies but struggles to model stable short-range dependencies, which are crucial for speech synthesis, where local audio signals are highly correlated. To address this problem, we propose the hybrid lightweight convolution (HLC), which fully exploits the local structure of a sequence, and combine it with self-attention to improve Transformer-based TTS. Experimental results show that the modified model achieves better performance in both objective and subjective evaluations. We also demonstrate that a more compact TTS model can be built by combining self-attention with the proposed hybrid lightweight convolution. Moreover, the method is potentially applicable to other sequence modeling tasks.
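To make the general idea concrete, the sketch below is a minimal illustration (not the authors' implementation) of pairing multi-head self-attention with a lightweight depthwise convolution whose kernels are softmax-normalized and shared across channel groups, mixing the two branches with a learned gate. The layer structure, gating scheme, and hyperparameters are assumptions for illustration; the exact HLC design in the paper may differ.

# Illustrative sketch only: self-attention (global context) combined with a
# lightweight convolution (local context). All module names, the gating
# mechanism, and hyperparameters are assumptions, not the paper's HLC.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LightweightConv(nn.Module):
    """Depthwise 1-D convolution with softmax-normalized kernels that are
    shared across groups of channels (one kernel per head)."""

    def __init__(self, channels, kernel_size=3, num_heads=8):
        super().__init__()
        assert channels % num_heads == 0
        self.channels = channels
        self.num_heads = num_heads
        self.kernel_size = kernel_size
        self.weight = nn.Parameter(torch.randn(num_heads, 1, kernel_size))

    def forward(self, x):                      # x: (batch, time, channels)
        w = F.softmax(self.weight, dim=-1)     # normalize along the kernel dim
        # Share each of the H kernels across channels // H consecutive channels.
        w = w.repeat_interleave(self.channels // self.num_heads, dim=0)
        x = x.transpose(1, 2)                  # (batch, channels, time)
        y = F.conv1d(x, w, padding=self.kernel_size // 2, groups=self.channels)
        return y.transpose(1, 2)               # back to (batch, time, channels)


class HybridAttentionConvLayer(nn.Module):
    """Runs self-attention and lightweight convolution in parallel and
    blends their outputs with a learned sigmoid gate (an assumed design)."""

    def __init__(self, d_model=256, num_heads=4, kernel_size=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.conv = LightweightConv(d_model, kernel_size, num_heads)
        self.gate = nn.Linear(2 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, time, d_model)
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        conv_out = self.conv(x)
        g = torch.sigmoid(self.gate(torch.cat([attn_out, conv_out], dim=-1)))
        return self.norm(x + g * attn_out + (1 - g) * conv_out)


if __name__ == "__main__":
    layer = HybridAttentionConvLayer()
    states = torch.randn(2, 50, 256)           # dummy encoder states
    print(layer(states).shape)                 # torch.Size([2, 50, 256])

In this sketch the gate lets each position weight global (attention) against local (convolution) context, which is one plausible way to realize the "hybrid" combination described in the abstract.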

Citation (APA)

Zhao, W., He, T., & Xu, L. (2021). Enhancing local dependencies for transformer-based text-to-speech via hybrid lightweight convolution. IEEE Access, 9, 42762–42770. https://doi.org/10.1109/ACCESS.2021.3065736
