End-to-end singing voice synthesis (SVS) is attractive because it avoids the need for pre-aligned data. However, the automatically learned alignment between the singing voice and the lyrics rarely matches the duration information in the musical score, which leads to model instability or even failure to synthesize voice. To learn accurate alignment information automatically, this paper proposes an end-to-end SVS framework named Singing-Tacotron. The main difference between the proposed framework and Tacotron is that the synthesized speech can be controlled precisely by the musical score's duration information. First, we propose a global duration control attention mechanism for the SVS model, which controls each phoneme's duration. Second, a duration encoder is proposed to learn a set of global transition tokens from the musical score; at each decoding step, these tokens help the attention mechanism decide whether to move to the next phoneme or stay on the current one. Third, to further improve the model's stability, a dynamic filter is designed to help the model overcome noise interference and pay more attention to local context information. Subjective and objective evaluations verify the effectiveness of the method. Furthermore, the role of the global transition tokens and the effect of duration control are explored.
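To make the "move or stay" behavior concrete, the following is a minimal sketch of a forward-attention-style update gated by a per-step transition probability. It assumes the duration encoder exposes a scalar transition probability per decoding step; the function name, interface, and exact formulation are hypothetical and simplified relative to the paper's method.

```python
import numpy as np

def step_attention(prev_alpha, transition_prob):
    """One decoding step of a transition-gated monotonic attention update.

    prev_alpha: (N,) attention weights over N phonemes at step t-1.
    transition_prob: scalar in [0, 1], assumed to come from the duration
        encoder's transition token -- the probability of advancing to the
        next phoneme (hypothetical interface, not the paper's exact math).
    """
    # Mass that stays on the current phoneme vs. shifts one phoneme forward.
    stay = (1.0 - transition_prob) * prev_alpha
    move = transition_prob * np.concatenate(([0.0], prev_alpha[:-1]))
    alpha = stay + move
    return alpha / alpha.sum()  # renormalize to a distribution

# Starting from a one-hot alignment on phoneme 0:
alpha = np.array([1.0, 0.0, 0.0])
alpha = step_attention(alpha, transition_prob=1.0)  # advances to phoneme 1
```

With `transition_prob` near 1 the alignment advances one phoneme per step, and with it near 0 the alignment stays put, so a sequence of transition probabilities derived from the score's note durations directly controls how many decoder steps each phoneme receives.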
Wang, T., Fu, R., Yi, J., Wen, Z., & Tao, J. (2022). Singing-Tacotron: Global Duration Control Attention and Dynamic Filter for End-to-end Singing Voice Synthesis. In DDAM 2022 - Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia (pp. 53–59). Association for Computing Machinery, Inc. https://doi.org/10.1145/3552466.3556534