Self-Supervised Contrastive Learning for Singing Voices

Hiromu Yakura; Kento Watanabe; Masataka Goto

Journal Article, Open Access

IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 1614-1623 (2022)

DOI: 10.1109/TASLP.2022.3169627

Abstract

This study introduces self-supervised contrastive learning for acquiring feature representations of singing voices. To acquire robust representations in an unsupervised manner, conventional self-supervised contrastive learning trains neural networks to make the feature representation of a sample close to those of its computationally transformed versions. We likewise employ two transformations, pitch shifting and time stretching, chosen in view of the nature of singing voices. However, we use them in the opposite way: we train the networks to push the representations of the transformed versions away. The networks then attempt to discriminate the changes in vocal timbre introduced by pitch shifting without time stretching and the changes in singing expression introduced by time stretching without pitch shifting. Consequently, the acquired representations become attentive to both vocal timbre and singing expression. We confirmed this through a singer identification task in which a classifier was trained to learn the relationship between the feature representations and the corresponding singer labels of 500 singers. The employed transformations improved the top-1 classification accuracy by 9.12 percentage points, from 53.96% when the feature representations fed to the classifier were acquired without the transformations to 63.08% with them. Furthermore, by changing how the transformations are incorporated, the proposed approach can be extended to acquire feature representations that are attentive to either vocal timbre or singing expression, but not to the other. We explored the characteristics of these vocal timbre-oriented and singing expression-oriented feature representations with respect to song genre, singer gender, and vocal technique, and confirmed that they capture different aspects of singing voices.
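
The idea of treating transformed versions as negatives can be illustrated with a small numerical sketch. The abstract does not state the loss function, the network, or how positive pairs are formed, so everything below is an assumption made for illustration: a generic InfoNCE-style objective over cosine similarities, a hypothetical positive (`same_voice`, standing in for another embedding that should stay close to the anchor), and made-up three-dimensional embeddings in place of network outputs. It is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE-style loss (an assumed form, not taken from the paper).

    The loss is small when the anchor is similar to the positive and
    dissimilar to every negative.
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Made-up embeddings; in practice these would come from a trained network.
anchor = [1.0, 0.0, 0.0]          # original singing-voice excerpt
same_voice = [0.9, 0.1, 0.0]      # hypothetical positive example
pitch_shifted = [0.0, 1.0, 0.0]   # embedding of the pitch-shifted version
time_stretched = [0.0, 0.0, 1.0]  # embedding of the time-stretched version

# Case 1: the transformed versions are embedded far from the anchor.
loss_separated = contrastive_loss(
    anchor, same_voice, [pitch_shifted, time_stretched]
)

# Case 2: the network ignores the transformations, so both transformed
# versions land on the anchor itself.
loss_collapsed = contrastive_loss(anchor, same_voice, [anchor, anchor])

print(f"transformed versions pushed away: {loss_separated:.4f}")
print(f"transformed versions collapsed:   {loss_collapsed:.4f}")
```

Under these assumptions the loss is lower when the pitch-shifted and time-stretched versions are embedded away from the original, which is the pressure that would make a representation sensitive to the timbre and expression changes those transformations introduce.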

Citation (APA)

Yakura, H., Watanabe, K., & Goto, M. (2022). Self-Supervised Contrastive Learning for Singing Voices. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30, 1614–1623. https://doi.org/10.1109/TASLP.2022.3169627
