Self-Supervised Contrastive Learning for Singing Voices

Abstract

This study introduces self-supervised contrastive learning for acquiring feature representations of singing voices. To acquire robust representations in an unsupervised manner, regular self-supervised contrastive learning trains neural networks to bring the feature representation of a sample close to those of its computationally transformed versions. We likewise employ two transformations, pitch shifting and time stretching, chosen in view of the nature of singing voices. However, we use them in the opposite way: we train networks to push the representations of the transformed versions away from that of the original sample. The networks thus learn to discriminate the changes in vocal timbre introduced by pitch shifting without time stretching and the changes in singing expression introduced by time stretching without pitch shifting. Consequently, the acquired representations become attentive to vocal timbre and singing expression. We confirmed this through a singer identification task, in which we trained a classifier to learn the relationship between the feature representations and the corresponding singer labels of 500 singers. The employed transformations improved the top-1 classification accuracy by 9.12 percentage points, to 63.08%, compared with 53.96% when the feature representations fed to the classifier were acquired without the transformations. Furthermore, the proposed approach can be extended to acquire feature representations attentive to either vocal timbre or singing expression, but not the other, by changing how the transformations are incorporated. We explored in particular how such vocal timbre-oriented and singing expression-oriented feature representations relate to song genre, singer gender, and vocal technique, and confirmed that they successfully capture different aspects of singing voices.
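
The reversed use of transformations described in the abstract can be illustrated with a minimal sketch in Python with NumPy. The abstract does not specify the loss function, so this sketch assumes an InfoNCE-style objective; the function names, the temperature value, and the choice of the positive embedding (any embedding expected to match the anchor) are hypothetical and are not taken from the paper. The only point illustrated is that embeddings of pitch-shifted or time-stretched versions are placed among the negatives rather than the positives.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def loss_with_transformed_negatives(anchor, positive, transformed, temperature=0.1):
    """InfoNCE-style loss for a single anchor embedding (assumed objective).

    anchor:      (d,) embedding of the original singing-voice sample
    positive:    (d,) embedding expected to match the anchor (hypothetical choice)
    transformed: (k, d) embeddings of pitch-shifted or time-stretched versions,
                 treated here as NEGATIVES, i.e. pushed away from the anchor
    """
    a = l2_normalize(anchor)
    p = l2_normalize(positive)
    n = l2_normalize(transformed)

    pos_logit = np.dot(a, p) / temperature
    neg_logits = (n @ a) / temperature
    logits = np.concatenate(([pos_logit], neg_logits))

    # Numerically stable log-sum-exp over the positive and all negatives.
    m = logits.max()
    log_denominator = m + np.log(np.exp(logits - m).sum())

    # Negative log-probability of picking the positive among all candidates.
    return float(log_denominator - pos_logit)
```

Under this assumed objective, the loss is small when the transformed versions lie far from the anchor and grows as they approach it, which is the opposite of the regular setup in which transformed versions are pulled toward the anchor.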

Yakura, H., Watanabe, K., & Goto, M. (2022). Self-Supervised Contrastive Learning for Singing Voices. IEEE/ACM Transactions on Audio Speech and Language Processing, 30, 1614–1623. https://doi.org/10.1109/TASLP.2022.3169627
