Self-Lifting: A Novel Framework for Unsupervised Voice-Face Association Learning

Abstract

Voice-face association learning (VFAL) aims to uncover the underlying connections between voices and faces. Most existing studies address this problem in a supervised manner and therefore cannot exploit the wealth of unlabeled video data. To address this limitation, we propose an unsupervised learning framework, Self-Lifting (SL), which learns from unlabeled video data. The framework alternates between two steps: "clustering" and "metric learning". In the first step, unlabeled videos are mapped into the feature space by a coarse model, and unsupervised clustering assigns a pseudo-label to each video. In the second step, the pseudo-labels serve as supervisory information to guide metric learning, which produces a refined model. These two steps are performed alternately to lift the model's performance. Experiments show that our framework can effectively use unlabeled video data for learning. On the VoxCeleb dataset, our approach achieves state-of-the-art results among unsupervised methods and performs competitively against supervised competitors. Our code is released on GitHub.
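The alternation the abstract describes can be summarized in a short loop. The Python sketch below is illustrative only, not the authors' released implementation: the encoder, the metric-learning step (train_metric), the number of clusters, and the choice of k-means as the clustering algorithm are all assumptions for the purpose of the example.

import numpy as np
from sklearn.cluster import KMeans

def self_lifting(videos, encoder, train_metric, n_clusters, n_rounds):
    """Alternate clustering and metric learning to refine the model.

    videos: unlabeled video clips (each providing a voice-face pair)
    encoder: maps a video into the joint feature space
             (initially the "coarse model")
    train_metric: a supervised metric-learning step driven by pseudo-labels
                  (hypothetical helper, not specified by the abstract)
    """
    for _ in range(n_rounds):
        # Step 1: embed all unlabeled videos with the current model ...
        feats = np.stack([encoder(v) for v in videos])
        # ... and cluster the embeddings; each cluster ID serves as the
        # pseudo-label for the videos assigned to it.
        pseudo_labels = KMeans(n_clusters=n_clusters).fit_predict(feats)
        # Step 2: use the pseudo-labels as supervisory information for
        # metric learning (e.g. pulling same-cluster voice-face pairs
        # together and pushing different-cluster pairs apart), yielding
        # the refined model that seeds the next round.
        encoder = train_metric(encoder, videos, pseudo_labels)
    return encoder

Each round's refined model produces better embeddings, which in turn yield cleaner clusters and more reliable pseudo-labels; this is the "self-lifting" effect the paper's name refers to.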

Citation (APA)

Chen, G., Zhang, D., Liu, T., & Du, X. (2022). Self-Lifting: A Novel Framework for Unsupervised Voice-Face Association Learning. In ICMR 2022 - Proceedings of the 2022 International Conference on Multimedia Retrieval (pp. 527–535). Association for Computing Machinery, Inc. https://doi.org/10.1145/3512527.3531364
