Voice-face association learning (VFAL) aims to learn the latent associations between voices and faces. Most existing studies address this problem in a supervised manner and therefore cannot exploit the wealth of unlabeled video data. To solve this problem, we propose an unsupervised learning framework, Self-Lifting (SL), which learns from unlabeled video data. The framework consists of two iterative steps: "clustering" and "metric learning". In the first step, unlabeled videos are mapped into a feature space by a coarse model, and unsupervised clustering assigns a pseudo-label to each video. In the second step, the pseudo-labels serve as supervisory information to guide metric learning, which produces a refined model. These two steps are performed alternately to progressively lift the model's performance. Experiments show that our framework can effectively learn from unlabeled video data. On the VoxCeleb dataset, our approach achieves state-of-the-art results among unsupervised methods and performs competitively against supervised competitors. Our code is released on GitHub.
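The abstract's alternation between clustering and metric learning can be made concrete with a short sketch. The following is a minimal illustration only, assuming k-means for the clustering step and a simple classification head over pseudo-labels as a stand-in for the paper's metric-learning loss; the `Encoder` network, feature dimensions, cluster count, and single-modality setup are all hypothetical simplifications (the paper jointly embeds voices and faces), not the authors' implementation.

```python
# Sketch of the Self-Lifting loop: alternate pseudo-labeling and refinement.
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class Encoder(nn.Module):
    """Toy stand-in for the voice/face embedding network."""
    def __init__(self, in_dim=512, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim))
    def forward(self, x):
        # L2-normalized embeddings, as is common in metric learning.
        return nn.functional.normalize(self.net(x), dim=-1)

def self_lifting(features, num_clusters=100, rounds=5, epochs=3):
    """Alternate (1) clustering to get pseudo-labels and (2) supervised refinement."""
    model = Encoder(features.shape[1])
    x = torch.tensor(features, dtype=torch.float32)
    for _ in range(rounds):
        # Step 1: embed all videos with the current (coarse) model, then
        # cluster the embeddings to assign a pseudo-label per video.
        with torch.no_grad():
            emb = model(x).numpy()
        pseudo = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(emb)
        y = torch.tensor(pseudo, dtype=torch.long)
        # Step 2: treat cluster ids as identities and refine the model.
        # A linear head + cross-entropy is used here as a simple proxy
        # for the paper's metric-learning objective.
        head = nn.Linear(128, num_clusters)
        opt = torch.optim.Adam(
            list(model.parameters()) + list(head.parameters()), lr=1e-3)
        for _ in range(epochs):
            loss = nn.functional.cross_entropy(head(model(x)), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

# Usage on random stand-in features (real inputs would be per-video
# voice/face features extracted from unlabeled videos):
model = self_lifting(np.random.randn(1000, 512).astype("float32"))
```

The key design point the sketch preserves is that the labels are never given: each round's clustering re-derives them from the current embedding space, so the refined model feeds better features back into the next clustering step.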
Chen, G., Zhang, D., Liu, T., & Du, X. (2022). Self-Lifting: A Novel Framework for Unsupervised Voice-Face Association Learning. In ICMR 2022 - Proceedings of the 2022 International Conference on Multimedia Retrieval (pp. 527–535). Association for Computing Machinery. https://doi.org/10.1145/3512527.3531364