The goal of this work is to learn discriminative visual representations for lip reading without access to manual text annotations. Recent advances in cross-modal self-supervised learning have shown that the corresponding audio can serve as a supervisory signal for learning effective visual representations for lip reading. However, existing methods exploit only the natural synchronization between the video and its corresponding audio. We observe that both video and audio are in fact composed of speech-related information, identity-related information, and modality information. To make the visual representations (i) more discriminative for lip reading and (ii) invariant to identity and modality, we propose a novel self-supervised learning framework called Adversarial Dual-Contrast Self-Supervised Learning (ADC-SSL), which goes beyond previous methods by explicitly forcing the visual representations to be disentangled from speech-unrelated information. Experimental results clearly show that the proposed method outperforms state-of-the-art cross-modal self-supervised baselines by a large margin. Moreover, ADC-SSL outperforms its supervised counterpart without any fine-tuning.
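The abstract builds on cross-modal contrastive learning, which treats synchronized video/audio clip pairs as positives and mismatched pairs as negatives. Below is a minimal NumPy sketch of a generic InfoNCE-style cross-modal contrastive loss of the kind such methods build on; the function name, batch layout, and temperature value are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def cross_modal_info_nce(video_emb, audio_emb, temperature=0.1):
    """Generic InfoNCE-style loss over a batch of (video, audio) pairs.

    Row i of video_emb and row i of audio_emb are assumed to come from
    the same synchronized clip (a positive pair); all other rows in the
    batch act as negatives. This is an illustrative sketch, not the
    ADC-SSL objective itself.
    """
    # L2-normalize each embedding so the dot product is cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    # Pairwise similarity matrix, scaled by the temperature.
    logits = v @ a.T / temperature
    # Log-softmax over each row; matching pairs lie on the diagonal.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the diagonal as the target class.
    return -np.mean(np.diag(log_prob))
```

Minimizing this loss pulls each video embedding toward its own audio clip and pushes it away from the other clips in the batch; aligned batches therefore yield a lower loss than misaligned ones.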
Sheng, C., Pietikäinen, M., Tian, Q., & Liu, L. (2021). Cross-modal Self-Supervised Learning for Lip Reading: When Contrastive Learning meets Adversarial Training. In MM 2021 - Proceedings of the 29th ACM International Conference on Multimedia (pp. 2456–2464). Association for Computing Machinery, Inc. https://doi.org/10.1145/3474085.3475415