Speech Guided Disentangled Visual Representation Learning for Lip Reading

Abstract

Lip reading has advanced rapidly in recent years. However, existing methods have two main problems: 1) there is no explicit mechanism to ensure that the extracted visual features relate only to lip movements, so performance degrades when the video contains large variations, such as changes in the speaker's pose; 2) large quantities of labeled data are required to achieve good results, and such data are difficult to obtain for low-resource languages. In this paper, we propose a new visual representation learning method, SVLR, which extracts disentangled, lip-movement-related visual features for the lip reading task by exploiting large quantities of unlabeled audio-visual data. This is achieved by explicitly disentangling the visual feature into a lip-movement-related part and a speaker-identity-related part. Predicting speech from the disentangled features then serves as the training objective for optimizing the model parameters. After this cross-modal training, the video encoder that extracts the lip-movement features is used as a feature extractor for the lip reading task. Extensive experiments on several word-level lip reading benchmarks demonstrate the effectiveness of the proposed method.
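
To make the pipeline described in the abstract more concrete, the sketch below shows one way such a cross-modal training step could look in PyTorch. It is only an illustration assembled from the abstract: the module names (DisentangledVideoEncoder, SpeechPredictor), layer sizes, the choice of mel-spectrograms as the speech target, and the MSE loss are all assumptions, not the paper's actual implementation.

# Minimal sketch of the SVLR idea from the abstract. All architectural
# choices below (layer sizes, mel-spectrogram targets, MSE loss) are
# assumptions for illustration only.
import torch
import torch.nn as nn

class DisentangledVideoEncoder(nn.Module):
    """Splits each frame's visual embedding into a lip-movement part and a
    speaker-identity part (hypothetical architecture)."""
    def __init__(self, feat_dim=512, content_dim=256, identity_dim=128):
        super().__init__()
        # Placeholder frame encoder; a real front-end would be a 3D-CNN/ResNet.
        self.frame_encoder = nn.Sequential(
            nn.Flatten(start_dim=2),          # (B, T, C*H*W)
            nn.LazyLinear(feat_dim),
            nn.ReLU(),
        )
        self.to_content = nn.Linear(feat_dim, content_dim)    # lip-movement branch
        self.to_identity = nn.Linear(feat_dim, identity_dim)  # speaker-identity branch

    def forward(self, frames):                # frames: (B, T, C, H, W)
        h = self.frame_encoder(frames)
        return self.to_content(h), self.to_identity(h)

class SpeechPredictor(nn.Module):
    """Predicts acoustic frames (here, mel-spectrogram bins) from the
    disentangled visual features; used only as the training objective."""
    def __init__(self, content_dim=256, identity_dim=128, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(content_dim + identity_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_mels),
        )

    def forward(self, content, identity):
        return self.net(torch.cat([content, identity], dim=-1))

# One cross-modal training step on unlabeled audio-visual data (sketch).
encoder, predictor = DisentangledVideoEncoder(), SpeechPredictor()
opt = torch.optim.Adam(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-4)

frames = torch.randn(2, 16, 1, 64, 64)   # dummy grayscale lip crops
mels = torch.randn(2, 16, 80)            # dummy time-aligned mel-spectrogram targets

content, identity = encoder(frames)
loss = nn.functional.mse_loss(predictor(content, identity), mels)
loss.backward()
opt.step()

# After training, the frame encoder plus the to_content branch would serve
# as the lip-reading feature extractor.

After such training, only the lip-movement branch of the video encoder would be kept and fine-tuned (or frozen) as the visual front-end of a word-level lip reading classifier, which matches the abstract's description of using the video encoder as a feature extractor.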

Citation (APA)

Zhao, Y., Ma, C., Feng, Z., & Song, M. (2021). Speech Guided Disentangled Visual Representation Learning for Lip Reading. In ICMI 2021 - Proceedings of the 2021 International Conference on Multimodal Interaction (pp. 687–691). Association for Computing Machinery, Inc. https://doi.org/10.1145/3462244.3479952
