A Lip Reading Method Based on 3D Convolutional Vision Transformer


Abstract

Lip reading has received increasing attention in recent years. It infers the content of speech from the movement of the speaker's lips. The rapid development of deep learning has driven progress in lip reading. However, because lip reading must process continuous video frames, it needs to capture both the correlations between adjacent frames and the correlations between temporally distant frames. Moreover, lip reading focuses on subtle changes in the lips and their surrounding region, so it must extract fine-grained features from small images. As a result, the performance of machine lip reading is generally low, and research progress has been slow. To improve the performance of machine lip reading, we propose a lip reading method based on a 3D convolutional vision transformer (3DCvT), which combines a vision transformer with 3D convolution to extract spatio-temporal features from continuous images, taking full advantage of the properties of convolutions and transformers to extract local and global features effectively. The extracted features are then sent to a Bidirectional Gated Recurrent Unit (BiGRU) for sequence modeling. We demonstrate the effectiveness of our method on the large-scale lip reading datasets LRW and LRW-1000, achieving state-of-the-art performance.
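To make the spatio-temporal feature extraction concrete, the following is a minimal illustrative sketch (not the authors' implementation) of a naive single-channel 3D convolution over a clip of mouth-region frames. The frame count (29) and crop size (88×88), the kernel size, and the function name are illustrative assumptions; the point is only that the kernel spans several frames at once, so each output value mixes temporal and spatial context.

```python
import numpy as np

def conv3d_single(video, kernel, stride=1):
    """Naive 'valid' 3D convolution: one input channel, one filter.
    video: (T, H, W) clip; kernel: (kt, kh, kw) spatio-temporal filter.
    Each output value aggregates kt consecutive frames, so temporal
    context (adjacent-frame correlation) is baked into the feature."""
    T, H, W = video.shape
    kt, kh, kw = kernel.shape
    out = np.zeros(((T - kt) // stride + 1,
                    (H - kh) // stride + 1,
                    (W - kw) // stride + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                patch = video[t * stride:t * stride + kt,
                              i * stride:i * stride + kh,
                              j * stride:j * stride + kw]
                out[t, i, j] = np.sum(patch * kernel)  # dot product over the 3D window
    return out

# Hypothetical input: 29 grayscale mouth-crop frames of 88x88 pixels.
video = np.random.rand(29, 88, 88)
kernel = np.random.rand(3, 5, 5)   # spans 3 frames and a 5x5 spatial window
features = conv3d_single(video, kernel)
print(features.shape)  # (27, 84, 84)
```

In a full pipeline such as the one the abstract describes, feature maps like these would be flattened into token sequences for the transformer stages, and the per-frame features finally fed to a BiGRU for sequence modeling.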

Citation (APA)

Wang, H., Pu, G., & Chen, T. (2022). A Lip Reading Method Based on 3D Convolutional Vision Transformer. IEEE Access, 10, 77205–77212. https://doi.org/10.1109/ACCESS.2022.3193231
