Deciphering Speech Through Vision: A Deep Learning Lip Reading System

Abstract

Lip-reading bridges computer vision and speech processing by recognizing spoken words from visual lip movements alone. This study presents a streamlined framework combining facial landmark detection, image enhancement, and deep spatiotemporal modeling. We use MTCNN to detect and align lip regions, which are then enhanced with Real-ESRGAN for higher resolution and finer detail. The enhanced images feed into a 3D CNN with time-distributed layers and a bidirectional LSTM, trained using CTC loss for effective spatial-temporal feature learning and alignment-free transcription. Evaluated on the GRID corpus, our model achieves a character error rate (CER) of 2.3% on seen speakers and 5.2% on unseen speakers. Overall, it delivers state-of-the-art performance with a 5.2% CER and 95.8% accuracy, improving CER by 18.8% over LipNet. Notably, for unseen speakers, it reduces CER from LipNet's 9.4% to 5.2%, a 44.7% relative decrease, showcasing superior generalization and robustness. These results highlight that combining super-resolution with deep temporal modeling substantially enhances visual speech recognition accuracy and reliability.
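The abstract reports results as character error rate (CER) and as a relative decrease in CER. As a minimal sketch of how these two quantities are conventionally computed, assuming the standard definition of CER as character-level edit distance divided by reference length (the abstract does not state the exact evaluation procedure), the following Python may help. The function names and the sample sentence are illustrative and are not taken from the paper.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Corpus-level CER: total character edits over total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars


def relative_reduction(baseline: float, new: float) -> float:
    """Fractional decrease of `new` relative to `baseline`."""
    return (baseline - new) / baseline


# Illustrative sentence in the style of GRID commands (not real model output).
reference = ["bin blue at f two now"]
hypothesis = ["bin blue at f too now"]
sample_cer = cer(reference, hypothesis)

# Unseen-speaker figures quoted in the abstract: 9.4% (LipNet) vs 5.2%.
unseen_gain = round(relative_reduction(9.4, 5.2) * 100, 1)  # 44.7
```

Applied to the unseen-speaker figures quoted above, the relative reduction works out to (9.4 − 5.2) / 9.4 ≈ 44.7%, which matches the stated value.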

Cite

Ambati, S. R., & Akther, S. (2025). Deciphering Speech Through Vision: A Deep Learning Lip Reading System. In 3rd IEEE International Conference on Data Science and Network Security, ICDSNS 2025. Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICDSNS65743.2025.11168793