Abstract
Lip-reading bridges computer vision and speech processing by recognizing spoken words from visual lip movements alone. This study presents a streamlined framework combining facial landmark detection, image enhancement, and deep spatiotemporal modeling. We use MTCNN to detect and align lip regions, which are then enhanced with Real-ESRGAN for higher resolution and finer detail. The enhanced images feed into a 3D CNN with time-distributed layers and a bidirectional LSTM, trained using CTC loss for effective spatiotemporal feature learning and alignment-free transcription. Evaluated on the GRID corpus, our model achieves a character error rate (CER) of 2.3% on seen speakers and 5.2% on unseen speakers. Overall, it delivers state-of-the-art performance with a 5.2% CER and 95.8% accuracy, improving CER by 18.8% over LipNet. Notably, for unseen speakers, it reduces CER from LipNet's 9.4% to 5.2%, a 44.7% relative decrease, showcasing superior generalization and robustness. These results highlight that combining super-resolution with deep temporal modeling substantially enhances visual speech recognition accuracy and reliability.
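For readers unfamiliar with the metric, the following is a minimal sketch of how a character error rate and a relative CER reduction are commonly computed. It assumes the standard definition (character-level Levenshtein distance divided by reference length); the abstract does not state the exact evaluation script, and the function names and the sample sentence are illustrative only.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference characters into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Assumed definition: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional decrease of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical GRID-style sentence with one wrong character in the hypothesis.
    print(cer("bin blue at f two now", "bin blue at f too now"))
    # Unseen-speaker figures quoted in the abstract: 9.4% (LipNet) vs 5.2%.
    print(round(100 * relative_reduction(9.4, 5.2), 1))
```

Applied to the unseen-speaker figures quoted above, `relative_reduction(9.4, 5.2)` gives roughly 0.447, matching the reported 44.7% relative decrease.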
Citation
Ambati, S. R., & Akther, S. (2025). Deciphering Speech Through Vision: A Deep Learning Lip Reading System. In 3rd IEEE International Conference on Data Science and Network Security, ICDSNS 2025. Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICDSNS65743.2025.11168793