Improving audio-visual speech recognition using Gabor recurrent neural networks

Abstract

The performance of speech recognition systems can be significantly improved when visual information is used in conjunction with the audio signal, especially in noisy environments. Prompted by the success of deep learning in solving Audio-Visual Speech Recognition (AVSR) problems, we propose a deep AVSR model based on a Long Short-Term Memory Bidirectional Recurrent Neural Network (LSTM-BRNN). The proposed deep AVSR model applies Gabor filters in both the audio and visual front-ends, combined under an Early Integration (EI) scheme; this model is termed the BRNN-AV model. The Gabor features simulate the underlying spatiotemporal processing chain that occurs in the Primary Auditory Cortex (PAC) and the Primary Visual Cortex (PVC), and are accordingly termed Gabor Audio Features (GAF) and Gabor Visual Features (GVF). The experimental results show that the deep Gabor LSTM-BRNN-based model achieves superior performance compared to GMM-HMM-based models that use the same front-ends. Furthermore, using GAF and GVF in the audio and visual front-ends attains a significant improvement in performance over traditional audio and visual features.
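To make the core idea of the front-ends concrete, a 2-D Gabor filter is a Gaussian envelope modulated by a sinusoidal carrier; banks of such filters at several orientations and scales yield features like the GVF described above. The sketch below is illustrative only (it is not the authors' implementation, and all parameter names are assumptions), showing how one such kernel can be built:

```python
import numpy as np

def gabor_kernel(size=15, wavelength=4.0, theta=0.0, sigma=2.0,
                 gamma=0.5, psi=0.0):
    """Real part of a 2-D Gabor filter: Gaussian envelope x cosine carrier.

    size       -- kernel side length (odd, so there is a center pixel)
    wavelength -- wavelength of the sinusoidal carrier, in pixels
    theta      -- filter orientation, in radians
    sigma      -- standard deviation of the Gaussian envelope
    gamma      -- spatial aspect ratio (ellipticity of the envelope)
    psi        -- phase offset of the carrier
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Rotate the coordinate grid to the filter's orientation.
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t**2 + (gamma * y_t) ** 2) / (2.0 * sigma**2))
    carrier = np.cos(2.0 * np.pi * x_t / wavelength + psi)
    return envelope * carrier
```

A feature extractor would convolve each video frame (or, for GAF, a spectro-temporal representation of the audio) with a bank of such kernels over several values of `theta` and `wavelength`, then pool the responses into a feature vector.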

Citation (APA)

Saudi, A. S., Khalil, M. I., & Abbas, H. M. (2019). Improving audio-visual speech recognition using gabor recurrent neural networks. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11377 LNAI, pp. 71–83). Springer Verlag. https://doi.org/10.1007/978-3-030-20984-1_7
