In this paper, we have analyzed different approaches to audio–visual speech recognition (AVSR). We focused primarily on comparing modality fusion techniques rather than other components of an AVSR pipeline (e.g., feature extraction methods). Three audio–visual modality integration methods were considered, namely GMM-CHMM, DNN-HMM, and end-to-end approaches, identified as the most promising and most commonly found in the scientific literature. Testing was performed on two datasets: the GRID corpus for English and the HAVRUS corpus for Russian. The obtained results once again confirm the superiority of neural network approaches over the others when enough data is available to train NN models effectively, as demonstrated by our experiments on the GRID dataset. On the smaller HAVRUS database, the best recognition results were achieved by the traditional GMM-CHMM approach. This paper presents our vision of the current state of the audio–visual speech recognition field and possible directions for further research.
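To illustrate the general idea of modality fusion discussed above, the sketch below shows stream-weight late fusion, a standard decision-level technique in AVSR in which per-class log-likelihoods from the audio and visual streams are combined with a reliability weight. This is a minimal, generic example; the weight value and the scores are illustrative assumptions, not the fusion schemes or results reported in the paper.

```python
import math

def late_fusion(audio_logp, visual_logp, lam=0.7):
    """Combine per-class log-likelihoods from two streams.

    lam is the audio stream weight (0..1); 1 - lam weights the
    visual stream. The value 0.7 here is an illustrative choice,
    not a tuned parameter from the paper.
    """
    return [lam * a + (1.0 - lam) * v
            for a, v in zip(audio_logp, visual_logp)]

# Hypothetical posteriors over three classes from each modality.
audio_logp = [math.log(p) for p in (0.6, 0.3, 0.1)]
visual_logp = [math.log(p) for p in (0.2, 0.5, 0.3)]

fused = late_fusion(audio_logp, visual_logp)
# Decision: pick the class with the highest fused score.
pred = max(range(len(fused)), key=fused.__getitem__)
```

With the audio stream weighted more heavily, the fused decision here follows the audio stream's top class; lowering `lam` shifts the decision toward the visual stream, which is the basic lever such fusion schemes tune against acoustic noise.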
Citation:
Ivanko, D., Ryumin, D., & Karpov, A. (2021). An experimental analysis of different approaches to audio–visual speech recognition and lip-reading. In Smart Innovation, Systems and Technologies (Vol. 187, pp. 197–209). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-981-15-5580-0_16