An experimental analysis of different approaches to audio–visual speech recognition and lip-reading

Abstract

In this paper, we have analyzed different approaches to audio–visual speech recognition (AVSR). We focused mainly on testing different modality fusion techniques rather than on other parts of the AVSR pipeline (e.g., feature extraction methods). Three audio–visual modality integration methods were under consideration, namely GMM-CHMM, DNN-HMM, and end-to-end approaches, identified as the most promising and most commonly encountered in the scientific literature. Testing was performed on two datasets: the GRID corpus for English and the HAVRUS corpus for Russian. The obtained results once again confirm the superiority of neural network approaches over the others when enough data is available to effectively train NN models, as demonstrated by our experiments on the GRID dataset. On the more compact HAVRUS database, the best recognition results were achieved by the traditional GMM-CHMM approach. The paper concludes with our vision of the current state of the audio–visual speech recognition field and possible directions for further research.
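
The abstract contrasts hybrid (GMM-CHMM, DNN-HMM) and end-to-end fusion strategies without implementation detail. As a purely illustrative aid, the sketch below shows one common end-to-end, feature-level fusion pattern in PyTorch: two unimodal encoders whose embeddings are concatenated before a joint classifier. All names, layer sizes, and the concatenation-based fusion choice are assumptions made for clarity; this is not the architecture evaluated in the paper.

```python
# Minimal sketch of end-to-end feature-level audio-visual fusion.
# Hypothetical dimensions: e.g., 39-dim MFCC audio frames, 64-dim
# lip-ROI visual features, and a small word-level output vocabulary.
import torch
import torch.nn as nn

class AVFusionModel(nn.Module):
    def __init__(self, audio_dim=39, video_dim=64, hidden_dim=128, num_classes=51):
        super().__init__()
        # Separate recurrent encoders, one per modality.
        self.audio_enc = nn.GRU(audio_dim, hidden_dim, batch_first=True)
        self.video_enc = nn.GRU(video_dim, hidden_dim, batch_first=True)
        # Joint classifier over the concatenated modality embeddings.
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, audio, video):
        # audio: (batch, T_audio, audio_dim); video: (batch, T_video, video_dim).
        # The sequences may have different frame rates; only the final
        # hidden state of each encoder is used here, so no alignment is needed.
        _, h_a = self.audio_enc(audio)
        _, h_v = self.video_enc(video)
        fused = torch.cat([h_a[-1], h_v[-1]], dim=-1)  # feature-level fusion
        return self.classifier(fused)

# Usage with random tensors standing in for real features:
model = AVFusionModel()
logits = model(torch.randn(2, 100, 39), torch.randn(2, 25, 64))
print(logits.shape)  # torch.Size([2, 51])
```

By contrast, a decision-level (late) fusion system would run two independent recognizers and combine their output scores, while a coupled-HMM approach (as in GMM-CHMM) links two modality-specific state chains during decoding.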

Citation (APA)

Ivanko, D., Ryumin, D., & Karpov, A. (2021). An experimental analysis of different approaches to audio–visual speech recognition and lip-reading. In Smart Innovation, Systems and Technologies (Vol. 187, pp. 197–209). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-981-15-5580-0_16
