In this paper, we have analyzed different approaches to audio–visual speech recognition (AVSR). We focused primarily on comparing modality fusion techniques rather than other components of an AVSR pipeline (e.g., feature extraction methods). Three audio–visual modality integration methods were considered, namely GMM-CHMM, DNN-HMM, and end-to-end approaches, identified as the most promising and most commonly found in the scientific literature. Testing was performed on two datasets: the GRID corpus for English and the HAVRUS corpus for Russian. The obtained results once again confirm the superiority of neural network approaches over the others when enough data is available to train NN models effectively, as demonstrated by our experiments on the GRID dataset. On the smaller HAVRUS database, the best recognition results were achieved by the traditional GMM-CHMM approach. This paper presents our vision of the current state of the audio–visual speech recognition field and possible directions for further research.
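To illustrate the general idea of modality fusion discussed above, the sketch below shows stream-weight late fusion, a standard decision-level technique in AVSR in which per-class log-likelihoods from the audio and visual streams are combined with a reliability weight. This is a minimal, generic example; the weight value and the scores are illustrative assumptions, not the fusion schemes or results reported in the paper.

```python
import math

def late_fusion(audio_logp, visual_logp, lam=0.7):
    """Combine per-class log-likelihoods from two streams.

    lam is the audio stream weight (0..1); 1 - lam weights the
    visual stream. The value 0.7 here is an illustrative choice,
    not a tuned parameter from the paper.
    """
    return [lam * a + (1.0 - lam) * v
            for a, v in zip(audio_logp, visual_logp)]

# Hypothetical posteriors over three classes from each modality.
audio_logp = [math.log(p) for p in (0.6, 0.3, 0.1)]
visual_logp = [math.log(p) for p in (0.2, 0.5, 0.3)]

fused = late_fusion(audio_logp, visual_logp)
# Decision: pick the class with the highest fused score.
pred = max(range(len(fused)), key=fused.__getitem__)
```

With the audio stream weighted more heavily, the fused decision here follows the audio stream's top class; lowering `lam` shifts the decision toward the visual stream, which is the basic lever such fusion schemes tune against acoustic noise.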
Citation:
Ivanko, D., Ryumin, D., & Karpov, A. (2021). An experimental analysis of different approaches to audio–visual speech recognition and lip-reading. In Smart Innovation, Systems and Technologies (Vol. 187, pp. 197–209). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-981-15-5580-0_16