WaveNet with Cross-Attention for Audiovisual Speech Recognition

Abstract

In this paper, WaveNet with cross-attention is proposed for Audio-Visual Automatic Speech Recognition (AV-ASR) to address the multimodal feature fusion and frame alignment problems between the two data streams. WaveNet is usually used for speech generation and speech recognition; here, we extend it to audiovisual speech recognition and introduce a cross-attention mechanism at different places in WaveNet for feature fusion. The proposed cross-attention mechanism seeks the visual feature frames that are correlated with each acoustic feature frame. The experimental results show that WaveNet with cross-attention reduces the Tibetan single-syllable error by about 4.5% and the English word error by about 39.8% relative to audio-only speech recognition, and reduces the Tibetan single-syllable error by about 35.1% and the English word error by about 21.6% relative to the conventional feature concatenation method for AV-ASR.
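The abstract does not include code, but as a minimal sketch of the fusion idea it describes, the following PyTorch module shows one common form of cross-attention in which each acoustic frame (query) attends over all visual frames (keys/values), so the two streams need not be frame-aligned in advance. The class name, dimensions, single attention head, and residual connection are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    # Hypothetical sketch: fuses visual features into the acoustic
    # stream by letting each acoustic frame attend over visual frames.
    def __init__(self, audio_dim, video_dim, attn_dim):
        super().__init__()
        self.q_proj = nn.Linear(audio_dim, attn_dim)   # queries from audio
        self.k_proj = nn.Linear(video_dim, attn_dim)   # keys from video
        self.v_proj = nn.Linear(video_dim, attn_dim)   # values from video
        self.out_proj = nn.Linear(attn_dim, audio_dim)
        self.scale = attn_dim ** -0.5

    def forward(self, audio, video):
        # audio: (batch, T_a, audio_dim); video: (batch, T_v, video_dim)
        q = self.q_proj(audio)                         # (B, T_a, D)
        k = self.k_proj(video)                         # (B, T_v, D)
        v = self.v_proj(video)                         # (B, T_v, D)
        # attention weights align each acoustic frame with visual frames,
        # even when T_a != T_v (audio typically has a higher frame rate)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        fused = attn @ v                               # (B, T_a, D)
        # residual connection keeps the acoustic stream dominant
        return audio + self.out_proj(fused)

For example, with 200 acoustic frames and 50 visual frames per utterance (arbitrary numbers for illustration), fusion = CrossAttentionFusion(80, 512, 128) maps audio of shape (4, 200, 80) and video of shape (4, 50, 512) to an output of shape (4, 200, 80), i.e. a fused representation aligned to the audio time axis, which could then feed a WaveNet-style stack.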

Citation (APA)

Wang, H., Gao, F., Zhao, Y., & Wu, L. (2020). WaveNet with Cross-Attention for Audiovisual Speech Recognition. IEEE Access, 8, 169160–169168. https://doi.org/10.1109/ACCESS.2020.3024218
