Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation

Qiushi Zhu; Jie Zhang; Yu Gu; Yuchen Hu; Lirong Dai

Conference Proceedings, Open Access

Proceedings of the AAAI Conference on Artificial Intelligence (2024) 38(17) 19768-19776

DOI: 10.1609/aaai.v38i17.29951

Abstract

Self-supervised speech pre-training methods have developed rapidly in recent years and have proven very effective for many near-field single-channel speech tasks. However, far-field multichannel speech processing suffers from the scarcity of labeled multichannel data and from complex ambient noise. The efficacy of self-supervised learning for far-field multichannel and multi-modal speech processing has not been well explored. Considering that visual information helps improve speech recognition performance in noisy scenes, we propose a multichannel multi-modal speech self-supervised learning framework, AV-wav2vec2, which takes video and multichannel audio data as inputs. First, we propose a multi-path structure that processes the multichannel audio streams and a visual stream in parallel, with intra- and inter-channel contrastive losses as training targets to fully exploit the spatiotemporal information in multichannel speech data. Second, building on contrastive learning, we additionally use single-channel audio data, trained jointly, to improve the speech representation. Finally, we use a Chinese multichannel multi-modal dataset recorded in real scenarios to validate the effectiveness of the proposed method on audiovisual speech recognition (AVSR), automatic speech recognition (ASR), visual speech recognition (VSR), and audiovisual speaker diarization (AVSD) tasks.
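
The intra- and inter-channel contrastive losses mentioned above can be illustrated with a generic InfoNCE objective. The following is only a minimal sketch under stated assumptions, not the authors' implementation: it assumes that "intra-channel" means contrasting a channel's context vectors against that same channel's targets, that "inter-channel" means contrasting them against another channel's targets at the same time step, and that other time steps serve as negatives. The function names, the temperature value, and the (channels, frames, dim) layout are hypothetical, and the visual stream is left out.

```python
import numpy as np


def info_nce(context, targets, temperature=0.1):
    """Generic InfoNCE loss: targets[t] is the positive for context[t];
    every other row of targets is treated as a negative.
    Both inputs have shape (T, D) and are assumed to be non-zero vectors."""
    # L2-normalise so the dot product becomes a cosine similarity.
    c = context / np.linalg.norm(context, axis=1, keepdims=True)
    q = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = (c @ q.T) / temperature                  # shape (T, T)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positives sit on the diagonal (same time step).
    return float(-np.mean(np.diag(log_probs)))


def multichannel_contrastive_loss(contexts, targets, temperature=0.1):
    """Hypothetical helper: contexts and targets have shape (C, T, D).
    Returns (intra, inter) losses averaged over channels / channel pairs."""
    n_channels = contexts.shape[0]
    # Intra-channel: each channel against its own targets.
    intra = np.mean([
        info_nce(contexts[c], targets[c], temperature)
        for c in range(n_channels)
    ])
    # Inter-channel: each channel against every other channel's targets.
    inter_terms = [
        info_nce(contexts[i], targets[j], temperature)
        for i in range(n_channels)
        for j in range(n_channels)
        if i != j
    ]
    inter = np.mean(inter_terms) if inter_terms else 0.0
    return float(intra), float(inter)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ctx = rng.normal(size=(2, 8, 16))               # 2 channels, 8 frames, 16 dims
    tgt = ctx + 0.1 * rng.normal(size=ctx.shape)    # noisy copies as stand-in targets
    print(multichannel_contrastive_loss(ctx, tgt))
```

In a training setup of this kind the two terms would typically be combined as a weighted sum; the weighting used in the paper is not given in the abstract, so none is assumed here.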

Citation (APA)

Zhu, Q., Zhang, J., Gu, Y., Hu, Y., & Dai, L. (2024). Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 38, pp. 19768–19776). Association for the Advancement of Artificial Intelligence. https://doi.org/10.1609/aaai.v38i17.29951
