MRD-Net: Multi-Modal Residual Knowledge Distillation for Spoken Question Answering


Abstract

Spoken question answering (SQA) has recently drawn considerable attention in the speech community. It requires systems to find correct answers within given spoken passages. Common SQA systems consist of an automatic speech recognition (ASR) module and a text-based question answering module; however, previous methods suffer severe performance degradation due to ASR errors. To alleviate this problem, this work proposes a novel multi-modal residual knowledge distillation method (MRD-Net), which further distills knowledge at the acoustic level from an audio-assistant (Audio-A). Specifically, we utilize a teacher (T) trained on manual transcriptions to guide the training of a student (S) on ASR transcriptions. We also show that introducing an Audio-A helps this procedure by learning the residual errors between T and S. Moreover, we propose a simple yet effective attention mechanism that adaptively leverages audio-text features as new deep attention knowledge to boost network performance. Extensive experiments demonstrate that the proposed MRD-Net achieves superior results compared with state-of-the-art methods on three spoken question answering benchmark datasets.
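To make the residual-distillation idea in the abstract concrete, the following is a minimal PyTorch sketch, not the paper's actual MRD-Net implementation: an assistant network's output is trained to predict the residual between teacher and student logits, and a simple gated attention fuses audio and text features. The class names, the sigmoid-gate fusion form, and the unweighted sum of losses are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualDistillationLoss(nn.Module):
    """Hypothetical sketch of residual knowledge distillation:
    the audio assistant predicts the teacher-student logit gap,
    and the corrected student is distilled against the teacher."""

    def __init__(self, temperature=2.0):
        super().__init__()
        self.T = temperature

    def forward(self, student_logits, teacher_logits, assistant_residual):
        # Student prediction corrected by the assistant's residual estimate.
        corrected = student_logits + assistant_residual
        # Soft-target distillation loss against the frozen teacher.
        kd = F.kl_div(
            F.log_softmax(corrected / self.T, dim=-1),
            F.softmax(teacher_logits.detach() / self.T, dim=-1),
            reduction="batchmean",
        ) * (self.T ** 2)
        # Assistant regression target: the teacher-student residual.
        residual_target = (teacher_logits - student_logits).detach()
        res = F.mse_loss(assistant_residual, residual_target)
        return kd + res

class AudioTextAttention(nn.Module):
    """Assumed gated-attention fusion of audio and text features."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 1)

    def forward(self, text_feat, audio_feat):
        # text_feat, audio_feat: (batch, seq_len, dim)
        alpha = torch.sigmoid(self.gate(torch.cat([text_feat, audio_feat], dim=-1)))
        return alpha * text_feat + (1 - alpha) * audio_feat

# Usage with hypothetical shapes: batch of 8, 5 answer classes.
s, t, a = torch.randn(8, 5), torch.randn(8, 5), torch.randn(8, 5)
loss = ResidualDistillationLoss()(s, t, a)
```

The key design point this sketch illustrates is that the assistant only needs to model the error the student makes relative to the teacher, which is typically easier than modeling the full answer distribution from audio alone.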

Cite

APA: You, C., Chen, N., & Zou, Y. (2021). MRD-Net: Multi-Modal Residual Knowledge Distillation for Spoken Question Answering. In IJCAI International Joint Conference on Artificial Intelligence (pp. 3985–3991). International Joint Conferences on Artificial Intelligence. https://doi.org/10.24963/ijcai.2021/549
