Voice Keyword Retrieval Method Using Attention Mechanism and Multimodal Information Fusion

Abstract

A cross-modal speech-text retrieval method based on an interactive-learning convolutional autoencoder (CAE) is proposed. First, an interactive-learning autoencoder architecture is designed, with speech and text as its two inputs and with encoding, hidden-layer interaction, and decoding stages, to model cross-modal speech-text retrieval. Then, the raw audio signal is preprocessed and Mel-frequency cepstral coefficient (MFCC) features are extracted, while a bag-of-words model is used to extract the text features; an attention mechanism then combines the text and speech features. Through the interactive-learning CAE, features shared by the speech and text modalities are obtained and fed to a modality classifier that identifies the modality information, thereby realizing cross-modal speech-text retrieval. Finally, experiments show that the proposed algorithm outperforms the comparison algorithms in recall, accuracy, and false-recognition rate.
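The abstract gives no implementation details, so the following is only a minimal sketch of the preprocessing and MFCC step, assuming librosa as the feature-extraction library and a conventional pre-emphasis coefficient of 0.97 (both are assumptions, not choices stated in the paper):

```python
# Hypothetical sketch of the audio preprocessing + MFCC step; librosa is an
# assumed library choice, not one named by the paper.
import numpy as np
import librosa

def extract_mfcc(wav_path: str, sr: int = 16000, n_mfcc: int = 13) -> np.ndarray:
    """Load audio, apply pre-emphasis, and return an (n_frames, n_mfcc) MFCC matrix."""
    signal, sr = librosa.load(wav_path, sr=sr)
    # Pre-emphasis boosts high frequencies before framing and the mel filterbank;
    # 0.97 is a conventional coefficient, assumed here.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    mfcc = librosa.feature.mfcc(y=emphasized, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T  # time frames along the first axis
```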
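The bag-of-words text features can be sketched the same way; scikit-learn's CountVectorizer and the example keyword phrases below are assumptions chosen only for illustration:

```python
# Hypothetical sketch of the bag-of-words text feature step using scikit-learn,
# which the paper does not name.
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["open the front door", "turn on the kitchen light"]  # illustrative keyword texts
vectorizer = CountVectorizer()
text_features = vectorizer.fit_transform(corpus).toarray()  # shape: (n_texts, vocab_size)
print(vectorizer.get_feature_names_out())
print(text_features)
```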
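Putting the pieces together, one plausible reading of the attention-based fusion, the interactive-learning CAE with a shared hidden layer, and the modality classifier is the PyTorch sketch below. The layer sizes, the two-way scalar attention, and the cross-reconstruction scheme are all assumptions rather than the paper's published architecture:

```python
# Rough PyTorch sketch of the abstract's pipeline: attention fusion of the two
# hidden codes, cross-modal reconstruction, and a modality classifier. Layer
# sizes and the exact fusion scheme are assumptions, not the paper's design.
import torch
import torch.nn as nn

class InteractiveCAE(nn.Module):
    def __init__(self, speech_dim: int, text_dim: int, shared_dim: int = 128):
        super().__init__()
        self.speech_enc = nn.Sequential(nn.Linear(speech_dim, 256), nn.ReLU(),
                                        nn.Linear(256, shared_dim))
        self.text_enc = nn.Sequential(nn.Linear(text_dim, 256), nn.ReLU(),
                                      nn.Linear(256, shared_dim))
        self.speech_dec = nn.Linear(shared_dim, speech_dim)
        self.text_dec = nn.Linear(shared_dim, text_dim)
        # Attention weights over the two modality codes (one scalar per modality).
        self.attn = nn.Linear(2 * shared_dim, 2)
        self.modality_clf = nn.Linear(shared_dim, 2)  # 0 = speech, 1 = text

    def forward(self, speech_x, text_x):
        h_s, h_t = self.speech_enc(speech_x), self.text_enc(text_x)
        # Attention mechanism: softmax weights decide how much each modality
        # contributes to the shared (fused) representation.
        w = torch.softmax(self.attn(torch.cat([h_s, h_t], dim=-1)), dim=-1)
        shared = w[:, :1] * h_s + w[:, 1:] * h_t
        # Both decoders reconstruct from the shared code, which is one way of
        # making the hidden layers of the two autoencoders interact.
        return (self.speech_dec(shared), self.text_dec(shared),
                self.modality_clf(h_s), self.modality_clf(h_t))

# Illustrative usage with random stand-ins for pooled MFCC / bag-of-words vectors.
model = InteractiveCAE(speech_dim=13, text_dim=50)
rec_s, rec_t, logit_s, logit_t = model(torch.randn(4, 13), torch.randn(4, 50))
```

In training, one would presumably combine reconstruction losses on both decoders with a modality-classification loss to shape the shared space; the abstract does not state the losses or their weighting.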

CITATION STYLE: APA

Zhang, H. (2021). Voice Keyword Retrieval Method Using Attention Mechanism and Multimodal Information Fusion. Scientific Programming, 2021. https://doi.org/10.1155/2021/6662841
