Cross-modal Deep Learning Applications: Audio-Visual Retrieval

Abstract

Recently, deep neural networks have proven to be a powerful architecture for capturing the nonlinear distributions of high-dimensional multimedia data such as images, video, text, and audio, and this capability extends naturally to multi-modal data. The question of how to make full use of such multimedia data leads to an important research direction: cross-modal learning. In this paper, we introduce a content-based method for the audio and video modalities, implemented with a novel two-branch neural network that learns joint embeddings in a shared subspace for computing the similarity between the two modalities. In particular, the contributions of the proposed method are threefold: i) a feature selection model is used to choose the top-k audio and visual feature representations; ii) a novel combination of training losses enforcing inter-modal similarity and intra-modal invariance is used; iii) because no paired video-music dataset is available, we construct a dataset of video-music pairs from the YouTube-8M and MER31K datasets. Experiments show that the proposed model outperforms competing methods.
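
The abstract only describes the model at a high level, so the sketch below is an illustrative reconstruction rather than the authors' implementation: a two-branch network with assumed layer sizes and embedding dimension that projects precomputed audio and visual features into a shared subspace, trained with an assumed combination of a bidirectional inter-modal triplet loss and an intra-modal invariance term. The top-k feature selection step and the actual loss weighting are not specified in the abstract and are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingBranch(nn.Module):
    """One branch: projects one modality's features into the shared subspace.
    Layer sizes are illustrative assumptions, not taken from the paper."""
    def __init__(self, in_dim, embed_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, embed_dim),
        )

    def forward(self, x):
        # L2-normalize so cosine similarity reduces to a dot product
        return F.normalize(self.net(x), dim=-1)

class TwoBranchNet(nn.Module):
    """Two-branch network mapping audio and visual features into a shared space."""
    def __init__(self, audio_dim, visual_dim, embed_dim=512):
        super().__init__()
        self.audio_branch = EmbeddingBranch(audio_dim, embed_dim)
        self.visual_branch = EmbeddingBranch(visual_dim, embed_dim)

    def forward(self, audio_feats, visual_feats):
        return self.audio_branch(audio_feats), self.visual_branch(visual_feats)

def inter_modal_triplet(a, v, margin=0.2):
    """Bidirectional ranking loss (assumed form): matched audio-video pairs
    should score higher than the hardest mismatched pair in the batch."""
    sim = a @ v.t()                          # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)            # matched pairs on the diagonal
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg_inf = float("-inf")
    neg_a2v = sim.masked_fill(mask, neg_inf).max(dim=1, keepdim=True).values      # audio -> video
    neg_v2a = sim.masked_fill(mask, neg_inf).max(dim=0, keepdim=True).values.t()  # video -> audio
    return (F.relu(margin - pos + neg_a2v) + F.relu(margin - pos + neg_v2a)).mean()

def intra_modal_invariance(emb, emb_aug):
    """Assumed intra-modal term: embeddings of two views/clips of the same item
    within one modality should stay close (1 - cosine similarity)."""
    return (1.0 - (emb * emb_aug).sum(dim=-1)).mean()

# Assumed combined objective (loss weighting lambda_intra is hypothetical):
# a_emb, v_emb = model(audio_feats, visual_feats)
# loss = inter_modal_triplet(a_emb, v_emb) + lambda_intra * (
#     intra_modal_invariance(a_emb, a_emb_aug) + intra_modal_invariance(v_emb, v_emb_aug))
```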

Cite

APA

Jin, C., Zhang, T., Liu, S., Tie, Y., Lv, X., Li, J., … Yang, Z. (2021). Cross-modal Deep Learning Applications: Audio-Visual Retrieval. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 12666 LNCS, pp. 301–313). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-68780-9_26
