Cross-modal Deep Learning Applications: Audio-Visual Retrieval

Abstract

Recently, deep neural networks have proven to be a powerful architecture for capturing the nonlinear distributions of high-dimensional multimedia data such as images, video, text, and audio, and this capability extends naturally to multi-modal data. The question of how to make full use of such multimedia data leads to an important research direction: cross-modal learning. In this paper, we introduce a content-based method for the audio and video modalities, implemented with a novel two-branch neural network that learns joint embeddings in a shared subspace for computing the similarity between the two modalities. In particular, the contributions of the proposed method are threefold: i) a feature selection model is used to choose the top-k audio and visual feature representations; ii) a novel combination of training losses enforcing inter-modal similarity and intra-modal invariance is used; iii) because no paired video-music dataset is available, we construct a dataset of video-music pairs from the YouTube-8M and MER31K datasets. Experiments show that the proposed model outperforms competing methods.
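
The abstract only describes the model at a high level, so the sketch below is an illustrative reconstruction rather than the authors' implementation: a two-branch network with assumed layer sizes and embedding dimension that projects precomputed audio and visual features into a shared subspace, trained with an assumed combination of a bidirectional inter-modal triplet loss and an intra-modal invariance term. The top-k feature selection step and the actual loss weighting are not specified in the abstract and are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingBranch(nn.Module):
    """One branch: projects one modality's features into the shared subspace.
    Layer sizes are illustrative assumptions, not taken from the paper."""
    def __init__(self, in_dim, embed_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, embed_dim),
        )

    def forward(self, x):
        # L2-normalize so cosine similarity reduces to a dot product
        return F.normalize(self.net(x), dim=-1)

class TwoBranchNet(nn.Module):
    """Two-branch network mapping audio and visual features into a shared space."""
    def __init__(self, audio_dim, visual_dim, embed_dim=512):
        super().__init__()
        self.audio_branch = EmbeddingBranch(audio_dim, embed_dim)
        self.visual_branch = EmbeddingBranch(visual_dim, embed_dim)

    def forward(self, audio_feats, visual_feats):
        return self.audio_branch(audio_feats), self.visual_branch(visual_feats)

def inter_modal_triplet(a, v, margin=0.2):
    """Bidirectional ranking loss (assumed form): matched audio-video pairs
    should score higher than the hardest mismatched pair in the batch."""
    sim = a @ v.t()                          # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)            # matched pairs on the diagonal
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg_inf = float("-inf")
    neg_a2v = sim.masked_fill(mask, neg_inf).max(dim=1, keepdim=True).values      # audio -> video
    neg_v2a = sim.masked_fill(mask, neg_inf).max(dim=0, keepdim=True).values.t()  # video -> audio
    return (F.relu(margin - pos + neg_a2v) + F.relu(margin - pos + neg_v2a)).mean()

def intra_modal_invariance(emb, emb_aug):
    """Assumed intra-modal term: embeddings of two views/clips of the same item
    within one modality should stay close (1 - cosine similarity)."""
    return (1.0 - (emb * emb_aug).sum(dim=-1)).mean()

# Assumed combined objective (loss weighting lambda_intra is hypothetical):
# a_emb, v_emb = model(audio_feats, visual_feats)
# loss = inter_modal_triplet(a_emb, v_emb) + lambda_intra * (
#     intra_modal_invariance(a_emb, a_emb_aug) + intra_modal_invariance(v_emb, v_emb_aug))
```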

Cite

APA

Jin, C., Zhang, T., Liu, S., Tie, Y., Lv, X., Li, J., … Yang, Z. (2021). Cross-modal Deep Learning Applications: Audio-Visual Retrieval. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 12666 LNCS, pp. 301–313). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-68780-9_26
