In this work, we explore the multi-modal information provided by the YouTube-8M dataset by projecting the audio and visual features into a common feature space, to obtain joint audio-visual embeddings. These joint embeddings are used to retrieve audio samples that match a given silent video, and also to retrieve images that match a given query audio. The results in terms of Recall@K obtained over a subset of YouTube-8M videos show the potential of this unsupervised approach for cross-modal feature learning.
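The retrieval scheme the abstract describes can be illustrated with a minimal sketch: project each modality into a shared space, L2-normalise, rank candidates by cosine similarity, and score with Recall@K. The random projection matrices below are stand-ins for the learned projections (in the paper they are trained so that matching audio-visual pairs end up close together); dimensions and all variable names are illustrative assumptions, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy features: 8 videos, each with a visual and an audio descriptor.
# Dimensions are illustrative, not those used in the paper.
n, d_vis, d_aud, d_emb = 8, 32, 16, 10
visual = rng.normal(size=(n, d_vis))
audio = rng.normal(size=(n, d_aud))

# Stand-ins for the learned projections into the shared embedding space.
W_vis = rng.normal(size=(d_vis, d_emb))
W_aud = rng.normal(size=(d_aud, d_emb))

def embed(x, W):
    """Project features and L2-normalise so dot products are cosine similarities."""
    z = x @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

z_vis = embed(visual, W_vis)
z_aud = embed(audio, W_aud)

def recall_at_k(queries, gallery, k):
    """Fraction of queries whose true match (same index) appears in the top-k."""
    sims = queries @ gallery.T                  # cosine similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k]    # indices of the k best matches
    return float(np.mean([i in topk[i] for i in range(len(queries))]))

# Video -> audio retrieval evaluated with Recall@K (audio -> video is symmetric).
r_at_1 = recall_at_k(z_vis, z_aud, k=1)
r_at_5 = recall_at_k(z_vis, z_aud, k=5)
print(f"Recall@1 = {r_at_1:.2f}, Recall@5 = {r_at_5:.2f}")
```

With untrained random projections the recall values are near chance; training the two projections on paired data is what makes cross-modal retrieval work.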
Citation:
Surís, D., Duarte, A., Salvador, A., Torres, J., & Giró-i-Nieto, X. (2019). Cross-modal embeddings for video and audio retrieval. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11132 LNCS, pp. 711–716). Springer Verlag. https://doi.org/10.1007/978-3-030-11018-5_62