Cross-modal embeddings for video and audio retrieval

Abstract

In this work, we explore the multi-modal information provided by the YouTube-8M dataset by projecting the audio and visual features into a common feature space to obtain joint audio-visual embeddings. These embeddings are used to retrieve audio samples that fit a given silent video well, and also to retrieve images that match a given query audio. The results in terms of Recall@K obtained over a subset of YouTube-8M videos show the potential of this unsupervised approach for cross-modal feature learning.
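The retrieval setup described above can be sketched as follows. This is an illustrative outline, not the authors' implementation: the feature dimensions, the linear projections, and the random stand-in data are all assumptions, with only the shared-space projection, cosine-similarity ranking, and Recall@K evaluation taken from the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in features; the paper uses precomputed YouTube-8M visual and
# audio features, so these dimensions and random values are illustrative.
n, d_vis, d_aud, d_emb = 100, 1024, 128, 256
visual = rng.standard_normal((n, d_vis))
audio = rng.standard_normal((n, d_aud))

# Hypothetical learned projections into the joint embedding space
# (in practice these would be trained, e.g. with a ranking loss).
W_vis = rng.standard_normal((d_vis, d_emb)) * 0.01
W_aud = rng.standard_normal((d_aud, d_emb)) * 0.01

def embed(x, W):
    """Project features into the joint space and L2-normalize them."""
    z = x @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

z_vis = embed(visual, W_vis)
z_aud = embed(audio, W_aud)

def recall_at_k(queries, gallery, k):
    """Fraction of queries whose true match (same index in the gallery)
    appears among the k nearest neighbors by cosine similarity."""
    sims = queries @ gallery.T          # dot product of unit vectors = cosine
    topk = np.argsort(-sims, axis=1)[:, :k]
    return float(np.mean([i in topk[i] for i in range(len(queries))]))

# Both retrieval directions reported in the paper:
r_v2a = recall_at_k(z_vis, z_aud, k=10)   # silent video -> audio
r_a2v = recall_at_k(z_aud, z_vis, k=10)   # query audio -> video
```

With trained projections, matching audio-visual pairs land close together in the shared space, so Recall@K rises well above the chance level (K/n for untrained random projections like these).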


Citation (APA)

Surís, D., Duarte, A., Salvador, A., Torres, J., & Giró-i-Nieto, X. (2019). Cross-modal embeddings for video and audio retrieval. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11132 LNCS, pp. 711–716). Springer Verlag. https://doi.org/10.1007/978-3-030-11018-5_62
