In this work, we explore the multi-modal information provided by the YouTube-8M dataset by projecting the audio and visual features into a common feature space, to obtain joint audio-visual embeddings. These joint embeddings are used to retrieve audio samples that match a given silent video, and also to retrieve images that match a given query audio. The results in terms of Recall@K obtained over a subset of YouTube-8M videos show the potential of this unsupervised approach for cross-modal feature learning.
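The retrieval scheme the abstract describes can be illustrated with a minimal sketch: project each modality into a shared space, L2-normalise, rank candidates by cosine similarity, and score with Recall@K. The random projection matrices below are stand-ins for the learned projections (in the paper they are trained so that matching audio-visual pairs end up close together); dimensions and all variable names are illustrative assumptions, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy features: 8 videos, each with a visual and an audio descriptor.
# Dimensions are illustrative, not those used in the paper.
n, d_vis, d_aud, d_emb = 8, 32, 16, 10
visual = rng.normal(size=(n, d_vis))
audio = rng.normal(size=(n, d_aud))

# Stand-ins for the learned projections into the shared embedding space.
W_vis = rng.normal(size=(d_vis, d_emb))
W_aud = rng.normal(size=(d_aud, d_emb))

def embed(x, W):
    """Project features and L2-normalise so dot products are cosine similarities."""
    z = x @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

z_vis = embed(visual, W_vis)
z_aud = embed(audio, W_aud)

def recall_at_k(queries, gallery, k):
    """Fraction of queries whose true match (same index) appears in the top-k."""
    sims = queries @ gallery.T                  # cosine similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k]    # indices of the k best matches
    return float(np.mean([i in topk[i] for i in range(len(queries))]))

# Video -> audio retrieval evaluated with Recall@K (audio -> video is symmetric).
r_at_1 = recall_at_k(z_vis, z_aud, k=1)
r_at_5 = recall_at_k(z_vis, z_aud, k=5)
print(f"Recall@1 = {r_at_1:.2f}, Recall@5 = {r_at_5:.2f}")
```

With untrained random projections the recall values are near chance; training the two projections on paired data is what makes cross-modal retrieval work.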
Citation:
Surís, D., Duarte, A., Salvador, A., Torres, J., & Giró-i-Nieto, X. (2019). Cross-modal embeddings for video and audio retrieval. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11132 LNCS, pp. 711–716). Springer Verlag. https://doi.org/10.1007/978-3-030-11018-5_62