Suggesting sounds for images from video collections

Abstract

Given a still image, humans can easily think of a sound associated with it. For instance, people might associate a picture of a car with the sound of its engine. In this paper we aim to retrieve sounds corresponding to a query image. To solve this challenging task, our approach exploits the correlation between the audio and visual modalities in video collections. A major difficulty is the large amount of uncorrelated audio in the videos, i.e., audio that does not correspond to the main image content, such as voice-over, background music, added sound effects, or sounds originating off-screen. We present an unsupervised, clustering-based solution that automatically separates correlated sounds from uncorrelated ones. The core algorithm operates in a joint audio-visual feature space, in which we perform iterated mutual kNN clustering to effectively filter out uncorrelated sounds. For evaluation, we also introduce a new dataset of correlated audio-visual data, on which we assess our approach and compare it to alternative solutions. Experiments show that our approach successfully handles a large amount of uncorrelated audio.
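The abstract's central filtering step, iterated mutual kNN clustering in a joint audio-visual feature space, can be sketched as follows. This is a minimal illustration of the general technique rather than the authors' implementation: the feature array, the neighborhood size k, the iteration count, and the rule of keeping only samples with at least one mutual neighbor are all assumptions made for exposition.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mutual_knn_filter(features, k=10, n_iters=3):
    """Iteratively keep only samples that appear in each other's
    k-nearest-neighbor lists (mutual kNN), discarding outliers.

    `features` is an (n_samples, d) array of joint audio-visual
    embeddings; `k` and `n_iters` are illustrative defaults, not
    values taken from the paper. Returns indices of kept samples.
    """
    keep = np.arange(len(features))
    for _ in range(n_iters):
        X = features[keep]
        if len(X) <= k:
            break
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
        # kneighbors on the training set returns each point as its own
        # first neighbor (distance 0), so drop column 0.
        neighbors = nn.kneighbors(X, return_distance=False)[:, 1:]
        # Build a directed kNN adjacency matrix; a pair (i, j) is
        # "mutual" when i is in j's kNN list and vice versa.
        is_neighbor = np.zeros((len(X), len(X)), dtype=bool)
        rows = np.repeat(np.arange(len(X)), k)
        is_neighbor[rows, neighbors.ravel()] = True
        mutual = is_neighbor & is_neighbor.T
        # Keep samples that have at least one mutual neighbor.
        mask = mutual.any(axis=1)
        if mask.all():
            break  # converged: nothing left to discard
        keep = keep[mask]
    return keep
```

In this sketch, video segments whose audio does not match their visual content tend to land far from any mutual neighbor in the joint space and are progressively discarded over the iterations, so the surviving indices point at audio-visual pairs judged correlated.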

Citation (APA)

Solèr, M., Bazin, J. C., Wang, O., Krause, A., & Sorkine-Hornung, A. (2016). Suggesting sounds for images from video collections. In Lecture Notes in Computer Science (Vol. 9914 LNCS, pp. 900–917). Springer. https://doi.org/10.1007/978-3-319-48881-3_59
