Look, Listen and Infer

2 citations · 9 Mendeley readers
Abstract

Inspired by the human ability to recognize the relations between visual scenes and sounds, many cross-modal learning methods have been developed to model images or videos together with their associated sounds. In this work, a Look, Listen and Infer Network (LLINet) is proposed, for the first time, to learn a zero-shot model that can infer the relations between visual scenes and sounds from novel categories that have never appeared during training. LLINet is designed for two tasks, i.e., image-audio cross-modal retrieval and sound localization within an image. To this end, it is built as a two-branch encoding network that constructs a common embedding space for images and audio. In addition, a cross-modal attention mechanism is proposed in LLINet to localize sounding objects. To evaluate LLINet, a new dataset, named INSTRUMENT-32CLASS, is collected in this work. Besides zero-shot cross-modal retrieval and sound localization, a zero-shot image recognition task based on sounds is also conducted on this dataset. Experimental results on all these tasks demonstrate the effectiveness of LLINet, indicating that zero-shot learning for visual scenes and sounds is feasible. The project page for LLINet is available at https://llinet.github.io/.
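The abstract only names the key components (a two-branch encoder with a shared image-audio space and a cross-modal attention mechanism) without implementation details. The PyTorch sketch below illustrates one plausible wiring of such a model, assuming a pretrained CNN image backbone and an audio feature extractor upstream; all module names, dimensions, and the dot-product attention form are illustrative assumptions, not the published LLINet implementation.

# Hypothetical sketch of a two-branch image/audio embedding network with
# cross-modal attention; module names and dimensions are assumptions,
# not the published LLINet code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoBranchEncoder(nn.Module):
    """Projects image feature maps and audio features into a shared space."""

    def __init__(self, img_dim=2048, aud_dim=128, embed_dim=512):
        super().__init__()
        # Per-location projection of CNN feature maps into the common space.
        self.img_proj = nn.Conv2d(img_dim, embed_dim, kernel_size=1)
        # Small MLP projecting audio features into the same space.
        self.aud_proj = nn.Sequential(
            nn.Linear(aud_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, img_feats, aud_feats):
        # img_feats: (B, img_dim, H, W) from a pretrained image backbone
        # aud_feats: (B, aud_dim) from a pretrained audio network
        v = self.img_proj(img_feats)                        # (B, D, H, W)
        a = F.normalize(self.aud_proj(aud_feats), dim=-1)   # (B, D)

        # Cross-modal attention: similarity between the audio embedding and
        # each spatial location yields a heat map usable for sound localization.
        v_flat = F.normalize(v.flatten(2), dim=1)           # (B, D, H*W)
        attn = torch.einsum("bd,bdn->bn", a, v_flat)        # (B, H*W)
        attn = torch.softmax(attn, dim=-1)

        # Attention-pooled image embedding for retrieval in the common space.
        img_embed = torch.einsum("bn,bdn->bd", attn, v_flat)
        img_embed = F.normalize(img_embed, dim=-1)

        B, _, H, W = v.shape
        return img_embed, a, attn.view(B, H, W)


# Usage: retrieval scores are cosine similarities in the shared space, and the
# attention map doubles as a coarse localization of the sounding object.
model = TwoBranchEncoder()
img = torch.randn(4, 2048, 7, 7)
aud = torch.randn(4, 128)
img_e, aud_e, heatmap = model(img, aud)
scores = img_e @ aud_e.t()   # (4, 4) image-audio similarity matrix

Zero-shot inference under this sketch would rely on the learned common space generalizing to unseen categories: a test sound is embedded and matched against image embeddings (or class prototypes) by cosine similarity, with no category-specific parameters.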

Citation (APA)

Jia, R., Wang, X., Pang, S., Zhu, J., & Xue, J. (2020). Look, Listen and Infer. In MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia (pp. 3911–3919). Association for Computing Machinery, Inc. https://doi.org/10.1145/3394171.3414023
