Inspired by the human ability to recognize the relations between visual scenes and sounds, many cross-modal learning methods have been developed for modeling images or videos together with their associated sounds. In this work, for the first time, a Look, Listen and Infer Network (LLINet) is proposed to learn a zero-shot model that can infer the relations between visual scenes and sounds from novel categories never seen before. LLINet is designed for two tasks: image-audio cross-modal retrieval and sound localization in images. To this end, it is built as a two-branch encoding network that maps images and audio into a common embedding space. In addition, a cross-modal attention mechanism is introduced in LLINet to localize sounding objects. To evaluate LLINet, a new dataset, named INSTRUMENT-32CLASS, is collected in this work. Besides zero-shot cross-modal retrieval and sound localization, a zero-shot image recognition task based on sounds is also conducted on this dataset. Experimental results on all three tasks demonstrate the effectiveness of LLINet, indicating that zero-shot learning for visual scenes and sounds is feasible. The project page for LLINet is available at https://llinet.github.io/.
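The abstract describes a two-branch encoder with a shared image-audio embedding space and a cross-modal attention mechanism for localization. The sketch below is a minimal illustration of that general design, assuming PyTorch, toy feature dimensions, and a cosine-similarity attention formulation; the module names, dimensions, and attention details are assumptions for illustration, not the authors' actual implementation.

# Minimal sketch of a two-branch image/audio embedding network with
# cross-modal attention (illustrative assumptions, not the LLINet code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoBranchNet(nn.Module):
    def __init__(self, img_channels=512, audio_dim=128, embed_dim=256):
        super().__init__()
        # Image branch: project a spatial CNN feature map into the common space.
        self.img_proj = nn.Conv2d(img_channels, embed_dim, kernel_size=1)
        # Audio branch: project a pooled audio descriptor into the same space.
        self.audio_proj = nn.Sequential(
            nn.Linear(audio_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, img_feats, audio_feats):
        # img_feats:   (B, C, H, W) spatial features from a visual backbone
        # audio_feats: (B, D) descriptor from an audio backbone
        v = self.img_proj(img_feats)                        # (B, E, H, W)
        a = F.normalize(self.audio_proj(audio_feats), dim=-1)  # (B, E)

        # Cross-modal attention: similarity between the audio embedding and
        # every spatial location yields a sound-localization map.
        v_norm = F.normalize(v, dim=1)                      # (B, E, H, W)
        attn = torch.einsum("be,behw->bhw", a, v_norm)      # (B, H, W)
        attn = torch.softmax(attn.flatten(1), dim=1).view_as(attn)

        # Attention-pooled image embedding lives in the common space, so
        # retrieval can rank image/audio pairs by cosine similarity.
        img_emb = torch.einsum("bhw,behw->be", attn, v)     # (B, E)
        img_emb = F.normalize(img_emb, dim=-1)
        return img_emb, a, attn


if __name__ == "__main__":
    net = TwoBranchNet()
    imgs = torch.randn(4, 512, 14, 14)   # e.g. conv features from a CNN backbone
    audio = torch.randn(4, 128)          # e.g. a pooled spectrogram embedding
    img_emb, aud_emb, loc_map = net(imgs, audio)
    sims = img_emb @ aud_emb.t()         # (4, 4) image-audio similarity matrix
    print(sims.shape, loc_map.shape)

In such a sketch, the similarity matrix supports cross-modal retrieval while the attention map serves as the sound-localization output; zero-shot transfer would rely on the common space generalizing to categories unseen during training.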
Jia, R., Wang, X., Pang, S., Zhu, J., & Xue, J. (2020). Look, Listen and Infer. In MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia (pp. 3911–3919). Association for Computing Machinery, Inc. https://doi.org/10.1145/3394171.3414023