Cascaded mutual modulation for visual reasoning

14Citations
Citations of this article
108Readers
Mendeley users who have this article in their library.

Abstract

Visual reasoning is a special visual question answering problem that is multi-step and compositional by nature, and also requires intensive text-vision interactions. We propose CMM: Cascaded Mutual Modulation as a novel end-to-end visual reasoning model. CMM includes a multi-step comprehension process for both question and image. In each step, we use a Feature-wise Linear Modulation (FiLM) technique to enable textual/visual pipeline to mutually control each other. Experiments show that CMM significantly outperforms most related models, and reach state-of-the-arts on two visual reasoning benchmarks: CLEVR and NLVR, collected from both synthetic and natural languages. Ablation studies confirm that both our multi-step framework and our visual-guided language modulation are critical to the task. Our code is available at https://github.com/FlamingHorizon/CMM-VR.

Cite

CITATION STYLE

APA

Yao, Y., Xu, J., Wang, F., & Xu, B. (2018). Cascaded mutual modulation for visual reasoning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018 (pp. 975–980). Association for Computational Linguistics. https://doi.org/10.18653/v1/d18-1118

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free