Cascaded mutual modulation for visual reasoning

Citations of this article: 14
Mendeley readers: 107

Abstract

Visual reasoning is a special visual question answering problem that is multi-step and compositional by nature, and also requires intensive text-vision interactions. We propose CMM: Cascaded Mutual Modulation as a novel end-to-end visual reasoning model. CMM includes a multi-step comprehension process for both the question and the image. In each step, we use a Feature-wise Linear Modulation (FiLM) technique to enable the textual and visual pipelines to mutually control each other. Experiments show that CMM significantly outperforms most related models and reaches state-of-the-art results on two visual reasoning benchmarks, CLEVR and NLVR, collected from synthetic and natural languages respectively. Ablation studies confirm that both our multi-step framework and our vision-guided language modulation are critical to the task. Our code is available at https://github.com/FlamingHorizon/CMM-VR.
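The mutual modulation described in the abstract can be made concrete with a short sketch. The snippet below is a minimal illustration, not the authors' released implementation (see the linked repository for that): it uses the standard FiLM affine modulation (gamma · x + beta) in both directions within one step of a cascade. Layer sizes, the convolutional block, the GRU-style question encoding assumed for the input, and the use of a pooled visual summary for the vision-to-language direction are illustrative assumptions.

```python
# A minimal sketch of one cascaded mutual-modulation step, assuming the standard
# FiLM formulation (gamma * x + beta). Module names and dimensions are
# illustrative, not taken from the authors' code.
import torch
import torch.nn as nn


class FiLM(nn.Module):
    """Predict per-channel (gamma, beta) from a conditioning vector and apply
    the affine modulation gamma * features + beta."""

    def __init__(self, cond_dim: int, num_channels: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, features: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        # Broadcast over the spatial dimensions of a (B, C, H, W) feature map.
        while gamma.dim() < features.dim():
            gamma = gamma.unsqueeze(-1)
            beta = beta.unsqueeze(-1)
        return gamma * features + beta


class MutualModulationStep(nn.Module):
    """One step of the cascade: the question state modulates the visual
    features (language -> vision), then a pooled summary of the modulated
    visual features modulates the question state (vision -> language)."""

    def __init__(self, q_dim: int = 512, v_channels: int = 128):
        super().__init__()
        self.vis_film = FiLM(cond_dim=q_dim, num_channels=v_channels)
        self.txt_film = FiLM(cond_dim=v_channels, num_channels=q_dim)
        self.vis_conv = nn.Conv2d(v_channels, v_channels, kernel_size=3, padding=1)

    def forward(self, q_state: torch.Tensor, v_feat: torch.Tensor):
        # Language-guided visual modulation.
        v_mod = torch.relu(self.vis_conv(self.vis_film(v_feat, q_state)))
        # Vision-guided language modulation, conditioned on a global visual summary.
        v_summary = v_mod.mean(dim=(2, 3))
        q_next = self.txt_film(q_state, v_summary)
        return q_next, v_mod


if __name__ == "__main__":
    q = torch.randn(4, 512)            # question encoding (e.g. from an RNN)
    v = torch.randn(4, 128, 14, 14)    # CNN feature map of the image
    steps = nn.ModuleList([MutualModulationStep() for _ in range(3)])  # cascade depth 3
    for step in steps:
        q, v = step(q, v)
    print(q.shape, v.shape)            # torch.Size([4, 512]) torch.Size([4, 128, 14, 14])
```

Stacking several such steps gives the cascade: each later step reasons over question and image representations that have already been refined by the other modality.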


Cite (APA)

Yao, Y., Xu, J., Wang, F., & Xu, B. (2018). Cascaded mutual modulation for visual reasoning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018) (pp. 975–980). Association for Computational Linguistics. https://doi.org/10.18653/v1/d18-1118

Readers' Seniority

PhD / Post grad / Masters / Doc: 36 (73%)
Researcher: 8 (16%)
Professor / Associate Prof.: 3 (6%)
Lecturer / Post doc: 2 (4%)

Readers' Discipline

Computer Science: 46 (82%)
Linguistics: 5 (9%)
Engineering: 3 (5%)
Social Sciences: 2 (4%)
