Cascaded mutual modulation for visual reasoning

Yiqun Yao; Jiaming Xu; Feng Wang; Bo Xu

Conference Proceedings, Open Access

Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018 (2018) 975-980

DOI: 10.18653/v1/d18-1118

14 Citations

107 Readers

Abstract

Visual reasoning is a special visual question answering problem that is multi-step and compositional by nature and requires intensive text-vision interaction. We propose CMM (Cascaded Mutual Modulation), a novel end-to-end visual reasoning model. CMM includes a multi-step comprehension process for both the question and the image. In each step, we use Feature-wise Linear Modulation (FiLM) to let the textual and visual pipelines control each other. Experiments show that CMM significantly outperforms most related models and reaches state-of-the-art results on two visual reasoning benchmarks, CLEVR and NLVR, which are built from synthetic and natural language, respectively. Ablation studies confirm that both the multi-step framework and the visual-guided language modulation are critical to the task. Our code is available at https://github.com/FlamingHorizon/CMM-VR.
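The abstract's idea of cascaded mutual modulation can be illustrated with a minimal sketch. This is not the authors' implementation (see their repository for that). It assumes that the question and the image have each been reduced to a single feature vector, that the text modulates the image first in every step, and that the FiLM parameters come from plain affine maps. The names `linear`, `film`, `mutual_modulation_step`, `cascade`, and the parameter keys are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[Vector]


def linear(weights: Matrix, bias: Vector, x: Vector) -> Vector:
    """Affine map W x + b written with plain lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(features: Vector, gamma: Vector, beta: Vector) -> Vector:
    """Feature-wise linear modulation: scale and shift each feature."""
    return [g * f + b for f, g, b in zip(features, gamma, beta)]


def mutual_modulation_step(text: Vector, image: Vector,
                           params: Dict[str, list]) -> Tuple[Vector, Vector]:
    """One step: text modulates image, then image modulates text.

    The ordering and the pooled-vector inputs are assumptions of this sketch.
    """
    # Text predicts (gamma, beta) for the visual features.
    gamma_v = linear(params["Wg_v"], params["bg_v"], text)
    beta_v = linear(params["Wb_v"], params["bb_v"], text)
    new_image = film(image, gamma_v, beta_v)

    # The modulated visual features predict (gamma, beta) for the text.
    gamma_t = linear(params["Wg_t"], params["bg_t"], new_image)
    beta_t = linear(params["Wb_t"], params["bb_t"], new_image)
    new_text = film(text, gamma_t, beta_t)
    return new_text, new_image


def cascade(text: Vector, image: Vector,
            steps: List[Dict[str, list]]) -> Tuple[Vector, Vector]:
    """Apply several mutual modulation steps in sequence."""
    for params in steps:
        text, image = mutual_modulation_step(text, image, params)
    return text, image


def identity_params(dim: int) -> Dict[str, list]:
    """Parameters that yield gamma = 1 and beta = 0 (modulation is a no-op)."""
    zeros = [[0.0] * dim for _ in range(dim)]
    return {
        "Wg_v": zeros, "bg_v": [1.0] * dim,
        "Wb_v": zeros, "bb_v": [0.0] * dim,
        "Wg_t": zeros, "bg_t": [1.0] * dim,
        "Wb_t": zeros, "bb_t": [0.0] * dim,
    }


if __name__ == "__main__":
    text, image = cascade([0.5, -1.0], [2.0, 3.0], [identity_params(2)] * 3)
    print(text, image)
```

With `identity_params`, every step predicts a scale of 1 and a shift of 0, so the cascade leaves both vectors unchanged. A trained model would learn non-trivial weights so that each modality reshapes the other's features at every step.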

Cite (APA)

Yao, Y., Xu, J., Wang, F., & Xu, B. (2018). Cascaded mutual modulation for visual reasoning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018 (pp. 975–980). Association for Computational Linguistics. https://doi.org/10.18653/v1/d18-1118

Readers' Seniority

PhD / Post grad / Masters / Doc: 36 (73%)
Researcher: 8 (16%)
Professor / Associate Prof.: 3 (6%)
Lecturer / Post doc: 2 (4%)

Readers' Discipline

Computer Science: 46 (82%)
Linguistics: 5 (9%)
Engineering: 3 (5%)
Social Sciences: 2 (4%)
