NMN-VD: A neural module network for visual dialog


Abstract

Visual dialog demonstrates several important aspects of multimodal artificial intelligence; however, it is hindered by visual grounding and visual coreference resolution problems. To overcome these problems, we propose the novel neural module network for visual dialog (NMN-VD). NMN-VD is an efficient question-customized modular network model that combines only the modules required for deciding answers after analyzing input questions. In particular, the model includes a Refer module that effectively finds the visual area indicated by a pronoun using a reference pool to solve a visual coreference resolution problem, which is an important challenge in visual dialog. In addition, the proposed NMN-VD model includes a method for distinguishing and handling impersonal pronouns that do not require visual coreference resolution from general pronouns. Furthermore, a new Compare module that effectively handles comparison questions found in visual dialogs is included in the model, as well as a Find module that applies a triple-attention mechanism to solve visual grounding problems between the question and the image. The results of various experiments conducted on a large-scale benchmark dataset verify the efficacy and high performance of our proposed NMN-VD model.
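To illustrate the question-customized composition idea described above, the following is a minimal, purely hypothetical sketch: the module names (Find, Refer, Compare) follow the abstract, but every implementation detail here, including the keyword rules and data shapes, is an illustrative assumption and not the paper's actual method.

```python
# Hypothetical sketch of a question-customized modular pipeline in the
# spirit of NMN-VD. Module names follow the abstract; all logic below is
# an illustrative placeholder, not the paper's implementation.

def find(question, image_regions):
    # "Visual grounding" stand-in: pick the region whose tags overlap
    # the question words the most (the paper uses triple attention).
    words = set(question.lower().split())
    return max(image_regions, key=lambda r: len(words & set(r["tags"])))

def refer(reference_pool):
    # "Coreference resolution" stand-in: resolve a pronoun to the most
    # recently grounded entity in the reference pool, if any.
    return reference_pool[-1] if reference_pool else None

def compare(region_a, region_b, attribute):
    # Toy comparison between two grounded regions on one attribute.
    return region_a.get(attribute, 0) > region_b.get(attribute, 0)

def assemble(question):
    # Toy question analysis: decide which modules the answer needs.
    q = question.lower()
    modules = []
    if any(p in q.split() for p in ("it", "they", "them")):
        modules.append("refer")      # pronoun -> coreference needed
    if " or " in q or "bigger" in q or "more" in q:
        modules.append("compare")    # comparative form
    modules.append("find")           # grounding is always needed
    return modules
```

For example, `assemble("is it red?")` would select the Refer module before Find, while a non-pronoun question would skip it; the real model instead predicts the module layout from learned question features.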

Citation (APA)

Cho, Y., & Kim, I. (2021). NMN-VD: A neural module network for visual dialog. Sensors (Switzerland), 21(3), 1–18. https://doi.org/10.3390/s21030931
