Multimodal attention agents in visual conversation

Abstract

Visual conversation has recently emerged as a research area in the visually grounded language modeling domain. It requires an intelligent agent to maintain a natural language conversation with humans about visual content. Its main difference from traditional visual question answering is that the agent must infer the answer not only by grounding the question in the image, but also from the context of the conversation history. In this paper, we propose a novel multimodal attention architecture that enables the conversation agent to focus on parts of the conversation history and on specific image regions in order to infer the answer from the conversation context. We evaluate our model on the VisDial dataset and demonstrate that it performs better than the current state of the art.
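
The abstract describes attention over two modalities: the conversation history and image regions, both conditioned on the current question. The paper's exact architecture is not reproduced here; the following is a minimal PyTorch sketch of that general idea, assuming a question encoding, per-turn history encodings, and CNN region features as inputs (all dimensions and layer choices below are illustrative assumptions, not the authors' implementation).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultimodalAttention(nn.Module):
        """Attend over dialogue-history encodings and image-region features,
        conditioned on the current question encoding, then fuse the results.
        A hedged sketch of question-conditioned multimodal attention, not the
        paper's exact model."""

        def __init__(self, q_dim, h_dim, v_dim, att_dim, out_dim):
            super().__init__()
            # History attention: score each past utterance against the question
            self.h_proj = nn.Linear(h_dim, att_dim)
            self.q_proj_h = nn.Linear(q_dim, att_dim)
            self.h_score = nn.Linear(att_dim, 1)
            # Image attention: score each region against the question
            self.v_proj = nn.Linear(v_dim, att_dim)
            self.q_proj_v = nn.Linear(q_dim, att_dim)
            self.v_score = nn.Linear(att_dim, 1)
            # Fusion of question, attended history, and attended image features
            self.fuse = nn.Linear(q_dim + h_dim + v_dim, out_dim)

        def forward(self, q, history, regions):
            # q:       (batch, q_dim)            current question encoding
            # history: (batch, n_turns, h_dim)   encodings of previous QA rounds
            # regions: (batch, n_regions, v_dim) CNN features of image regions

            # Attention over the conversation history
            h_logits = self.h_score(torch.tanh(
                self.h_proj(history) + self.q_proj_h(q).unsqueeze(1))).squeeze(-1)
            h_weights = F.softmax(h_logits, dim=1)                  # (batch, n_turns)
            h_att = (h_weights.unsqueeze(-1) * history).sum(dim=1)  # (batch, h_dim)

            # Attention over image regions
            v_logits = self.v_score(torch.tanh(
                self.v_proj(regions) + self.q_proj_v(q).unsqueeze(1))).squeeze(-1)
            v_weights = F.softmax(v_logits, dim=1)                  # (batch, n_regions)
            v_att = (v_weights.unsqueeze(-1) * regions).sum(dim=1)  # (batch, v_dim)

            # Fuse into a context vector that a decoder could use to rank answers
            return torch.tanh(self.fuse(torch.cat([q, h_att, v_att], dim=-1)))

    # Example usage with arbitrary dimensions (2 dialogues, 10 history turns, 36 regions)
    model = MultimodalAttention(q_dim=512, h_dim=512, v_dim=2048, att_dim=256, out_dim=512)
    ctx = model(torch.randn(2, 512), torch.randn(2, 10, 512), torch.randn(2, 36, 2048))
    print(ctx.shape)  # torch.Size([2, 512])

The design point this sketch illustrates is that both attention distributions are conditioned on the same question encoding, so the model can select the history turns and image regions relevant to the current question before fusing them into a single context representation.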

Citation (APA)

Kodra, L., & Meçe, E. K. (2018). Multimodal attention agents in visual conversation. In Lecture Notes on Data Engineering and Communications Technologies (Vol. 17, pp. 584–596). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-319-75928-9_52
