Visual conversation has recently emerged as a research area within visually grounded language modeling. It requires an intelligent agent to maintain a natural-language conversation with humans about visual content. Its main difference from traditional visual question answering is that the agent must infer the answer not only by grounding the question in the image, but also from the context of the conversation history. In this paper we propose a novel multimodal attention architecture that enables the conversation agent to focus on relevant parts of the conversation history and on specific image regions when inferring the answer. We evaluate our model on the VisDial dataset and demonstrate that it performs better than the current state of the art.
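The abstract describes the architecture only at a high level. The snippet below is a minimal sketch of the general idea, assuming question-conditioned soft attention applied separately to dialog-history embeddings and image-region features, followed by concatenation-based fusion; all module names, dimensions, and the fusion choice are illustrative assumptions and not the authors' implementation.

```python
# Minimal sketch (not the authors' model): question-conditioned soft attention
# over dialog-history embeddings and image-region features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultimodalAttention(nn.Module):
    def __init__(self, q_dim=512, h_dim=512, v_dim=2048, att_dim=256):
        super().__init__()
        # Separate projections into an attention space for history and image regions.
        self.hist_proj = nn.Linear(q_dim + h_dim, att_dim)
        self.hist_score = nn.Linear(att_dim, 1)
        self.img_proj = nn.Linear(q_dim + v_dim, att_dim)
        self.img_score = nn.Linear(att_dim, 1)

    def attend(self, query, keys, proj, score):
        # query: (B, Dq); keys: (B, N, Dk) -> weighted sum of keys, shape (B, Dk)
        q = query.unsqueeze(1).expand(-1, keys.size(1), -1)
        e = score(torch.tanh(proj(torch.cat([q, keys], dim=-1)))).squeeze(-1)  # (B, N)
        alpha = F.softmax(e, dim=-1)                                           # attention weights
        return torch.bmm(alpha.unsqueeze(1), keys).squeeze(1)

    def forward(self, q_emb, hist_emb, img_feats):
        # q_emb: question embedding (B, Dq)
        # hist_emb: per-round history embeddings (B, T, Dh)
        # img_feats: image-region features (B, R, Dv)
        attended_hist = self.attend(q_emb, hist_emb, self.hist_proj, self.hist_score)
        attended_img = self.attend(q_emb, img_feats, self.img_proj, self.img_score)
        # Fuse the question with the attended history and attended image regions;
        # a decoder would score candidate answers from this joint representation.
        return torch.cat([q_emb, attended_hist, attended_img], dim=-1)
```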
CITATION STYLE
Kodra, L., & Meçe, E. K. (2018). Multimodal attention agents in visual conversation. In Lecture Notes on Data Engineering and Communications Technologies (Vol. 17, pp. 584–596). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-319-75928-9_52