Video dialog via progressive inference and cross-transformer

Weike Jin; Zhou Zhao; Mao Gu; Jun Xiao; Furu Wei; Yueting Zhuang

Conference Proceedings, Open Access

EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference (2019) 2109-2118

DOI: 10.18653/v1/D19-1217

4 Citations

90 Readers

Abstract

Video dialog is a new and challenging task that requires an agent to answer questions by combining video information with dialog history. Unlike single-turn video question answering, video dialog depends on the dialog history, which often carries contextual information for the question. Existing visual dialog methods mainly use an RNN to encode the dialog history as a single vector representation, which can be too coarse and simplistic. More advanced methods use hierarchical structures, attention, and memory mechanisms, but they still lack an explicit reasoning process. In this paper, we introduce a novel progressive inference mechanism for video dialog, which progressively updates the query information based on the dialog history and video content until the agent judges the information to be sufficient and unambiguous. To tackle the multimodal fusion problem, we propose a cross-transformer module, which learns more fine-grained and comprehensive interactions both within and between the modalities. Besides answer generation, we also consider question generation, which is more challenging but important for a complete video dialog system. We evaluate our method on two large-scale datasets, and extensive experiments show its effectiveness.
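The two ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the layer design or the stopping rule, so the function names (`attend`, `cross_block`, `progressive_inference`), the averaging update, and the tolerance-based stopping test are all assumptions made for illustration. Learned projections, multiple heads, and feed-forward layers are omitted.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, memory):
    # Scaled dot-product attention without learned weights (an assumption):
    # each query row becomes a weighted average of the memory rows.
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in memory]
        weights = softmax(scores)
        out.append([sum(w * row[d] for w, row in zip(weights, memory))
                    for d in range(len(q))])
    return out

def add(xs, ys):
    # Element-wise residual sum of two equally shaped matrices.
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]

def cross_block(video, text):
    # Hypothetical cross-modal block: attention inside each modality first,
    # then attention from each modality to the other, with residuals.
    video_self = add(video, attend(video, video))
    text_self = add(text, attend(text, text))
    video_out = add(video_self, attend(video_self, text_self))
    text_out = add(text_self, attend(text_self, video_self))
    return video_out, text_out

def progressive_inference(query, history, video, max_steps=5, tol=1e-3):
    # Hypothetical loop: refine the query from dialog history and video
    # features until it stops changing (assumed proxy for "sufficient and
    # unambiguous") or the step budget is exhausted.
    steps = 0
    for steps in range(1, max_steps + 1):
        context = attend([query], history + video)[0]
        updated = [(q + c) / 2 for q, c in zip(query, context)]
        change = max(abs(u - q) for u, q in zip(updated, query))
        query = updated
        if change < tol:
            break
    return query, steps

# Toy usage with 2-dimensional features (values are made up).
video_feats = [[1.0, 0.0], [0.0, 1.0]]
history_feats = [[1.0, 1.0]]
fused_video, fused_text = cross_block(video_feats, history_feats)
refined_query, used_steps = progressive_inference([0.5, 0.5], history_feats, video_feats)
```

The loop makes the "update until sufficient" idea concrete with a fixed-point test; a trained system would more plausibly learn the stopping decision, which the abstract leaves unspecified.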

Cite (APA)

Jin, W., Zhao, Z., Gu, M., Xiao, J., Wei, F., & Zhuang, Y. (2019). Video dialog via progressive inference and cross-transformer. In EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference (pp. 2109–2118). Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1217
