Multimodal Sentence Summarization via Multimodal Selective Encoding

41 citations · 79 Mendeley readers

Abstract

This paper studies the problem of generating a summary for a given sentence-image pair. Existing multimodal sequence-to-sequence approaches mainly focus on enhancing the decoder with visual signals, ignoring that the image can also improve the encoder's ability to identify the highlights of a news event or document. We therefore propose a multimodal selective gate network that models reciprocal relationships between textual features and multi-level visual features, including a global image descriptor, activation grids, and object proposals, to select the highlights of the event when encoding the source sentence. In addition, we introduce a modality regularization that encourages the summary to capture the highlights embedded in the image more accurately. To verify the generality of our model, we apply the multimodal selective gate to both a text-based decoder and a multimodal decoder. Experimental results on a public multimodal sentence summarization dataset demonstrate the advantage of our models over baselines. Further analysis suggests that the proposed multimodal selective gate network can effectively select important information from the input sentence.
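The selective-encoding idea in the abstract can be sketched roughly as follows: each encoder hidden state h_i is rescaled by an element-wise sigmoid gate computed from the concatenation of h_i with a visual feature vector v. This is a hypothetical pure-Python illustration using only a single global image vector; the actual model also fuses activation grids and object proposals, and uses learned neural-network parameters rather than hand-set weights.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def selective_gate(hidden_states, visual_feature, weights, bias):
    """Sketch of a multimodal selective gate.

    hidden_states: list of encoder hidden vectors h_i (each a list of floats)
    visual_feature: global image vector v (list of floats)
    weights, bias: illustrative gate parameters; row r of `weights` and
        bias[r] produce dimension r of the gate from [h_i; v]
    Returns gated hidden states h_i' = h_i * sigmoid(W [h_i; v] + b).
    """
    gated = []
    for h in hidden_states:
        concat = h + visual_feature  # concatenation [h_i; v]
        gate = [sigmoid(sum(w * x for w, x in zip(row, concat)) + b_r)
                for row, b_r in zip(weights, bias)]
        # Element-wise rescaling: gate values in (0, 1) suppress or keep
        # each dimension of the hidden state.
        gated.append([h_d * g_d for h_d, g_d in zip(h, gate)])
    return gated
```

Because every gate value lies in (0, 1), the gated representation can only attenuate, never amplify, each dimension of the original hidden state; downstream decoding then attends over these filtered states.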

Citation (APA)

Li, H., Zhu, J., Zhang, J., Zong, C., & He, X. (2020). Multimodal Sentence Summarization via Multimodal Selective Encoding. In COLING 2020 - 28th International Conference on Computational Linguistics, Proceedings of the Conference (pp. 5655–5667). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2020.coling-main.496