Abstract
Multi-modal named entity recognition (MNER) aims to discover named entities in free text and classify them into pre-defined types with the aid of associated images. However, dominant MNER models do not fully exploit the fine-grained semantic correspondences between semantic units of different modalities, which have the potential to refine multi-modal representation learning. To address this issue, we propose a unified multi-modal graph fusion (UMGF) approach for MNER. Specifically, we first represent the input sentence and image using a unified multi-modal graph, which captures various semantic relationships between multi-modal semantic units (words and visual objects). Then, we stack multiple graph-based multi-modal fusion layers that iteratively perform semantic interactions to learn node representations. Finally, we obtain an attention-based multi-modal representation for each word and perform entity labeling with a CRF decoder. Experiments on two benchmark datasets demonstrate the superiority of our MNER model.
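To make the pipeline concrete, here is a minimal PyTorch sketch of the three stages the abstract describes: a unified graph over word and visual-object nodes, stacked graph-based fusion layers, and per-word tagging. Every concrete choice here (the embedding dimension, multi-head attention as the graph fusion operator, the toy fully connected adjacency, and a greedy argmax standing in for the paper's CRF decoder) is an assumption for illustration, not the authors' actual implementation.

```python
import torch
import torch.nn as nn


class GraphFusionLayer(nn.Module):
    """One graph-based multi-modal fusion layer: every node (word or visual
    object) is updated by attending to its neighbors in the unified graph."""

    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, nodes: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # adj: (N, N) bool, True where two nodes share an edge. The boolean
        # attention mask disallows pairs that are NOT connected in the graph,
        # so semantic interaction is restricted to graph neighbors.
        fused, _ = self.attn(nodes, nodes, nodes, attn_mask=~adj)
        return self.norm(nodes + fused)  # residual connection + layer norm


class UMGFSketch(nn.Module):
    def __init__(self, dim: int = 64, num_layers: int = 3, num_tags: int = 9):
        super().__init__()
        self.layers = nn.ModuleList(
            GraphFusionLayer(dim) for _ in range(num_layers)
        )
        self.emit = nn.Linear(dim, num_tags)  # per-word emission scores

    def forward(self, words, objects, adj):
        # Unified multi-modal graph: word nodes and visual-object nodes form
        # one node set; adj encodes word-word, word-object, object-object edges.
        nodes = torch.cat([words, objects], dim=1)
        for layer in self.layers:  # iterative semantic interaction
            nodes = layer(nodes, adj)
        word_nodes = nodes[:, : words.size(1)]  # keep only word positions
        # Greedy argmax over emissions stands in for the paper's CRF decoder.
        return self.emit(word_nodes).argmax(dim=-1)


# Toy usage: a 5-word sentence with 2 detected visual objects.
B, W, V, D = 1, 5, 2, 64
words, objects = torch.randn(B, W, D), torch.randn(B, V, D)
adj = torch.ones(W + V, W + V, dtype=torch.bool)  # fully connected toy graph
tags = UMGFSketch(dim=D)(words, objects, adj)
print(tags.shape)  # torch.Size([1, 5]): one tag id per word
```

In the full model, the adjacency would come from the paper's graph construction over words and detected visual objects rather than a fully connected placeholder, and a CRF layer would score whole tag sequences instead of each word independently.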
Citation
Zhang, D., Wei, S., Li, S., Wu, H., Zhu, Q., & Zhou, G. (2021). Multi-modal Graph Fusion for Named Entity Recognition with Targeted Visual Guidance. In 35th AAAI Conference on Artificial Intelligence, AAAI 2021 (Vol. 16, pp. 14347–14355). Association for the Advancement of Artificial Intelligence. https://doi.org/10.1609/aaai.v35i16.17687