Abstract
As an important task in sentiment analysis, Multimodal Aspect-Based Sentiment Analysis (MABSA) has attracted increasing attention in recent years. However, previous approaches either (i) use separately pre-trained visual and textual models, which ignores cross-modal alignment, or (ii) use vision-language models pre-trained with general pre-training tasks, which are inadequate for identifying fine-grained aspects, opinions, and their alignments across modalities. To tackle these limitations, we propose a task-specific Vision-Language Pre-training framework for MABSA (VLP-MABSA), a unified multimodal encoder-decoder architecture shared by all pre-training and downstream tasks. We further design three types of task-specific pre-training tasks for the language, vision, and multimodal modalities, respectively. Experimental results show that our approach generally outperforms state-of-the-art approaches on three MABSA subtasks. Further analysis demonstrates the effectiveness of each pre-training task. The source code is publicly released at https://github.com/NUSTM/VLP-MABSA.
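To make the "unified multimodal encoder-decoder" idea concrete, here is a minimal PyTorch sketch of a generative model in which projected image-region features and text token embeddings share one encoder, and a single decoder produces the target sequence for whichever pre-training or downstream task is requested. This is an illustrative assumption based only on the abstract: the class name, dimensions, region-feature projection, and task-prefix mechanism are placeholders, not the authors' actual implementation (see the released repository for that).

```python
# Hedged sketch of a unified multimodal encoder-decoder (assumed BART-style
# generative formulation); all hyperparameters are illustrative.
import torch
import torch.nn as nn


class UnifiedMultimodalEncoderDecoder(nn.Module):
    def __init__(self, vocab_size=50265, d_model=768, nhead=12,
                 num_layers=6, region_feat_dim=2048):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        # Project visual region features (e.g. from an object detector)
        # into the same space as the token embeddings.
        self.region_proj = nn.Linear(region_feat_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids, region_feats, decoder_input_ids):
        # One encoder sees both modalities: projected image regions are
        # concatenated with text token embeddings.  The decoder then
        # generates the target sequence for the selected pre-training or
        # downstream task (distinguished here by a hypothetical task prefix
        # in decoder_input_ids).
        text = self.tok_embed(input_ids)
        vision = self.region_proj(region_feats)
        memory = self.encoder(torch.cat([vision, text], dim=1))
        tgt = self.tok_embed(decoder_input_ids)
        hidden = self.decoder(tgt, memory)
        return self.lm_head(hidden)


if __name__ == "__main__":
    model = UnifiedMultimodalEncoderDecoder()
    text_ids = torch.randint(0, 50265, (2, 20))   # tokenised tweet
    regions = torch.randn(2, 36, 2048)            # detected image regions
    dec_ids = torch.randint(0, 50265, (2, 10))    # task prefix + target span
    logits = model(text_ids, regions, dec_ids)
    print(logits.shape)                           # torch.Size([2, 10, 50265])
```

Because every task is cast as sequence generation over this shared backbone, the same model can be pre-trained with language-, vision-, and multimodal-level objectives and then fine-tuned on the MABSA subtasks without architectural changes.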
Citation
Ling, Y., Yu, J., & Xia, R. (2022). Vision-Language Pre-Training for Multimodal Aspect-Based Sentiment Analysis. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 2149–2159). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.acl-long.152