Vision-Language Pre-Training for Multimodal Aspect-Based Sentiment Analysis

118 citations · 113 Mendeley readers

Abstract

As an important task in sentiment analysis, Multimodal Aspect-Based Sentiment Analysis (MABSA) has attracted increasing attention in recent years. However, previous approaches either (i) use separately pre-trained visual and textual models, which ignore the cross-modal alignment or (ii) use vision-language models pre-trained with general pre-training tasks, which are inadequate to identify fine-grained aspects, opinions, and their alignments across modalities. To tackle these limitations, we propose a task-specific Vision-Language Pre-training framework for MABSA (VLP-MABSA), which is a unified multimodal encoder-decoder architecture for all the pre-training and downstream tasks. We further design three types of task-specific pre-training tasks from the language, vision, and multimodal modalities, respectively. Experimental results show that our approach generally outperforms the state-of-the-art approaches on three MABSA subtasks. Further analysis demonstrates the effectiveness of each pre-training task. The source code is publicly released at https://github.com/NUSTM/VLP-MABSA.
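To make the described framework concrete, below is a minimal sketch of a unified multimodal encoder-decoder of the kind the abstract refers to: pre-extracted image-region features are projected into the token-embedding space and encoded jointly with the text, and a decoder generates the target sequence (e.g., aspect or opinion spans). This is an illustrative toy in PyTorch, not the authors' implementation; the model sizes, feature dimensions, and the use of nn.Transformer are assumptions for readability (the released code builds on a BART-style backbone with task-specific pre-training objectives).

```python
import torch
import torch.nn as nn


class ToyMultimodalEncoderDecoder(nn.Module):
    """Toy BART-style encoder-decoder over [image regions ; text tokens].

    Dimensions, layer counts, and vocabulary size are illustrative only.
    """

    def __init__(self, vocab_size=1000, d_model=128, img_feat_dim=2048,
                 nhead=4, num_layers=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Project pre-extracted visual region features (e.g., detector
        # outputs) into the same space as the token embeddings.
        self.img_proj = nn.Linear(img_feat_dim, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids, img_feats, decoder_ids):
        # Encoder input: image regions and text tokens in one shared sequence.
        src = torch.cat([self.img_proj(img_feats), self.tok_emb(text_ids)], dim=1)
        tgt = self.tok_emb(decoder_ids)
        hidden = self.transformer(src, tgt)
        return self.lm_head(hidden)  # logits over the vocabulary


# Usage: one image with 4 region features, a 6-token sentence, and a
# 5-token target sequence to be generated by the decoder.
model = ToyMultimodalEncoderDecoder()
text_ids = torch.randint(0, 1000, (1, 6))
img_feats = torch.randn(1, 4, 2048)
decoder_ids = torch.randint(0, 1000, (1, 5))
logits = model(text_ids, img_feats, decoder_ids)
print(logits.shape)  # torch.Size([1, 5, 1000])
```

In this generative formulation, the same encoder-decoder can serve both the pre-training tasks and the downstream MABSA subtasks by changing only the target sequence it is asked to produce; the paper's three families of pre-training tasks (language, vision, and multimodal) supply those targets during pre-training.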

Citation (APA)

Ling, Y., Yu, J., & Xia, R. (2022). Vision-Language Pre-Training for Multimodal Aspect-Based Sentiment Analysis. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 2149–2159). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.acl-long.152
