Abstract
As an important task in sentiment analysis, Multimodal Aspect-Based Sentiment Analysis (MABSA) has attracted increasing attention in recent years. However, previous approaches either (i) use separately pre-trained visual and textual models, which ignores cross-modal alignment, or (ii) use vision-language models pre-trained with general pre-training tasks, which are inadequate for identifying fine-grained aspects, opinions, and their alignments across modalities. To tackle these limitations, we propose a task-specific Vision-Language Pre-training framework for MABSA (VLP-MABSA), a unified multimodal encoder-decoder architecture shared by all pre-training and downstream tasks. We further design three types of task-specific pre-training tasks for the language, vision, and multimodal modalities, respectively. Experimental results show that our approach generally outperforms state-of-the-art approaches on three MABSA subtasks. Further analysis demonstrates the effectiveness of each pre-training task. The source code is publicly released at https://github.com/NUSTM/VLP-MABSA.
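To make the "unified multimodal encoder-decoder" idea concrete, here is a minimal PyTorch sketch of a generative model in which projected image-region features and text token embeddings share one encoder, and a single decoder produces the target sequence for whichever pre-training or downstream task is requested. This is an illustrative assumption based only on the abstract: the class name, dimensions, region-feature projection, and task-prefix mechanism are placeholders, not the authors' actual implementation (see the released repository for that).

```python
# Hedged sketch of a unified multimodal encoder-decoder (assumed BART-style
# generative formulation); all hyperparameters are illustrative.
import torch
import torch.nn as nn


class UnifiedMultimodalEncoderDecoder(nn.Module):
    def __init__(self, vocab_size=50265, d_model=768, nhead=12,
                 num_layers=6, region_feat_dim=2048):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        # Project visual region features (e.g. from an object detector)
        # into the same space as the token embeddings.
        self.region_proj = nn.Linear(region_feat_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids, region_feats, decoder_input_ids):
        # One encoder sees both modalities: projected image regions are
        # concatenated with text token embeddings.  The decoder then
        # generates the target sequence for the selected pre-training or
        # downstream task (distinguished here by a hypothetical task prefix
        # in decoder_input_ids).
        text = self.tok_embed(input_ids)
        vision = self.region_proj(region_feats)
        memory = self.encoder(torch.cat([vision, text], dim=1))
        tgt = self.tok_embed(decoder_input_ids)
        hidden = self.decoder(tgt, memory)
        return self.lm_head(hidden)


if __name__ == "__main__":
    model = UnifiedMultimodalEncoderDecoder()
    text_ids = torch.randint(0, 50265, (2, 20))   # tokenised tweet
    regions = torch.randn(2, 36, 2048)            # detected image regions
    dec_ids = torch.randint(0, 50265, (2, 10))    # task prefix + target span
    logits = model(text_ids, regions, dec_ids)
    print(logits.shape)                           # torch.Size([2, 10, 50265])
```

Because every task is cast as sequence generation over this shared backbone, the same model can be pre-trained with language-, vision-, and multimodal-level objectives and then fine-tuned on the MABSA subtasks without architectural changes.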
Citation
Ling, Y., Yu, J., & Xia, R. (2022). Vision-Language Pre-Training for Multimodal Aspect-Based Sentiment Analysis. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 2149–2159). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.acl-long.152