Multimedia analysis of robustly optimized multimodal transformer based on vision and language co-learning

Jun Ho Yoon; Gyu Ho Choi; Chang Choi

Journal Article

Information Fusion (2023) 100

DOI: 10.1016/j.inffus.2023.101922

16 Citations

15 Readers

Abstract

Multimodal learning that uses information from all modalities has recently been studied as a way to detect disinformation in multimedia. Existing methods include score-level fusion approaches, which combine the outputs of different models, and feature-level fusion approaches, which combine embedding vectors to integrate data of different dimensions. Because late-level fusion combines results only after each modality has been processed separately, its performance is bounded by the recognition performance of the individual unimodal models. In addition, fusion methods require that data be matched across modalities. In this study, we propose a classification system built on a RoBERTa-based multimodal fusion transformer (RoBERTaMFT), which applies co-learning to address both the limited recognition performance of multimodal learning and the data imbalance among modalities. RoBERTaMFT consists of image feature extraction, co-learning through the reconstruction of image features with text embeddings, and a late-level fusion step applied at the final classification. In experiments on the CrisisMMD dataset, RoBERTaMFT achieved an accuracy 21.2% higher and an F1-score 0.414 higher than unimodal learning, and an accuracy 11.7% higher and an F1-score 0.268 higher than existing multimodal learning.
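
The two fusion families named in the abstract can be contrasted with a minimal sketch. This is not the RoBERTaMFT implementation: the function names, the equal weighting, and the toy embeddings and logits below are all assumptions chosen for illustration, and the abstract does not specify how the authors combine scores.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def feature_level_fusion(text_emb, image_emb):
    """Feature-level fusion: join per-modality embeddings into one vector
    that a single downstream classifier would consume."""
    return list(text_emb) + list(image_emb)

def score_level_fusion(text_logits, image_logits, text_weight=0.5):
    """Score-level (late) fusion: each modality is classified on its own,
    then the class probabilities are averaged. The weight is hypothetical."""
    p_text = softmax(text_logits)
    p_image = softmax(image_logits)
    return [text_weight * t + (1.0 - text_weight) * i
            for t, i in zip(p_text, p_image)]

def predict(probs):
    """Index of the most probable class."""
    return max(range(len(probs)), key=probs.__getitem__)

# Toy inputs (made up, not taken from CrisisMMD).
text_emb, image_emb = [0.2, -0.1, 0.7], [0.5, 0.3]
fused = feature_level_fusion(text_emb, image_emb)

text_logits, image_logits = [2.0, 0.5], [0.1, 0.4]
probs = score_level_fusion(text_logits, image_logits)
label = predict(probs)
```

In the late-fusion path the final decision depends entirely on the per-modality scores, which illustrates why a weak unimodal model limits the fused result.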

APA

Yoon, J. H., Choi, G. H., & Choi, C. (2023). Multimedia analysis of robustly optimized multimodal transformer based on vision and language co-learning. Information Fusion, 100. https://doi.org/10.1016/j.inffus.2023.101922
