We introduce an ensemble model approach for multimodal sentiment analysis, focusing on the fusion of textual and video data to enhance the accuracy and depth of emotion interpretation. By integrating three foundational models (IFFSA, BFSA, and TBJE) using advanced ensemble techniques, we achieve a significant improvement in sentiment analysis performance across diverse datasets, including MOSI and MOSEI. Specifically, we propose two novel models, IFFSA and BFSA, which utilise the large language models BERT and GPT-2 to extract features from the text modality, and ResNet and VGG to extract features from the video modality. Our work uniquely contributes to the field by demonstrating the synergistic potential of combining the analytical strengths of different modalities, thereby addressing the intricate challenge of nuanced emotion detection in multimodal contexts. Through comprehensive experiments and an extensive ablation study, we not only validate the superior performance of our ensemble model against current state-of-the-art benchmarks but also reveal critical insights into the model's capability to discern complex emotional states. Our findings underscore the strategic advantage of ensemble methods in multimodal sentiment analysis and set a new precedent for future research in effectively integrating multimodal data sources.
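To make the described pipeline concrete, the sketch below shows one plausible way the components named in the abstract could fit together: BERT and GPT-2 pooled into a text feature vector, ResNet and VGG pooled into a video feature vector, a fusion head producing a per-model sentiment score, and a weighted average over the three base models' scores as the ensemble step. The class names, feature dimensions, pooling choices, and equal ensemble weights are illustrative assumptions for this sketch; they are not the authors' released implementation of IFFSA, BFSA, or TBJE.

```python
# Illustrative sketch only: text (BERT + GPT-2) and video (ResNet + VGG) feature
# extraction, feature-level fusion, and a simple late-fusion ensemble. All
# architectural details below are assumptions, not the paper's actual models.
import torch
import torch.nn as nn
from transformers import BertModel, GPT2Model
from torchvision.models import resnet50, vgg16

class TextEncoder(nn.Module):
    """Mean-pools BERT and GPT-2 token embeddings into one text feature vector."""
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.gpt2 = GPT2Model.from_pretrained("gpt2")

    def forward(self, bert_inputs, gpt2_inputs):
        bert_feat = self.bert(**bert_inputs).last_hidden_state.mean(dim=1)  # (B, 768)
        gpt2_feat = self.gpt2(**gpt2_inputs).last_hidden_state.mean(dim=1)  # (B, 768)
        return torch.cat([bert_feat, gpt2_feat], dim=-1)                    # (B, 1536)

class VideoEncoder(nn.Module):
    """Averages per-frame ResNet-50 and VGG-16 features over the clip."""
    def __init__(self):
        super().__init__()
        resnet = resnet50(weights=None)
        self.resnet = nn.Sequential(*list(resnet.children())[:-1])          # 2048-d
        vgg = vgg16(weights=None)
        self.vgg = nn.Sequential(vgg.features, vgg.avgpool, nn.Flatten())   # 25088-d

    def forward(self, frames):                      # frames: (B, T, 3, 224, 224)
        b, t = frames.shape[:2]
        flat = frames.flatten(0, 1)                 # (B*T, 3, 224, 224)
        r = self.resnet(flat).flatten(1).view(b, t, -1).mean(dim=1)
        v = self.vgg(flat).view(b, t, -1).mean(dim=1)
        return torch.cat([r, v], dim=-1)

class SentimentHead(nn.Module):
    """Fuses text and video features and predicts a continuous sentiment score."""
    def __init__(self, text_dim=1536, video_dim=2048 + 25088):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + video_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, text_feat, video_feat):
        return self.mlp(torch.cat([text_feat, video_feat], dim=-1))

def ensemble_predict(model_scores, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Late-fusion ensemble: weighted average of per-model sentiment scores."""
    return sum(w * s for w, s in zip(weights, model_scores))
```

In a full system, each base model (IFFSA, BFSA, TBJE) would produce its own score and the ensemble weights would be tuned on validation data; the equal weights above are only a placeholder for whatever ensemble strategy the paper applies.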
Liu, Z., Braytee, A., Anaissi, A., Zhang, G., Qin, L., & Akram, J. (2024). Ensemble Pretrained Models for Multimodal Sentiment Analysis using Textual and Video Data Fusion. In WWW 2024 Companion - Companion Proceedings of the ACM Web Conference (pp. 1841–1848). Association for Computing Machinery, Inc. https://doi.org/10.1145/3589335.3651971