Abstract
Sentiment analysis has become a critical tool for decoding market dynamics in the rapidly evolving financial landscape, yet traditional approaches remain confined to textual data, overlooking the rich multimodal cues embedded in audio and video. This paper presents a zero-shot framework that harnesses Multimodal Large Language Models (MLLMs) for sentiment-driven investment by integrating text, audio, and video modalities. We introduce a comprehensive suite of metrics for extracting nuanced emotional signals, a self-consistent signal verification mechanism to improve the reliability of market predictions, and a JSON output schema for seamless automation. To validate the framework, we curate the White House Press Briefing (WHPB) Video Benchmark Database, a novel dataset of 30 press briefings from January to July 2025 that offers a robust testbed for multimodal analysis. Extensive experiments show that the full-multimodal approach, leveraging text, audio, and video, outperforms text-only and text-audio baselines, achieving superior returns across diverse assets, including a 2,843.9% annualized return on the VIX. Beyond redefining financial sentiment analysis, this work lays a foundation for AI-driven investment strategies that give investors deeper insight into market sentiment. Our WHPB database is available at https://github.com/sutan244/White-House-Press-Briefing-Video-Benchmark-Dataset-WHPB.
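To make the pipeline concrete, the sketch below illustrates what a JSON-structured sentiment signal with majority-vote self-consistency verification might look like. This is a minimal, hypothetical example: the field names (`clip_id`, `trade_signal`, `confidence`), the `verify_self_consistency` helper, and the agreement threshold are all illustrative assumptions, not the schema or mechanism published in the paper.

```python
import json
from collections import Counter

# Hypothetical per-clip output schema; field names are illustrative
# guesses, not the JSON schema defined in the paper.
EXAMPLE_SIGNAL = {
    "clip_id": "whpb-2025-03-14",
    "modalities": ["text", "audio", "video"],
    "sentiment": {"label": "negative", "score": -0.62},
    "trade_signal": "long_vix",  # e.g. long volatility on negative tone
    "confidence": 0.81,
}


def verify_self_consistency(samples, min_agreement=0.6):
    """Toy self-consistency check: sample the MLLM several times and keep
    the majority trade signal only if it clears an agreement threshold."""
    votes = Counter(s["trade_signal"] for s in samples)
    signal, count = votes.most_common(1)[0]
    if count / len(samples) < min_agreement:
        return None  # inconsistent samples -> abstain from trading
    # Return the first sample carrying the majority signal.
    return next(s for s in samples if s["trade_signal"] == signal)


if __name__ == "__main__":
    samples = [
        EXAMPLE_SIGNAL,
        {**EXAMPLE_SIGNAL, "confidence": 0.77},
        {**EXAMPLE_SIGNAL, "trade_signal": "flat", "confidence": 0.40},
    ]
    verified = verify_self_consistency(samples)
    print(json.dumps(verified, indent=2))
```

Structuring the model output as machine-readable JSON is what enables the "seamless automation" the abstract mentions: downstream trading logic can consume the verified signal directly without parsing free-form text.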
Citation
Tan, S., So, C. C., Sun, Y., Wang, J. M., Loh, W. K. A., & Yung, S. P. (2025). Vision, Voice, and Text: Pioneering Zero-shot Multimodal LLMs for Sentiment-driven Investment. In ICAIF 2025 - 6th ACM International Conference on AI in Finance (pp. 960–968). Association for Computing Machinery, Inc. https://doi.org/10.1145/3768292.3770368