Abstract
Sentiment analysis has become a critical tool for decoding market dynamics in the rapidly evolving financial landscape, yet traditional approaches remain confined to textual data, overlooking the rich multimodal cues embedded in audio and video. This paper presents a zero-shot framework that harnesses Multimodal Large Language Models (MLLMs) for sentiment-driven investment by integrating text, audio, and video modalities. We introduce a comprehensive suite of metrics for extracting nuanced emotional signals, a self-consistent signal verification mechanism to improve the reliability of market predictions, and a JSON output schema for seamless automation. To validate the framework, we curate the White House Press Briefing (WHPB) Video Benchmark Database, a novel dataset of 30 press briefings from January to July 2025 that offers a robust testbed for multimodal analysis. Extensive experiments show that the full-multimodal approach, leveraging text, audio, and video, outperforms text-only and text-audio baselines, achieving superior returns across diverse assets, including a 2,843.9% annualized return on the VIX. Beyond redefining financial sentiment analysis, this work lays a foundation for AI-driven investment strategies that give investors deeper insight into market sentiment. Our WHPB database is available at https://github.com/sutan244/White-House-Press-Briefing-Video-Benchmark-Dataset-WHPB.
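To make the pipeline concrete, the sketch below illustrates what a JSON-structured sentiment signal with majority-vote self-consistency verification might look like. This is a minimal, hypothetical example: the field names (`clip_id`, `trade_signal`, `confidence`), the `verify_self_consistency` helper, and the agreement threshold are all illustrative assumptions, not the schema or mechanism published in the paper.

```python
import json
from collections import Counter

# Hypothetical per-clip output schema; field names are illustrative
# guesses, not the JSON schema defined in the paper.
EXAMPLE_SIGNAL = {
    "clip_id": "whpb-2025-03-14",
    "modalities": ["text", "audio", "video"],
    "sentiment": {"label": "negative", "score": -0.62},
    "trade_signal": "long_vix",  # e.g. long volatility on negative tone
    "confidence": 0.81,
}


def verify_self_consistency(samples, min_agreement=0.6):
    """Toy self-consistency check: sample the MLLM several times and keep
    the majority trade signal only if it clears an agreement threshold."""
    votes = Counter(s["trade_signal"] for s in samples)
    signal, count = votes.most_common(1)[0]
    if count / len(samples) < min_agreement:
        return None  # inconsistent samples -> abstain from trading
    # Return the first sample carrying the majority signal.
    return next(s for s in samples if s["trade_signal"] == signal)


if __name__ == "__main__":
    samples = [
        EXAMPLE_SIGNAL,
        {**EXAMPLE_SIGNAL, "confidence": 0.77},
        {**EXAMPLE_SIGNAL, "trade_signal": "flat", "confidence": 0.40},
    ]
    verified = verify_self_consistency(samples)
    print(json.dumps(verified, indent=2))
```

Structuring the model output as machine-readable JSON is what enables the "seamless automation" the abstract mentions: downstream trading logic can consume the verified signal directly without parsing free-form text.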
Citation
Tan, S., So, C. C., Sun, Y., Wang, J. M., Loh, W. K. A., & Yung, S. P. (2025). Vision, Voice, and Text: Pioneering Zero-shot Multimodal LLMs for Sentiment-driven Investment. In ICAIF 2025 - 6th ACM International Conference on AI in Finance (pp. 960–968). Association for Computing Machinery, Inc. https://doi.org/10.1145/3768292.3770368