Enhancing Multimodal Understanding With LIUS: A Novel Framework for Visual Question Answering in Digital Marketing

Abstract

VQA (visual question answering) is the task of enabling a computer to generate accurate textual answers to questions about given images. It integrates computer vision and natural language processing and requires a model that understands not only the image content but also the question in order to generate an appropriate linguistic answer. However, current limitations in cross-modal understanding mean that models often struggle to capture the complex relationships between images and questions, leading to inaccurate or ambiguous answers. This research addresses that challenge through a multifaceted approach that combines the strengths of vision and language processing. The proposed LIUS framework builds a specialized vision module that processes image information and fuses features at multiple scales; the insights gained from this module are then passed to a "reasoning module" (an LLM) to generate answers.
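Since the abstract only outlines the architecture, the following is a minimal PyTorch sketch of such a pipeline: a vision module that pools and fuses convolutional features at multiple spatial scales, projected into the token space of a stand-in "reasoning module" that answers the question. Every class name, dimension, and design choice here (global pooling plus a linear fusion layer, with a small Transformer standing in for the LLM) is an illustrative assumption, not a detail taken from the paper.

```python
# Illustrative sketch of a LIUS-style VQA pipeline as described in the
# abstract. All names, shapes, and the fusion strategy are assumptions.
import torch
import torch.nn as nn


class MultiScaleVisionModule(nn.Module):
    """Extracts feature maps at several spatial scales and fuses them."""

    def __init__(self, in_channels: int = 3, dim: int = 256):
        super().__init__()
        # Three convolutional stages, each halving spatial resolution.
        self.stage1 = nn.Sequential(nn.Conv2d(in_channels, dim, 3, stride=2, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU())
        self.fuse = nn.Linear(3 * dim, dim)  # fuse pooled features from all scales

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        f1 = self.stage1(image)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        # Global-average-pool each scale, then concatenate and fuse.
        pooled = [f.mean(dim=(2, 3)) for f in (f1, f2, f3)]
        return self.fuse(torch.cat(pooled, dim=-1))  # (batch, dim)


class LIUSSketch(nn.Module):
    """Vision module plus a placeholder LLM 'reasoning module'."""

    def __init__(self, dim: int = 256, llm_dim: int = 512, vocab_size: int = 32000):
        super().__init__()
        self.vision = MultiScaleVisionModule(dim=dim)
        self.project = nn.Linear(dim, llm_dim)  # map visual features to LLM space
        # Stand-in for a pretrained LLM: embeds question tokens, attends
        # over the prefixed visual token, and predicts answer logits.
        self.embed = nn.Embedding(vocab_size, llm_dim)
        layer = nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True)
        self.reasoner = nn.TransformerEncoder(layer, num_layers=2)
        self.answer_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image: torch.Tensor, question_ids: torch.Tensor) -> torch.Tensor:
        visual = self.project(self.vision(image)).unsqueeze(1)  # (batch, 1, llm_dim)
        tokens = self.embed(question_ids)                       # (batch, seq, llm_dim)
        fused = self.reasoner(torch.cat([visual, tokens], dim=1))
        return self.answer_head(fused[:, 0])  # answer logits from the visual slot


if __name__ == "__main__":
    model = LIUSSketch()
    image = torch.randn(2, 3, 224, 224)
    question = torch.randint(0, 32000, (2, 12))
    print(model(image, question).shape)  # torch.Size([2, 32000])
```

In a real system the Transformer stand-in would be replaced by a pretrained LLM, with the projected visual features supplied as prefix tokens; the sketch only shows how multi-scale visual fusion and language reasoning can be wired together.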

Cite

APA

Song, C. (2024). Enhancing Multimodal Understanding With LIUS: A Novel Framework for Visual Question Answering in Digital Marketing. Journal of Organizational and End User Computing, 36(1). https://doi.org/10.4018/JOEUC.336276
