Abstract
Visual question answering (VQA) is the task of enabling a computer to generate accurate textual answers to questions about given images. It integrates computer vision and natural language processing and requires a model that understands not only the image content but also the question in order to produce an appropriate linguistic answer. However, current limitations in cross-modal understanding mean that models often fail to capture the complex relationships between images and questions, leading to inaccurate or ambiguous answers. This research addresses this challenge through a multifaceted approach that combines the strengths of vision and language processing. The proposed LIUS framework builds a specialized vision module that processes image information and fuses features at multiple scales; the insights gained from this module are integrated with a “reasoning module” (an LLM) to generate answers.
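As a rough illustration of the pipeline the abstract describes, the following minimal PyTorch sketch shows a vision module that fuses image features extracted at several scales into a single representation that could then condition a language model. All names (MultiScaleFusion), the choice of strides, and the fusion-by-concatenation design are assumptions for illustration, not the paper's actual LIUS architecture.

```python
import torch
import torch.nn as nn

class MultiScaleFusion(nn.Module):
    """Hypothetical vision module that fuses multi-scale image features,
    in the spirit of the LIUS vision module described in the abstract
    (the paper's actual architecture may differ)."""

    def __init__(self, in_channels: int = 3, dim: int = 256):
        super().__init__()
        # One convolutional branch per scale; strides 1/2/4 stand in for
        # coarse-to-fine feature maps (assumed scales, not from the paper).
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, dim, kernel_size=3, stride=s, padding=1)
            for s in (1, 2, 4)
        ])
        self.fuse = nn.Linear(dim * 3, dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # Pool each scale's feature map to a vector, then fuse by
        # concatenation and a linear projection.
        pooled = [branch(image).mean(dim=(2, 3)) for branch in self.branches]
        return self.fuse(torch.cat(pooled, dim=-1))


# The fused visual features would then be passed, together with the
# question, to a "reasoning module" (an LLM) that generates the answer.
vision = MultiScaleFusion()
features = vision(torch.randn(1, 3, 224, 224))
print(features.shape)  # -> torch.Size([1, 256])
```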
Citation
Song, C. (2024). Enhancing Multimodal Understanding With LIUS: A Novel Framework for Visual Question Answering in Digital Marketing. Journal of Organizational and End User Computing, 36(1). https://doi.org/10.4018/JOEUC.336276