Separate and Locate: Rethink the Text in Text-based Visual Question Answering

Chengyang Fang; Jiangnan Li; Liang Li; Can Ma; Dayong Hu

Conference Proceedings, Open Access

MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia (2023) 4378-4388

DOI: 10.1145/3581783.3611753

Abstract

Text-based Visual Question Answering (TextVQA) aims at answering questions about the text in images. Most works in this field focus on designing network structures or pre-training tasks. All these methods list the OCR texts in reading order (from left to right and top to bottom) to form a sequence, which is treated as a natural language "sentence". However, they ignore the fact that most OCR words in the TextVQA task do not have a semantic contextual relationship. In addition, these approaches use 1-D position embedding to construct the spatial relation between OCR tokens sequentially, which is not reasonable: the 1-D position embedding can only represent the left-right sequence relationship between words in a sentence, not the complex spatial position relationship. To tackle these problems, we propose a novel method named Separate and Locate (SaL) that explores text contextual cues and designs spatial position embedding to construct spatial relations between OCR texts. Specifically, we propose a Text Semantic Separate (TSS) module that helps the model recognize whether words have semantic contextual relations. Then, we introduce a Spatial Circle Position (SCP) module that helps the model better construct and reason about the spatial position relationships between OCR texts. Our SaL model outperforms the baseline model by 4.44% and 3.96% accuracy on the TextVQA and ST-VQA datasets, respectively. Compared with the state-of-the-art pre-training method, which is pre-trained on 64 million samples, our method, without any pre-training tasks, still achieves 2.68% and 2.52% accuracy improvement on TextVQA and ST-VQA. Our code and models will be released at https://github.com/fangbufang/SaL.
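The gap between reading-order position and spatial position that the abstract points to can be illustrated with a small toy example. This is a minimal sketch, not the SCP module or any part of the SaL code: the token list, the pixel coordinates, the `row_height` bucketing, and all function names are hypothetical and chosen only for illustration.

```python
from math import hypot

# Hypothetical OCR tokens: (word, x_center, y_center) in pixels.
# Two columns of a sign, far apart horizontally, close vertically.
TOKENS = [
    ("HOURS", 450, 90),
    ("EXIT", 50, 40),
    ("12", 50, 90),
    ("OPEN", 450, 40),
]


def reading_order(tokens, row_height=50):
    """Sort tokens top to bottom, then left to right.

    Tokens are grouped into rows by integer-dividing y by row_height
    (an assumed, simplistic row-bucketing rule).
    """
    return sorted(tokens, key=lambda t: (t[2] // row_height, t[1]))


def index_gap(ordered, word_a, word_b):
    """Distance between two words measured in 1-D sequence positions."""
    words = [t[0] for t in ordered]
    return abs(words.index(word_a) - words.index(word_b))


def spatial_distance(ordered, word_a, word_b):
    """Euclidean distance between two words' centers in the image."""
    centers = {t[0]: (t[1], t[2]) for t in ordered}
    (xa, ya), (xb, yb) = centers[word_a], centers[word_b]
    return hypot(xa - xb, ya - yb)


ordered = reading_order(TOKENS)

# In the 1-D sequence, OPEN is adjacent to EXIT and "12" is two steps away ...
gap_exit_open = index_gap(ordered, "EXIT", "OPEN")
gap_exit_12 = index_gap(ordered, "EXIT", "12")

# ... but in the image, "12" sits directly below EXIT and OPEN is far to the right.
dist_exit_open = spatial_distance(ordered, "EXIT", "OPEN")
dist_exit_12 = spatial_distance(ordered, "EXIT", "12")
```

In this toy layout the sequence index ranks OPEN as the nearest neighbour of EXIT, while the image geometry ranks "12" as nearest, which is the kind of mismatch a purely 1-D position embedding cannot express.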

Cite

Fang, C., Li, J., Li, L., Ma, C., & Hu, D. (2023). Separate and Locate: Rethink the Text in Text-based Visual Question Answering. In MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia (pp. 4378–4388). Association for Computing Machinery, Inc. https://doi.org/10.1145/3581783.3611753
