Abstract
Unsupervised pre-training on millions of born-digital or scanned documents has yielded promising advances in visual document understanding (VDU). While existing solutions study various vision-language pre-training objectives, the document textline, an intrinsic granularity in VDU, has seldom been explored. A document textline usually contains words that are spatially and semantically correlated, and textlines can be easily obtained from OCR engines. In this paper, we propose WUKONG-READER, trained with new pre-training objectives that leverage the structural knowledge nested in document textlines. We introduce textline-region contrastive learning to achieve fine-grained alignment between the visual regions and the texts of document textlines. We further design masked region modeling and textline-grid matching to enhance the visual and layout representations of textlines. Experiments show that WUKONG-READER achieves superior performance on various VDU tasks in both English and Chinese. The fine-grained alignment over textlines also endows WUKONG-READER with promising localization ability.
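As a rough illustration of the textline-region contrastive idea, the sketch below implements a generic symmetric InfoNCE loss that pulls each textline's visual-region embedding toward its paired text embedding, treating the other pairs in the batch as negatives. This is a minimal assumption-laden sketch, not the paper's actual implementation: the encoders, temperature value, and batching strategy here are placeholders.

```python
import torch
import torch.nn.functional as F

def textline_contrastive_loss(region_emb: torch.Tensor,
                              text_emb: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style loss over N paired (region, text) textline
    embeddings of shape (N, D); off-diagonal pairs serve as negatives.
    The 0.07 temperature is a common default, not the paper's setting."""
    # Normalize so dot products become cosine similarities.
    region_emb = F.normalize(region_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (N, N) similarity matrix between all regions and all texts.
    logits = region_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy in both directions: region-to-text and text-to-region.
    loss_r2t = F.cross_entropy(logits, targets)
    loss_t2r = F.cross_entropy(logits.t(), targets)
    return (loss_r2t + loss_t2r) / 2
```

In practice the region embeddings would come from pooling visual features over each OCR-detected textline box, and the text embeddings from encoding the corresponding OCR text.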
Bai, H., Liu, Z., Meng, X., Li, W., Liu, S., Luo, Y., … Liu, Q. (2023). WUKONG-READER: Multi-modal Pre-training for Fine-grained Visual Document Understanding. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 13386–13401). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.acl-long.748