FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction

Abstract

The recent advent of self-supervised pre-training techniques has led to a surge in the use of multimodal learning for form document understanding. However, existing approaches that extend masked language modeling to other modalities require careful multitask tuning, complex reconstruction target designs, or additional pre-training data. In FormNetV2, we introduce a centralized multimodal graph contrastive learning strategy to unify self-supervised pre-training for all modalities under a single loss. The graph contrastive objective maximizes the agreement of multimodal representations, providing a natural interplay among all modalities without special customization. In addition, we extract image features within the bounding box that joins a pair of tokens connected by a graph edge, capturing more targeted visual cues without relying on a sophisticated, separately pre-trained image embedder. FormNetV2 establishes new state-of-the-art performance on the FUNSD, CORD, SROIE, and Payment benchmarks with a more compact model size.
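For readers who want a concrete picture of the two ideas in the abstract, the sketch below illustrates them in PyTorch. It is not the authors' released implementation; all names (graph_contrastive_loss, edge_union_box, the temperature value) are hypothetical. The first function is a standard InfoNCE-style objective that maximizes agreement between node (token) embeddings computed from two corrupted views of the same multimodal graph; the second computes the union bounding box of two tokens joined by a graph edge, the region from which edge-level image features would be pooled.

    import torch
    import torch.nn.functional as F

    def graph_contrastive_loss(z1: torch.Tensor, z2: torch.Tensor,
                               temperature: float = 0.1) -> torch.Tensor:
        """InfoNCE loss over node embeddings from two views of the same graph.

        z1, z2: [num_nodes, dim] embeddings of the same tokens under two
        stochastic corruptions (e.g., dropping edges or modality features).
        """
        z1 = F.normalize(z1, dim=-1)
        z2 = F.normalize(z2, dim=-1)
        logits = z1 @ z2.t() / temperature                      # pairwise similarities
        targets = torch.arange(z1.size(0), device=z1.device)    # positives on the diagonal
        # Symmetrized cross-entropy: each node should match its own counterpart
        # in the other view and be repelled from all other nodes.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    def edge_union_box(box_a, box_b):
        """Union of two token boxes (x0, y0, x1, y1): the image region that
        supplies visual cues for the edge connecting the two tokens."""
        return (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
                max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))

This is only meant to clarify the shape of the objective; the paper's actual formulation (graph corruption strategy, encoder, and feature pooling) is described in the full text linked in the citation below.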

Cite (APA)

Lee, C. Y., Li, C. L., Zhang, H., Dozat, T., Perot, V., Su, G., … Pfister, T. (2023). FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 9011–9026). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.acl-long.501
