MUSTIE: Multimodal Structural Transformer for Web Information Extraction

Abstract

The task of web information extraction is to extract target fields of an object from web pages, such as extracting the name, genre and actor from a movie page. Recent sequential modeling approaches have achieved state-of-the-art results on web information extraction. However, most of these methods only focus on extracting information from textual sources while ignoring the rich information from other modalities such as image and web layout. In this work, we propose a novel MUltimodal Structural Transformer (MUST) that incorporates multiple modalities for web information extraction. Concretely, we develop a structural encoder that jointly encodes the multimodal information based on the HTML structure of the web layout, where high-level DOM nodes, low-level text, and image tokens are introduced to represent the entire page. Structural attention patterns are designed to learn effective cross-modal embeddings for all DOM nodes and low-level tokens. An extensive set of experiments has been conducted on WebSRC and Common Crawl benchmarks. Experimental results demonstrate the superior performance of MUST over several state-of-the-art baselines.
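To make the abstract's idea of structural attention more concrete, below is a minimal, hypothetical sketch of how an attention mask over high-level DOM nodes and low-level text/image tokens might be constructed. The node layout, token assignments, and masking rules here are illustrative assumptions, not the paper's actual implementation.

# Hypothetical sketch of a structural attention mask; names and rules are assumptions.
import numpy as np

# Toy page: each DOM node lists its parent node and the low-level tokens (text/image) it owns.
dom_nodes = {
    0: {"parent": None, "tokens": []},       # e.g. <html>
    1: {"parent": 0,    "tokens": [0, 1]},   # e.g. <h1> with two text tokens
    2: {"parent": 0,    "tokens": [2, 3]},   # e.g. <img> with two image tokens
}
num_nodes = len(dom_nodes)
num_tokens = 4
size = num_nodes + num_tokens                # joint sequence: [DOM nodes | low-level tokens]

mask = np.zeros((size, size), dtype=bool)

# 1) Node-to-node attention follows the DOM tree (parent <-> child), plus self-attention.
for nid, info in dom_nodes.items():
    mask[nid, nid] = True
    if info["parent"] is not None:
        mask[nid, info["parent"]] = True
        mask[info["parent"], nid] = True

# 2) Tokens attend to the other tokens of the same node and to that node;
#    the node attends back to its tokens (cross-modal aggregation).
for nid, info in dom_nodes.items():
    for t in info["tokens"]:
        ti = num_nodes + t
        mask[ti, nid] = mask[nid, ti] = True
        for u in info["tokens"]:
            mask[ti, num_nodes + u] = True

print(mask.astype(int))

Such a mask would then be applied inside a standard Transformer self-attention layer so that embeddings for each DOM node are learned jointly from its text and image children and from neighboring nodes in the web layout.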

Cite

APA

Wang, Q., Wang, J., Quan, X., Feng, F., Xu, Z., Nie, S., … Liu, D. (2023). MUSTIE: Multimodal Structural Transformer for Web Information Extraction. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 2405–2420). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.acl-long.135
