The task of web information extraction is to extract target fields of an object from web pages, such as extracting the name, genre, and actors from a movie page. Recent sequential modeling approaches have achieved state-of-the-art results on web information extraction. However, most of these methods focus only on extracting information from textual sources while ignoring the rich information in other modalities such as images and web layout. In this work, we propose a novel MUltimodal Structural Transformer (MUST) that incorporates multiple modalities for web information extraction. Concretely, we develop a structural encoder that jointly encodes the multimodal information based on the HTML structure of the web layout, where high-level DOM nodes and low-level text and image tokens are introduced to represent the entire page. Structural attention patterns are designed to learn effective cross-modal embeddings for all DOM nodes and low-level tokens. An extensive set of experiments has been conducted on the WebSRC and Common Crawl benchmarks. Experimental results demonstrate the superior performance of MUST over several state-of-the-art baselines.
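The structural attention idea described above can be sketched as a masking scheme over a sequence that interleaves high-level DOM nodes with their low-level text/image tokens. The following is a minimal illustrative sketch, not the paper's exact implementation: it assumes one plausible pattern in which DOM nodes attend to all other DOM nodes, while each low-level token attends to its parent node and to sibling tokens under the same node. The function name and the `parent` mapping are hypothetical.

```python
import numpy as np

def structural_attention_mask(num_nodes, parent):
    """Build a boolean (L, L) attention mask (True = attention allowed).

    num_nodes: count of high-level DOM nodes, placed first in the sequence.
    parent: parent[i] is the DOM-node index owning low-level token i.
    """
    num_tokens = len(parent)
    L = num_nodes + num_tokens
    mask = np.zeros((L, L), dtype=bool)
    # High-level DOM nodes attend to every DOM node (global structural context).
    mask[:num_nodes, :num_nodes] = True
    for i, p in enumerate(parent):
        ti = num_nodes + i
        # A token and its parent DOM node attend to each other,
        # letting node embeddings aggregate cross-modal token information.
        mask[ti, p] = True
        mask[p, ti] = True
        # A token attends to sibling tokens under the same DOM node
        # (including itself), keeping local attention within one node.
        for j, q in enumerate(parent):
            if q == p:
                mask[ti, num_nodes + j] = True
    return mask

# Example: 2 DOM nodes; tokens 0 and 1 belong to node 0, token 2 to node 1.
m = structural_attention_mask(2, [0, 0, 1])
```

Under this sketch, sparsity comes from restricting low-level tokens to local (within-node) attention while the smaller set of DOM nodes carries page-wide context.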
CITATION STYLE
Wang, Q., Wang, J., Quan, X., Feng, F., Xu, Z., Nie, S., … Liu, D. (2023). MUSTIE: Multimodal Structural Transformer for Web Information Extraction. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 2405–2420). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.acl-long.135