Learning transferable node representations for attribute extraction from web documents

Yichao Zhou; Ying Sheng; Nguyen Vo; Nick Edmonds; Sandeep Tata

Conference Proceedings, Open Access

WSDM 2022 - Proceedings of the 15th ACM International Conference on Web Search and Data Mining (2022) 1479-1487

DOI: 10.1145/3488560.3498424

Abstract

Given a web page, extracting an object along with its attributes of interest (e.g., price, publisher, author, and genre for a book) can facilitate a variety of downstream applications such as large-scale knowledge base construction, e-commerce product search, and personalized recommendation. Prior approaches have either relied on computationally expensive visual feature engineering or required large amounts of training data to reach acceptable precision. In this paper, we propose a novel method, LeArNing TransfErable node RepresentatioNs for Attribute Extraction (LANTERN), to tackle the problem. We model the problem as a tree node tagging task. The key insight is to learn a contextual representation for each node in the DOM tree, where the context explicitly takes into account the tree structure of the neighborhood around the node. Experiments on the public SWDE dataset show that LANTERN outperforms the previous state of the art (SOTA) by 1.44% (F1 score) with a dramatically simpler model architecture. Furthermore, we report that utilizing data from a different domain (for instance, using training data from web pages about cars to extract book objects) is surprisingly useful and helps beat the SOTA by a further 1.37%.
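To make the idea of tree node tagging concrete, the sketch below shows one possible way to gather structural context for each text-bearing DOM node before any model is applied. It is not the LANTERN architecture, which the abstract does not specify. The names `Node`, `DomBuilder`, `node_context`, and `extract_contexts` are hypothetical, and the choice of context (ancestor tag path plus sibling text) is an assumption made purely for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; we do not descend into them.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A minimal DOM node: tag name, parent link, children, and own text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class DomBuilder(HTMLParser):
    """Builds a simple tree of Node objects under a synthetic 'root' node."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open ancestor with this tag, if there is one.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def iter_nodes(node):
    """Yield every node in the subtree in document order."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def node_context(node):
    """Assumed context: ancestor tag path and the text of sibling nodes."""
    ancestors = []
    parent = node.parent
    while parent is not None:
        ancestors.append(parent.tag)
        parent = parent.parent
    ancestors.reverse()
    siblings = [s for s in node.parent.children if s is not node] if node.parent else []
    return {
        "tag": node.tag,
        "text": node.text,
        "path": "/".join(ancestors + [node.tag]),
        "sibling_text": [s.text for s in siblings if s.text],
    }


def extract_contexts(html):
    """Return one context record per node that carries text."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return [node_context(n) for n in iter_nodes(builder.root) if n.text]


contexts = extract_contexts("<div><span>Price</span><span>$12.99</span></div>")
```

In this toy page, the record for the `$12.99` node carries the sibling text `Price`, which is the kind of neighborhood signal a tagger could use to label the node as a price. A learned model would replace these hand-picked fields with trained representations.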

Citation (APA)

Zhou, Y., Sheng, Y., Vo, N., Edmonds, N., & Tata, S. (2022). Learning transferable node representations for attribute extraction from web documents. In WSDM 2022 - Proceedings of the 15th ACM International Conference on Web Search and Data Mining (pp. 1479–1487). Association for Computing Machinery, Inc. https://doi.org/10.1145/3488560.3498424
