Learning transferable node representations for attribute extraction from web documents

Abstract

Given a web page, extracting an object along with various attributes of interest (e.g. price, publisher, author, and genre for a book) can facilitate a variety of downstream applications such as large-scale knowledge base construction, e-commerce product search, and personalized recommendation. Prior approaches have either relied on computationally expensive visual feature engineering or required large amounts of training data to get to an acceptable precision. In this paper, we propose a novel method, LeArNing TransfErable node RepresentatioNs for Attribute Extraction (LANTERN), to tackle the problem. We model the problem as a tree node tagging task. The key insight is to learn a contextual representation for each node in the DOM tree where the context explicitly takes into account the tree structure of the neighborhood around the node. Experiments on the SWDE public dataset show that LANTERN outperforms the previous state-of-the-art (SOTA) by 1.44% (F1 score) with a dramatically simpler model architecture. Furthermore, we report that utilizing data from a different domain (for instance, using training data about web pages with cars to extract book objects) is surprisingly useful and helps beat the SOTA by a further 1.37%.
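To make the "tree node tagging" framing concrete, the following is a minimal sketch of how each text node in a DOM tree might be paired with structural context from its neighborhood before a tagger assigns it an attribute label. It is not LANTERN's architecture, which the abstract does not detail. The `Node`, `TreeBuilder`, `text_nodes`, and `context` names are hypothetical helpers, the sample HTML is invented, and the parser assumes well-formed markup without unclosed void elements such as `<br>`.

```python
from html.parser import HTMLParser


class Node:
    """A DOM element or text node in a minimal tree (hypothetical helper)."""

    def __init__(self, tag, text="", parent=None):
        self.tag = tag
        self.text = text
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simple tree from well-formed HTML using only the stdlib."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, parent=self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Assumes every start tag has a matching end tag.
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.children.append(Node("#text", text, self.current))


def text_nodes(node):
    """Yield text nodes in document order."""
    if node.tag == "#text":
        yield node
    for child in node.children:
        yield from text_nodes(child)


def context(node, max_ancestors=2):
    """Collect a text node's ancestor tags and the text of sibling subtrees."""
    ancestors = []
    cur = node.parent
    while cur is not None and cur.tag != "root" and len(ancestors) < max_ancestors:
        ancestors.append(cur.tag)
        cur = cur.parent
    elem = node.parent
    neighbors = []
    if elem.parent is not None:
        for sib in elem.parent.children:
            if sib is not elem:
                neighbors.extend(t.text for t in text_nodes(sib))
    return {"text": node.text, "ancestors": ancestors, "neighbors": neighbors}


# Invented sample page: two label/value pairs for a book.
HTML = (
    "<div><span>Author</span><span>Jane Doe</span></div>"
    "<div><span>Price</span><span>$12</span></div>"
)

builder = TreeBuilder()
builder.feed(HTML)
examples = [context(n) for n in text_nodes(builder.root)]
```

Each entry in `examples` is one candidate node together with its local tree context. A learned model would map such an entry to a label such as `author`, `price`, or `none`. In this sketch the value node "Jane Doe" sees the neighboring label "Author", which is the kind of structural signal the abstract describes.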

Cite

Zhou, Y., Sheng, Y., Vo, N., Edmonds, N., & Tata, S. (2022). Learning transferable node representations for attribute extraction from web documents. In WSDM 2022 - Proceedings of the 15th ACM International Conference on Web Search and Data Mining (pp. 1479–1487). Association for Computing Machinery, Inc. https://doi.org/10.1145/3488560.3498424
