FreeDOM: A Transferable Neural Architecture for Structured Information Extraction on Web Documents


Abstract

Extracting structured data from HTML documents is a long-studied problem with a broad range of applications, such as augmenting knowledge bases, supporting faceted search, and providing domain-specific experiences for key verticals like shopping and movies. Previous approaches have either required a small number of examples for each target site or relied on carefully handcrafted heuristics built over visual renderings of websites. In this paper, we present a novel two-stage neural approach, named FreeDOM, which overcomes both of these limitations. The first stage learns a representation for each DOM node in the page by combining text and markup information. The second stage captures longer-range distance and semantic relatedness using a relational neural network. By combining these stages, FreeDOM is able to generalize to unseen sites after training on a small number of seed sites from the same vertical, without requiring expensive hand-crafted features over visual renderings of the page. Through experiments on a public dataset with 8 different verticals, we show that FreeDOM beats the previous state of the art by nearly 3.7 F1 points on average, using neither features over rendered pages nor expensive hand-crafted features.
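To make the two-stage description more concrete, the following is a minimal sketch of the kind of input such a pipeline might consume. It is not the authors' implementation: both FreeDOM stages are learned neural models, whereas this code only gathers raw per-node text and markup signals and enumerates node pairs. The names `TextNodeCollector`, `extract_nodes`, `node_features`, and `node_pairs`, the chosen feature fields, and the toy movie page are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from itertools import combinations

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no closing tag


class TextNodeCollector(HTMLParser):
    """Collects every non-empty text node together with its tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []  # tags that are currently open
        self.nodes = []  # one dict per text node

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append({"text": text, "path": "/" + "/".join(self.stack)})


def extract_nodes(html):
    """Parse an HTML string and return its text nodes in document order."""
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def node_features(nodes):
    """Hypothetical stage-one input: text plus markup context for each node."""
    return [
        {
            "index": i,
            "text": node["text"],
            "path": node["path"],  # markup signal
            "prev_text": nodes[i - 1]["text"] if i > 0 else "",  # nearby label text
        }
        for i, node in enumerate(nodes)
    ]


def node_pairs(features):
    """Hypothetical stage-two input: (i, j, distance, same_path) for each node pair."""
    return [
        (a["index"], b["index"], b["index"] - a["index"], a["path"] == b["path"])
        for a, b in combinations(features, 2)
    ]


# Toy page, invented for this example.
page = (
    "<html><body><h1>Inception</h1>"
    "<div><span>Director:</span><span>Christopher Nolan</span></div>"
    "</body></html>"
)
nodes = extract_nodes(page)
features = node_features(nodes)
pairs = node_pairs(features)
```

In this sketch, the per-node dictionaries stand in for what a node encoder would embed, and the pair tuples stand in for what a relational model would score. Nothing here depends on rendering the page, which mirrors the abstract's point that visual features are not required.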

Cite

Lin, B. Y., Sheng, Y., Vo, N., & Tata, S. (2020). FreeDOM: A Transferable Neural Architecture for Structured Information Extraction on Web Documents. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1092–1102). Association for Computing Machinery. https://doi.org/10.1145/3394486.3403153
