Visually extracting data records from query result pages

Neil Anderson; Jun Hong

Conference Proceedings


Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 7808 LNCS, pp. 392-403 (2013)

DOI: 10.1007/978-3-642-37401-2_40

3 Citations

7 Readers


Abstract

Web databases are now pervasive. Query result pages are dynamically generated from these databases in response to user-submitted queries. Automatically extracting structured data from query result pages is a challenging problem, as the structure of the data is not explicitly represented. While humans have shown good intuition in visually understanding data records on a query result page as displayed by a web browser, no existing approach to data record extraction has made full use of this intuition. We propose a novel approach, in which we make use of the common sources of evidence that humans use to understand data records on a displayed query result page. These include structural regularity, and visual and content similarity between data records displayed on a query result page. Based on these observations we propose new techniques that can identify each data record individually, while ignoring noise items, such as navigation bars and adverts. We have implemented these techniques in a software prototype, rExtractor, and tested it using two datasets. Our experimental results show that our approach achieves significantly higher accuracy than previous approaches. © 2013 Springer-Verlag.
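
The abstract names structural regularity and visual similarity as evidence for locating data records. The sketch below illustrates that general idea only; it is not the rExtractor implementation, which the abstract does not describe in detail. It assumes each rendered block is available as a bounding box plus text. The Block type, the visual_key bucketing, the tolerance value and the sample page are all hypothetical, and content similarity is left out for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A rendered page block (hypothetical): bounding box plus visible text."""
    left: int
    top: int
    width: int
    height: int
    text: str


def visual_key(block, tolerance=10):
    """Bucket a block by left edge and width, so similar layouts share a key."""
    return (block.left // tolerance, block.width // tolerance)


def extract_records(blocks, min_records=2):
    """Return the largest group of repeated, visually similar blocks.

    Blocks whose layout occurs fewer than min_records times (for example a
    navigation bar or a single advert) are treated as noise and ignored.
    """
    groups = defaultdict(list)
    for block in blocks:
        groups[visual_key(block)].append(block)

    # Structural regularity: keep only layouts that repeat on the page.
    candidates = [g for g in groups.values() if len(g) >= min_records]
    if not candidates:
        return []

    # Assumption: the record region covers the most screen area.
    best = max(candidates, key=lambda g: sum(b.width * b.height for b in g))
    return sorted(best, key=lambda b: b.top)


# Hypothetical page: one navigation bar, three results, one advert.
page = [
    Block(0, 0, 1000, 40, "Home | Search | Help"),
    Block(200, 50, 600, 100, "Result A"),
    Block(200, 160, 600, 100, "Result B"),
    Block(200, 270, 600, 100, "Result C"),
    Block(820, 50, 160, 300, "Advert"),
]
records = extract_records(page)
print([r.text for r in records])
```

On the sample page the three result blocks share a left edge and width, so they form the only repeated group, while the navigation bar and advert each occur once and are dropped. A fuller treatment would also compare the content of candidate blocks, as the abstract suggests.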

Citation (APA)

Anderson, N., & Hong, J. (2013). Visually extracting data records from query result pages. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7808 LNCS, pp. 392–403). https://doi.org/10.1007/978-3-642-37401-2_40
