Extracting general lists from web documents: A hybrid approach

Fabio Fumarola; Tim Weninger; Rick Barber; Donato Malerba; Jiawei Han

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2011) 6703 LNAI(PART 1) 285-294

DOI: 10.1007/978-3-642-21822-4_29

19 Citations

17 Readers

Abstract

The problem of extracting structured data (e.g., lists, record sets, and tables) from the Web has traditionally been approached by taking into account either the underlying markup structure of a Web page or its visual structure. However, empirical results show that considering the HTML structure and the visual cues of a Web page independently does not generalize well. We propose a new hybrid method to extract general lists from the Web. It employs both general assumptions on the visual rendering of lists and the structural representation of the items contained in them. We show that our method significantly outperforms existing methods across a varied Web corpus. © 2011 Springer-Verlag.

Cite (APA)

Fumarola, F., Weninger, T., Barber, R., Malerba, D., & Han, J. (2011). Extracting general lists from web documents: A hybrid approach. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 6703 LNAI, pp. 285–294). https://doi.org/10.1007/978-3-642-21822-4_29
