Information extraction from webpages based on DOM distances

Carlos Castillo; Héctor Valero; José Guadalupe Ramos; Josep Silva

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2012) 7182 LNCS(PART 2) 181-193

DOI: 10.1007/978-3-642-28601-8_16

Abstract

Retrieving information from the Internet is a difficult task, as demonstrated by the lack of real-time tools able to extract information from webpages. The main cause is that most webpages on the Internet are implemented in plain (X)HTML, a language that lacks structured semantic information. For this reason, much of the effort in this area has been directed toward techniques for URL extraction. This field has produced good results, which modern search engines implement. In contrast, extracting information from a single webpage has produced poor results or very limited tools. In this work we define a novel technique for information extraction from single webpages or collections of interconnected webpages. The technique relies on DOM distances to retrieve information, which allows it to work with any webpage and, thus, to retrieve information online. Our implementation and experiments demonstrate the usefulness of the technique. © 2012 Springer-Verlag.
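
The abstract does not define how a DOM distance is measured, so the following is only a minimal sketch of one plausible reading: the number of tree edges between two nodes, counted through their lowest common ancestor. The names `Node`, `TreeBuilder`, `parse`, `find`, and `dom_distance` are hypothetical and are not taken from the paper; the parser also assumes well-formed markup in which every element has an explicit closing tag.

```python
from html.parser import HTMLParser


class Node:
    """A simplified DOM node with a parent pointer (hypothetical helper)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree.

    Assumption: every start tag has a matching end tag, so unclosed
    void elements such as <br> are not handled by this sketch.
    """

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def parse(html):
    """Parse an HTML string and return the root of the simplified tree."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def find(node, predicate):
    """Return all nodes under `node` (inclusive) that satisfy `predicate`, in document order."""
    found = [node] if predicate(node) else []
    for child in node.children:
        found.extend(find(child, predicate))
    return found


def ancestors(node):
    """Return the path from `node` up to the root, starting with `node` itself."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path


def dom_distance(a, b):
    """Count the edges between `a` and `b` through their lowest common ancestor."""
    steps_from_a = {id(n): i for i, n in enumerate(ancestors(a))}
    for steps_from_b, n in enumerate(ancestors(b)):
        if id(n) in steps_from_a:
            return steps_from_a[id(n)] + steps_from_b
    raise ValueError("nodes are not in the same tree")


page = (
    "<html><body>"
    "<div><h1>Title</h1><p>Relevant</p></div>"
    "<div><p>Footer</p></div>"
    "</body></html>"
)
root = parse(page)
heading = find(root, lambda n: n.tag == "h1")[0]
paragraphs = find(root, lambda n: n.tag == "p")
distances = [dom_distance(heading, p) for p in paragraphs]
```

Under this reading, a paragraph that shares a parent with the heading sits closer to it than a paragraph in a sibling container, so a distance threshold around a node known to be relevant could select nearby content. Whether the paper uses this measure or a different one cannot be determined from the abstract alone.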

Cite

Castillo, C., Valero, H., Ramos, J. G., & Silva, J. (2012). Information extraction from webpages based on DOM distances. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7182 LNCS, pp. 181–193). https://doi.org/10.1007/978-3-642-28601-8_16
