An unsupervised technique to extract information from semi-structured web pages

Hassan A. Sleiman; Rafael Corchuelo

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 7651 LNCS, pp. 631-637 (2012)

DOI: 10.1007/978-3-642-35063-4_46

7 Citations

5 Readers

Abstract

We propose a technique that takes two or more web pages generated by the same server-side template and tries to learn a regular expression that represents it and helps extract relevant information from similar pages. Our experimental results on real-world web sites demonstrate that our technique outperforms others in terms of both effectiveness and efficiency and is not affected by HTML errors. © 2012 Springer-Verlag.
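The abstract does not describe the learning algorithm itself, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that two pages from the same template can be aligned token by token: shared tokens are treated as template text, and each region that differs becomes a capture group. The names `tokenize`, `learn_pattern`, and `extract`, as well as the sample pages, are hypothetical and chosen for illustration.

```python
import re
from difflib import SequenceMatcher

# Tokens are HTML tags, runs of non-space text, or runs of whitespace.
TOKEN = re.compile(r"<[^>]+>|[^<\s]+|\s+")


def tokenize(page):
    """Split a page into tags, words, and whitespace runs."""
    return TOKEN.findall(page)


def learn_pattern(page_a, page_b):
    """Build a regex from two pages assumed to share one template.

    Tokens common to both pages are kept as literal text; every
    region where the pages differ becomes a lazy capture group.
    """
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    parts = []
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op == "equal":
            parts.append(re.escape("".join(a[i1:i2])))
        else:
            parts.append("(.*?)")
    return re.compile("".join(parts), re.DOTALL)


def extract(pattern, page):
    """Return the captured fields, or None if the page does not fit."""
    match = pattern.fullmatch(page)
    return list(match.groups()) if match else None


# Hypothetical sample pages generated by one template.
page_a = "<html><h1>Dune</h1><p>Price: 9.99</p></html>"
page_b = "<html><h1>Emma</h1><p>Price: 4.50</p></html>"
page_c = "<html><h1>War and Peace</h1><p>Price: 12.00</p></html>"

pattern = learn_pattern(page_a, page_b)
fields = extract(pattern, page_c)  # ['War and Peace', '12.00']
```

This sketch handles only two input pages and flat, well-formed differences; it makes no attempt at the repeated or optional structures a full template-induction technique would need to cover.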

Cite

Sleiman, H. A., & Corchuelo, R. (2012). An unsupervised technique to extract information from semi-structured web pages. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7651 LNCS, pp. 631–637). https://doi.org/10.1007/978-3-642-35063-4_46
