We propose a technique that takes two or more web pages generated by the same server-side template and attempts to learn a regular expression that represents the template and can be used to extract relevant information from similar pages. Our experimental results on real-world web sites show that our technique outperforms others in both effectiveness and efficiency and is not affected by HTML errors. © 2012 Springer-Verlag.
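The general idea of learning an extraction pattern from pages that share a template can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not describe in detail. It is a simplified stand-in that aligns the tokens of two pages with Python's `difflib.SequenceMatcher`, keeps the shared tokens as literal text, and turns each differing region into a capturing group. The names `tokenize` and `learn_regex` and the sample pages are hypothetical.

```python
import re
from difflib import SequenceMatcher

# Split HTML into tag tokens ("<b>") and text tokens ("Price: ").
TOKEN = re.compile(r"<[^>]+>|[^<]+")


def tokenize(html):
    """Return the page as a list of tag and text tokens, in order."""
    return TOKEN.findall(html)


def learn_regex(page_a, page_b):
    """Build a regex from two pages assumed to share one template.

    Tokens common to both pages are treated as template text and kept
    as escaped literals; regions where the pages differ are assumed to
    hold data and become non-greedy capturing groups.
    """
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    parts = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            parts.append("".join(re.escape(t) for t in a[i1:i2]))
        else:
            parts.append("(.*?)")
    return re.compile("".join(parts), re.DOTALL)


# Hypothetical sample pages produced by the same template.
page_a = "<html><body><h1>Dune</h1><p>Price: <b>9.99</b></p></body></html>"
page_b = "<html><body><h1>Emma</h1><p>Price: <b>4.50</b></p></body></html>"
page_c = "<html><body><h1>Ulysses</h1><p>Price: <b>12.00</b></p></body></html>"

pattern = learn_regex(page_a, page_b)
match = pattern.fullmatch(page_c)
extracted = match.groups() if match else None
```

Here `extracted` holds the title and price of the third, unseen page. Unlike the technique summarised above, this sketch uses only two pages, does not handle optional or repeated template parts, and makes no attempt to tolerate HTML errors.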
CITATION STYLE
Sleiman, H. A., & Corchuelo, R. (2012). An unsupervised technique to extract information from semi-structured web pages. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7651 LNCS, pp. 631–637). https://doi.org/10.1007/978-3-642-35063-4_46