Wrapper inference for ambiguous Web pages

Valter Crescenzi; Paolo Merialdo

Journal Article, Open Access

Applied Artificial Intelligence (2008) 22(1-2) 21-52

DOI: 10.1080/08839510701853093

23 Citations

9 Readers

Abstract

Several studies have concentrated on the generation of wrappers for web data sources. As wrappers can be easily described as grammars, the grammatical inference heritage could play a significant role in this research field. Recent results have identified a new subclass of regular languages, called prefix mark-up languages, which nicely abstracts the structures usually found in the HTML pages of large web sites. This class has been proven to be identifiable in the limit, and a polynomial-time (PTIME) unsupervised learning algorithm has previously been developed for it. Unfortunately, many real-life web pages do not fall into this class of languages. In this article we analyze the roots of the problem and propose a technique that transforms pages in order to bring them into the class of prefix mark-up languages. In this way, we obtain a practical solution without giving up the formal background defined within the grammatical inference framework. We report on experiments conducted on real-life web pages to evaluate the approach; the results demonstrate the effectiveness of the presented techniques.

Cite

Crescenzi, V., & Merialdo, P. (2008). Wrapper inference for ambiguous Web pages. Applied Artificial Intelligence, 22(1–2), 21–52. https://doi.org/10.1080/08839510701853093
