Using grammatical inference to automate information extraction from the web

Theodore W. Hong; Keith L. Clark

Conference Proceedings, Open Access

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2001), Vol. 2168, pp. 216–227

DOI: 10.1007/3-540-44794-6_18

23 Citations

10 Readers

Abstract

The World-Wide Web contains a wealth of semistructured information sources that often give partial or overlapping views of the same domains, such as real estate listings or book prices. These partial sources could be used more effectively if integrated into a single view; however, because they are typically formatted in diverse ways for human viewing, extracting their data for integration is difficult. Existing learning systems for this task generally use hardcoded ad hoc heuristics, are restricted in the domains and structures they can recognize, and/or require manual training. We describe a principled method, based on grammatical inference, for automatically generating extraction wrappers; the method can recognize general structures and does not rely on manually labelled examples. Domain-specific knowledge is explicitly separated out in the form of declarative rules. The method is demonstrated in a test setting by extracting real estate listings from web pages and integrating them into an interactive data visualization tool based on dynamic queries.

Citation (APA)

Hong, T. W., & Clark, K. L. (2001). Using grammatical inference to automate information extraction from the web. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 2168, pp. 216–227). Springer Verlag. https://doi.org/10.1007/3-540-44794-6_18
