Page-level wrapper verification for unsupervised web data extraction

Chia Hui Chang; Yen Ling Lin; Kuan Chen Lin; Mohammed Kayed

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2013) 8180 LNCS(PART 1) 454-467

DOI: 10.1007/978-3-642-41230-1_38

Abstract

Unsupervised information extraction has been studied extensively over the past decade, but its wrapper maintenance has received little attention. In this paper, we study the wrapper construction and verification problem based on a given schema and template induced by an unsupervised page-level wrapper induction system. We model verification as a constraint satisfaction problem (CSP) of assigning labels to leaf nodes, subject to constraints specified by a finite state machine (FSM) constructed from the previously learned schema and template. If the CSP has no solution, i.e., no valid label sequence exists, the test page fails verification; otherwise, we rank all valid label sequences by measuring the fitness of each label sequence for extraction. We evaluate the FSM-based approach with XML validation, using false positive rate and false negative rate, and measure extraction performance by extraction accuracy. The experimental results show that the proposed method effectively filters invalid pages (zero false positive rate) and ranks the correct label sequence highest with 96.5% accuracy. © 2013 Springer-Verlag.
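
The verification idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the FSM, the label names (`title`, `price`, `template`), the candidate labels per leaf node, and the `fitness` function are all hypothetical and chosen only to show the shape of the problem. Each leaf node has a set of candidate labels, a label sequence is valid if the FSM accepts it, a page with no valid sequence fails verification, and the valid sequences are ranked by a toy score.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical FSM (assumption, not from the paper): state -> {label: next state}.
TRANSITIONS: Dict[str, Dict[str, str]] = {
    "start": {"title": "after_title"},
    "after_title": {"price": "after_price", "template": "after_title"},
    "after_price": {"title": "after_title", "template": "after_price"},
}
START = "start"
ACCEPTING: Set[str] = {"after_price"}


def valid_label_sequences(candidates: List[List[str]]) -> List[List[str]]:
    """Backtracking search: pick one label per leaf node so the FSM accepts."""
    results: List[List[str]] = []

    def extend(i: int, state: str, prefix: List[str]) -> None:
        if i == len(candidates):
            if state in ACCEPTING:
                results.append(list(prefix))
            return
        for label in candidates[i]:
            nxt = TRANSITIONS.get(state, {}).get(label)
            if nxt is not None:  # label allowed by the FSM in this state
                prefix.append(label)
                extend(i + 1, nxt, prefix)
                prefix.pop()

    extend(0, START, [])
    return results


def fitness(seq: List[str]) -> float:
    """Toy score (assumption): fraction of leaves labeled as data, not template."""
    return sum(1 for label in seq if label != "template") / len(seq) if seq else 0.0


def verify(candidates: List[List[str]]) -> Tuple[bool, List[List[str]]]:
    """Return (passes, valid sequences ranked by descending fitness)."""
    ranked = sorted(valid_label_sequences(candidates), key=fitness, reverse=True)
    return bool(ranked), ranked


if __name__ == "__main__":
    page = [["title"], ["price"], ["title", "template"], ["price", "template"]]
    print(verify(page))
    print(verify([["price"], ["title"]]))  # no valid sequence: fails verification
```

Exhaustive backtracking is used here only for clarity; the fitness measure in the paper is not specified in the abstract, so the score above is a placeholder.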

Cite

APA

Chang, C. H., Lin, Y. L., Lin, K. C., & Kayed, M. (2013). Page-level wrapper verification for unsupervised web data extraction. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8180 LNCS, pp. 454–467). https://doi.org/10.1007/978-3-642-41230-1_38
