CCWrapper: Adaptive predefined schema guided web extraction

Abstract

In this paper, we propose CCWrapper (Classification-Cluster), a method for extracting target data items from web pages under the guidance of a predefined schema. CCWrapper combines several features of HTML nodes, including style, structure, thesaurus, and data type attributes, into one unified model, and generates extraction rules with Bayes classification in the training step. When a new HTML page is processed, CCWrapper computes, for each HTML node, the probability that it is a target element, and clusters the HTML nodes for extraction based on intra-document relationships in the HTML document tree. Preliminary experimental results on real-life web sites indicate that CCWrapper is a promising extraction method. © Springer-Verlag Berlin Heidelberg 2006.
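The classification step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a plain categorical naive Bayes model with add-one smoothing, and the feature names (`style`, `tag_path`, `thesaurus`, `dtype`), the schema labels (`title`, `price`, `other`), and the toy training nodes are all hypothetical. The clustering step is not covered, since the abstract does not specify how the intra-document relationships are used.

```python
import math
from collections import Counter, defaultdict


class NodeNaiveBayes:
    """Categorical naive Bayes over HTML node features (illustrative only)."""

    def __init__(self):
        self.label_counts = Counter()
        # (label, feature name) -> Counter of observed feature values
        self.value_counts = defaultdict(Counter)
        # feature name -> set of all values seen in training
        self.vocab = defaultdict(set)

    def fit(self, nodes, labels):
        for feats, label in zip(nodes, labels):
            self.label_counts[label] += 1
            for name, value in feats.items():
                self.value_counts[(label, name)][value] += 1
                self.vocab[name].add(value)
        return self

    def predict_proba(self, feats):
        """Return P(label | node features) for every known label."""
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            score = math.log(n / total)  # log prior
            for name, value in feats.items():
                counts = self.value_counts[(label, name)]
                # Add-one smoothing; the extra +1 reserves mass for unseen values.
                denom = sum(counts.values()) + len(self.vocab[name]) + 1
                score += math.log((counts[value] + 1) / denom)
            log_scores[label] = score
        # Normalize in a numerically stable way.
        peak = max(log_scores.values())
        weights = {k: math.exp(v - peak) for k, v in log_scores.items()}
        z = sum(weights.values())
        return {k: w / z for k, w in weights.items()}


# Hypothetical labeled nodes from training pages (assumed feature encoding).
train_nodes = [
    {"style": "bold", "tag_path": "div/h1", "thesaurus": "none", "dtype": "text"},
    {"style": "bold", "tag_path": "div/h1", "thesaurus": "none", "dtype": "text"},
    {"style": "plain", "tag_path": "div/span", "thesaurus": "price", "dtype": "currency"},
    {"style": "plain", "tag_path": "div/span", "thesaurus": "price", "dtype": "currency"},
    {"style": "plain", "tag_path": "div/p", "thesaurus": "none", "dtype": "text"},
]
train_labels = ["title", "title", "price", "price", "other"]

model = NodeNaiveBayes().fit(train_nodes, train_labels)

# A node from a new page, scored against each schema element.
new_node = {"style": "plain", "tag_path": "div/span", "thesaurus": "price", "dtype": "currency"}
probs = model.predict_proba(new_node)
best_label = max(probs, key=probs.get)
```

Under these assumptions, `probs` maps each schema label to a probability for the new node, and `best_label` is the most likely schema element. A fuller system would presumably feed these per-node probabilities into the clustering stage the abstract mentions.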

Cite

Gao, J., Yang, D., & Wang, T. (2006). CCWrapper: Adaptive predefined schema guided web extraction. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4016 LNCS, pp. 275–286). Springer Verlag. https://doi.org/10.1007/11775300_24
