Using structured tokens to identify webpages for data extraction

Ling Lin; Lizhu Zhou; Qi Guo; Gang Li

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2007) 4505 LNCS 241-252

DOI: 10.1007/978-3-540-72524-4_27

Abstract

As the web grows, more and more data has become available from webpages, such as product items from back-end databases. To provide efficient access to the data objects contained in these pages, data extraction plays an important role. However, identifying suitable webpages to feed into data extraction is a prerequisite and a non-trivial task. As a result, there is an increasing need for methods that can automatically identify the target pages on unknown websites. In this paper, we solve the problem by exploiting the structured-token features of the webpage content and applying a decision-tree-based classification algorithm to induce the structure information. Furthermore, a preliminary recognition of data objects is obtained to efficiently initiate the subsequent data extraction. We evaluate our approach on real-world data and achieve promising results. © Springer-Verlag Berlin Heidelberg 2007.

Citation (APA)

Lin, L., Zhou, L., Guo, Q., & Li, G. (2007). Using structured tokens to identify webpages for data extraction. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4505 LNCS, pp. 241–252). Springer Verlag. https://doi.org/10.1007/978-3-540-72524-4_27
