Web content extraction using clustering with web structure

Abstract

Web content extraction is an essential part of data preprocessing in web information systems. This paper proposes an algorithm for web content extraction based on clustering pages by their structure. The process has two steps. In the first step, web pages collected from different websites are clustered, using a page similarity measure based on dynamic programming over node weights: first, each web page is parsed into a DOM tree; second, a weight is assigned to every node according to the node's position, the number of nodes at the same depth, and the depth of the DOM tree; third, the similarity of two pages is calculated according to a given formula. When the first step finishes, web pages with similar structure are grouped into the same set. In the second step, pages in the same set are compared and the parts they share are removed, so that what remains is the web content. Experiments show that the proposed algorithm is effective and accurate.
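
The two steps can be illustrated with a minimal Python sketch. The abstract does not give the weighting or similarity formula, so everything specific here is an assumption: the weight `1 / (depth + 1)`, the weighted longest-common-subsequence similarity, the greedy threshold clustering, and all names (`PageParser`, `similarity`, `cluster`, `extract`) are hypothetical stand-ins rather than the authors' method.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # tags with no end tag


class PageParser(HTMLParser):
    """Flatten a page into (depth, tag) nodes and (tag path, text) blocks."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.nodes = []   # (depth, tag) in document order
        self.blocks = []  # (tag path, text) for every non-empty text node

    def handle_starttag(self, tag, attrs):
        self.nodes.append((len(self.stack), tag))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the nearest matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.blocks.append(("/".join(self.stack), text))


def parse(html):
    parser = PageParser()
    parser.feed(html)
    parser.close()
    return parser


def weight(depth):
    # ASSUMPTION: shallower nodes count more. The paper's actual weight also
    # uses node position and the number of nodes at the same depth.
    return 1.0 / (depth + 1)


def similarity(a, b):
    """Weighted LCS of two node sequences via dynamic programming, in [0, 1]."""
    m, n = len(a), len(b)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = max(dp[i - 1][j], dp[i][j - 1])
            if a[i - 1] == b[j - 1]:
                best = max(best, dp[i - 1][j - 1] + weight(a[i - 1][0]))
            dp[i][j] = best
    total = max(sum(weight(d) for d, _ in a), sum(weight(d) for d, _ in b))
    return dp[m][n] / total if total else 1.0


def cluster(pages, threshold=0.8):
    """Step 1: greedily group pages whose structure matches a group's first page."""
    groups = []
    for html in pages:
        parsed = parse(html)
        for group in groups:
            if similarity(parsed.nodes, group[0].nodes) >= threshold:
                group.append(parsed)
                break
        else:
            groups.append([parsed])
    return groups


def extract(group):
    """Step 2: drop text blocks shared by every page in the group."""
    if len(group) < 2:
        # Nothing to compare against, so no block can be identified as shared.
        return [[text for _, text in page.blocks] for page in group]
    shared = set(group[0].blocks)
    for page in group[1:]:
        shared &= set(page.blocks)
    return [[text for path, text in page.blocks if (path, text) not in shared]
            for page in group]
```

Calling `cluster(pages)` on a list of HTML strings and then `extract(group)` on each resulting group yields, per page, the text that is not common to the whole group. The `threshold` value is arbitrary and would need tuning on real data.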

Cite

Huang, X., Gao, Y., Huang, L., Zhang, Z., Li, Y., Wang, F., & Kang, L. (2017). Web content extraction using clustering with web structure. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10261 LNCS, pp. 95–103). Springer Verlag. https://doi.org/10.1007/978-3-319-59072-1_12
