Web content extraction using clustering with web structure

Xiaotao Huang; Yan Gao; Liqun Huang; Zhizhao Zhang; Yuhua Li; Fen Wang; Ling Kang

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2017) 10261 LNCS 95-103

DOI: 10.1007/978-3-319-59072-1_12

Abstract

Web content extraction is an essential part of data preprocessing in web information systems. This paper proposes an algorithm for web content extraction based on clustering by web page structure. The process has two steps. In the first step, web pages collected from different websites are clustered, using a page similarity measure based on dynamic programming over node weights. Each web page is first parsed into a DOM tree; a weight is then assigned to every node according to the node's position, the number of nodes at the same depth, and the depth of the DOM tree; finally, the similarity of two pages is calculated from a given formula. When the first step is finished, web pages with similar structure are grouped into the same set. In the second step, pages in the same set are compared and the parts they share are removed, so that what remains is the web content. Experiments show that the proposed algorithm is effective and accurate.
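
The two steps described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the weight formula or the dynamic-programming procedure, so this sketch swaps in a simpler stand-in: a depth-weighted overlap of root-to-node tag paths. The names `PathCollector`, `structure_similarity`, `cluster`, and `extract_content`, the weight `1 / depth`, and the threshold `0.8` are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathCollector(HTMLParser):
    """Collects root-to-node tag paths and the text found under each path."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags
        self.paths = Counter()   # tag path -> number of nodes with that path
        self.texts = []          # (tag path, stripped text) pairs

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(("/".join(self.stack), text))


def parse(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector


def structure_similarity(a, b):
    """Depth-weighted overlap of tag paths, in [0, 1] (assumed measure)."""
    def weight(path):
        # Assumption: nodes near the root count more than deep nodes.
        return 1.0 / (path.count("/") + 1)

    shared = sum(weight(p) * min(a.paths[p], b.paths[p])
                 for p in a.paths.keys() & b.paths.keys())
    total = sum(weight(p) * max(a.paths[p], b.paths[p])
                for p in a.paths.keys() | b.paths.keys())
    return shared / total if total else 1.0


def cluster(pages, threshold=0.8):
    """Step 1: greedy grouping against the first page of each cluster."""
    clusters = []
    for page in pages:
        for group in clusters:
            if structure_similarity(group[0], page) >= threshold:
                group.append(page)
                break
        else:
            clusters.append([page])
    return clusters


def extract_content(group):
    """Step 2: drop (path, text) blocks that occur on every page of the group."""
    if len(group) < 2:
        return [[text for _, text in page.texts] for page in group]
    common = set(group[0].texts)
    for page in group[1:]:
        common &= set(page.texts)
    return [[text for path, text in page.texts if (path, text) not in common]
            for page in group]
```

Under these assumptions, two pages built from the same template land in one cluster, and `extract_content` keeps only the text that differs between them, such as the article body, while shared navigation and footer text is discarded. A cluster with a single page has nothing to compare against, so the sketch returns all of its text unchanged.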

Cite (APA)

Huang, X., Gao, Y., Huang, L., Zhang, Z., Li, Y., Wang, F., & Kang, L. (2017). Web content extraction using clustering with web structure. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10261 LNCS, pp. 95–103). Springer Verlag. https://doi.org/10.1007/978-3-319-59072-1_12
