Measuring Contribution of HTML Features in Web Document Clustering

Esteban Meneses; Oldemar Rodríguez-Rojas

Journal Article, Open Access

CLEI Electronic Journal (2008) 11(2)

DOI: 10.19153/cleiej.11.2.7

Abstract

Documents in HTML format have many features to analyze, from the terms in special sections to the phrases that appear throughout the document. However, it is important to determine which features contribute the most to separating documents into classes. With this information, a feature that is expensive to compute and contributes little to the clustering process can be left out of the document representation. Using a novel representation model and the standard k-means algorithm, we found that terms in the body of the document contribute the most, followed by terms in other sections. Suffix trees contribute poorly in that scenario, while term order graphs have only a small influence on the partition. We used four known datasets to support these conclusions.

Citation (APA)

Meneses, E., & Rodríguez-Rojas, O. (2008). Measuring Contribution of HTML Features in Web Document Clustering. CLEI Electronic Journal, 11(2). https://doi.org/10.19153/cleiej.11.2.7
