To address the major challenges of current web document clustering, namely the huge volume of documents, the high dimensionality of the data, and the understandability of the resulting clusters, we propose a simple hybrid algorithm (SHDC) based on top-k frequent term sets and k-means. The top-k frequent term sets are used to produce k initial means, which are regarded as initial clusters and are further refined by k-means. The final optimized clustering is returned by k-means, while an understandable description of the clusters is provided by the k frequent term sets. Experimental results on two public datasets indicate that SHDC outperforms two other representative clustering algorithms (farthest-first k-means and randomly initialized k-means) in both efficiency and effectiveness. © Springer-Verlag Berlin Heidelberg 2007.
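The pipeline described in the abstract (frequent term sets seed the means, k-means refines them, and the term sets double as cluster labels) can be illustrated with a minimal Python sketch. This is not the authors' implementation. Several details are assumptions made only for illustration: documents are binary term vectors, term sets are limited to `max_size` terms, they are ranked by document support, term sets covering exactly the same documents as an earlier pick are skipped, and distances are squared Euclidean. The names `top_k_term_sets`, `kmeans`, and `shdc_sketch` are hypothetical.

```python
from itertools import combinations


def top_k_term_sets(docs, k, max_size=2):
    """Pick up to k frequent term sets, ranked by document support.

    Assumption: a term set whose covered documents are identical to an
    already chosen set is skipped, so the k seeds are distinct.
    """
    cover = {}  # term set (tuple) -> indices of documents containing it
    for i, doc in enumerate(docs):
        terms = sorted(set(doc))
        for size in range(1, max_size + 1):
            for term_set in combinations(terms, size):
                cover.setdefault(term_set, set()).add(i)
    # Highest support first; prefer longer (more descriptive) sets on ties.
    ranked = sorted(cover, key=lambda ts: (-len(cover[ts]), -len(ts), ts))
    chosen, seen = [], []
    for term_set in ranked:
        if cover[term_set] not in seen:
            chosen.append((term_set, cover[term_set]))
            seen.append(cover[term_set])
        if len(chosen) == k:
            break
    return chosen


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, means, max_iter=100):
    """Plain Lloyd iterations starting from the supplied means."""
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(means)), key=lambda j: sq_dist(v, means[j]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(len(means)):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old mean if a cluster becomes empty
                means[j] = centroid(members)
    return labels


def shdc_sketch(docs, k):
    """Seed k-means with frequent term sets; return labels and descriptions."""
    vocab = sorted({term for doc in docs for term in doc})
    vectors = [[1.0 if t in set(doc) else 0.0 for t in vocab] for doc in docs]
    seeds = top_k_term_sets(docs, k)
    # Each initial mean is the centroid of the documents its term set covers.
    means = [centroid([vectors[i] for i in sorted(covered)]) for _, covered in seeds]
    labels = kmeans(vectors, means)
    descriptions = [term_set for term_set, _ in seeds]
    return labels, descriptions


if __name__ == "__main__":
    docs = [
        ["web", "search", "engine"],
        ["web", "search", "ranking"],
        ["web", "search", "crawler"],
        ["gene", "protein", "sequence"],
        ["gene", "protein", "expression"],
    ]
    labels, descriptions = shdc_sketch(docs, k=2)
    print(labels)        # [0, 0, 0, 1, 1]
    print(descriptions)  # [('search', 'web'), ('gene', 'protein')]
```

Because the seeds come from term sets rather than random draws, the sketch is deterministic, and each cluster index maps directly to the term set that seeded it. If the corpus yields fewer than k term sets with distinct coverage, the sketch simply returns fewer clusters.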
CITATION STYLE
Wang, L., Tian, L., Jia, Y., & Han, W. (2007). A hybrid algorithm for web document clustering based on frequent term sets and k-means. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4537 LNCS, pp. 198–203). Springer Verlag. https://doi.org/10.1007/978-3-540-72909-9_20