Improved Clustering of Documents using K-means Algorithm

  • Merlin Jacob
  • Anina John

Abstract

With the huge upsurge in information, it has become difficult to gather relevant information within a limited time. Clustering methods were introduced to ease this task by grouping relevant information into clusters, so the efficiency of clustering is one of the crucial requirements these methods must meet. Several methods and algorithms have been introduced. Hierarchical clustering is often portrayed as the higher-quality clustering approach, but it is limited by its time complexity. In contrast, K-means and its variants have a time complexity that is linear in the number of documents. A clustering method based on the hidden semantics within the documents is proposed here for better results. The proposed method extracts features from web documents using conditional random fields and builds a linguistic topological space based on the associations between features. The features used in this method are TF (Term Frequency) and IDF (Inverse Document Frequency). Together, TF and IDF values indicate how important a term is to a document in the given context. After the topics in the documents have been found using these features, the documents are clustered with K-means. An advantage of the K-means method is that it produces tighter clusters than hierarchical clustering, especially if the clusters are globular.
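The TF and IDF weighting and the K-means step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it omits the conditional-random-field feature extraction and the linguistic topological space, and it assumes one common TF-IDF variant (relative term frequency multiplied by log(N/df)), whitespace tokenization of non-empty documents, and centroids seeded from the first k documents. The function names tf_idf, sq_dist and k_means and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return (vocab, vectors) with one TF-IDF vector per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for doc in tokenized for t in set(doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append(
            [(counts[t] / len(doc)) * math.log(n / df[t]) for t in vocab]
        )
    return vocab, vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors, k, iterations=20):
    """Plain Lloyd-style K-means; returns one cluster label per vector.

    Assumption: the first k vectors seed the centroids, which keeps the
    sketch deterministic but is not a robust initialization.
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical sample documents covering two rough topics.
docs = [
    "cats purr and cats sleep",
    "stocks fall as markets slide",
    "kittens and cats purr",
    "markets rally as stocks rise",
]
vocab, vectors = tf_idf(docs)
labels = k_means(vectors, k=2)
```

Each assignment pass compares every document with each of the k centroids once, which is where the linear dependence on the number of documents comes from. On the sample above, the two animal-related documents should share one label and the two market-related documents the other.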

Cite

Merlin Jacob, & Anina John. (2016). Improved Clustering of Documents using K-means Algorithm. International Journal of Engineering Research And, V5(07). https://doi.org/10.17577/ijertv5is070358
