Document clustering is a text mining technique in which a document collection is divided into meaningful clusters using a suitable distance or similarity measure, so the choice of measure plays an important role. Similar documents are assigned to the same cluster, while dissimilar documents are assigned to different clusters. This is achieved by minimizing the intra-cluster distance between documents and maximizing the distance between clusters. Distance measures used in document clustering include Euclidean distance, squared Euclidean distance, Minkowski distance, Chebyshev distance, power distance, percent disagreement, Manhattan distance, bit-vector distance, comparative-clustering distance, Huffman-code distance and dominance-based distance. In this paper we introduce a new similarity measure for document clustering, Bipartite Graph Energy Based Similarity (BGEBS), which is based on the energy of a bipartite graph representation of the document collection. We compare BGEBS with the Euclidean, Jaccard, cosine, Canberra, Manhattan and maximum distances, using k-means to form the clusters. We analyze the results on a synthetic data set containing 6 documents and also evaluate on a few benchmark data sets such as CLASSIC, WEBKB and BBC. To validate the measure we use an internal quality measure, the sum of squares within clusters (SSW). Judged by SSW, BGEBS compares well with the other distance measures.
Hannah Grace, G., & Desikan, K. (2019). Bipartite graph energy based similarity measure for document clustering. International Journal of Recent Technology and Engineering, 7(6), 194–200.
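The abstract does not state how BGEBS is computed, so the following Python sketch illustrates only the two standard quantities it relies on. The energy of a graph is the sum of the absolute values of the eigenvalues of its adjacency matrix, and SSW is the sum of squared distances from each point to the centroid of its cluster. Treating a document-term matrix as the biadjacency block of the bipartite graph is an assumption made here for illustration. The function names and toy data are hypothetical and are not taken from the paper.

```python
import numpy as np


def bipartite_adjacency(doc_term):
    """Build the full adjacency matrix of a bipartite graph.

    Assumption: documents form one vertex set, terms the other, and
    doc_term[i][j] is the weight of the edge between document i and term j.
    """
    b = np.asarray(doc_term, dtype=float)
    n_docs, n_terms = b.shape
    return np.block([
        [np.zeros((n_docs, n_docs)), b],
        [b.T, np.zeros((n_terms, n_terms))],
    ])


def graph_energy(adjacency):
    """Graph energy: sum of absolute eigenvalues of a symmetric adjacency matrix."""
    eigenvalues = np.linalg.eigvalsh(adjacency)
    return float(np.abs(eigenvalues).sum())


def ssw(points, labels):
    """Sum of squares within: squared distance of each point to its cluster centroid."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for cluster in np.unique(labels):
        members = points[labels == cluster]
        centroid = members.mean(axis=0)
        total += float(((members - centroid) ** 2).sum())
    return total


# Hypothetical toy data: 3 documents, 4 terms (binary term occurrence).
toy_doc_term = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
]
toy_energy = graph_energy(bipartite_adjacency(toy_doc_term))

# Hypothetical 2-D points with an assumed cluster assignment.
toy_points = [[0, 0], [0, 2], [10, 0], [10, 2]]
toy_labels = [0, 0, 1, 1]
toy_ssw = ssw(toy_points, toy_labels)

print("graph energy:", toy_energy)
print("SSW:", toy_ssw)
```

For a bipartite graph the nonzero adjacency eigenvalues come in plus and minus pairs equal to the singular values of the biadjacency block, so the energy equals twice the sum of those singular values. This gives a cheaper route for large document-term matrices. A lower SSW indicates tighter clusters, which is the sense in which the measures in the abstract are compared.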