Today, every new document added to the Web is augmented with semantic information (i.e., information about the content) that identifies the class of the document. This information is added explicitly as keywords, is implicit in structural elements such as the title and body text, or is encoded as objects and their relationships (rich data formats). However, documents added to the Web five or ten years ago do not contain semantic information. The objective of this paper is to cluster documents with missing semantic information. Clustering is performed with a frequent term-based method that exploits the lexical and structural relations between keywords in the document. The similarity histogram clustering algorithm is used to cluster the documents after deriving semantic information about the concepts that identify the class of each document. The results illustrate that concept-based clustering performs well compared with the statistical k-means clustering algorithm, but suffers from the problem of selecting a proper subset of frequent terms.
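To make the two ingredients named in the abstract more concrete, the following is a minimal sketch of frequent-term selection followed by a simplified similarity histogram clustering pass. It is not the authors' implementation: the function names, the sample documents, and the parameters `min_support`, `sim_threshold`, `hr_min`, and `eps` are all illustrative assumptions, plain cosine similarity over term counts stands in for the paper's concept-based similarity, and each document is assigned to the first cluster that accepts it, whereas the original algorithm may allow overlapping clusters.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def frequent_terms(docs, min_support=0.5):
    """Return terms that occur in at least `min_support` of the documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    return {t for t, n in doc_freq.items() if n / len(docs) >= min_support}


def vectorize(doc, vocab):
    """Count only the tokens that belong to the frequent-term vocabulary."""
    return Counter(t for t in tokenize(doc) if t in vocab)


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def histogram_ratio(vectors, sim_threshold):
    """Fraction of pairwise similarities at or above `sim_threshold`."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 1.0  # a singleton cluster is treated as fully coherent
    high = sum(1 for a, b in pairs if cosine(a, b) >= sim_threshold)
    return high / len(pairs)


def cluster_documents(docs, min_support=0.5, sim_threshold=0.3,
                      hr_min=0.5, eps=0.1):
    """Incrementally assign documents to clusters (lists of doc indices)."""
    vocab = frequent_terms(docs, min_support)
    vectors = [vectorize(d, vocab) for d in docs]
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            members = [vectors[j] for j in cluster]
            old = histogram_ratio(members, sim_threshold)
            new = histogram_ratio(members + [vec], sim_threshold)
            # Accept if coherence does not drop, or drops only slightly
            # while staying above the minimum histogram ratio.
            if new >= old or (new >= hr_min and old - new < eps):
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no cluster accepted the document
    return clusters


# Hypothetical sample documents, used only for illustration.
sample_docs = [
    "stock market shares trading price",
    "market trading shares investors",
    "football match goal team",
    "team goal football league",
]
sample_clusters = cluster_documents(sample_docs)
print(sample_clusters)
```

The sketch also shows why frequent-term selection matters: raising or lowering `min_support` changes the vocabulary, and with it every similarity value that feeds the histogram ratio.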
Anupriya, E., & Iyengar, N. C. S. N. (2014). Concept based clustering of documents with missing semantic information. In Advances in Intelligent Systems and Computing (Vol. 243, pp. 579–589). Springer Verlag. https://doi.org/10.1007/978-81-322-1665-0_57