The aim of this work is to build a generic document clustering model that automatically groups related documents. The model combines unsupervised and supervised learning and assumes no prior knowledge of the given domain. No manual effort is required to create the training set; instead, the proposed model generates the training documents automatically and then uses them to categorize text documents. The process is divided into two steps. In the first step, an initial classification is performed in an unsupervised way: the K-means algorithm is applied to the unlabeled documents to prepare the training dataset. Text documents are represented as feature vectors in which the extracted keywords serve as features, and selected representative documents serve as the initial centroids. In the second step, a supervised classifier is trained on the documents categorized in the first step. A Naive Bayes classifier is used as the statistical text classifier, with word frequencies as features.
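The two-step process described above can be illustrated with a minimal, dependency-free sketch. It is not the author's implementation. Whitespace tokenization, raw term counts as features, squared Euclidean distance, Laplace smoothing, the toy documents, the hand-picked seed indices, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for keyword extraction.
    return [w for w in text.lower().split() if w.isalpha()]

def tf_vector(tokens, vocab):
    # Term-frequency vector over a fixed, ordered vocabulary.
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, seed_indices, iters=10):
    # Step 1: K-means whose initial centroids are chosen representative documents.
    centroids = [list(vectors[i]) for i in seed_indices]
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def train_nb(docs_tokens, labels, vocab):
    # Step 2: multinomial Naive Bayes trained on the cluster labels from step 1.
    log_prior, word_counts, totals = {}, {}, {}
    for c in sorted(set(labels)):
        log_prior[c] = math.log(labels.count(c) / len(labels))
        wc = Counter()
        for toks, lab in zip(docs_tokens, labels):
            if lab == c:
                wc.update(toks)
        word_counts[c] = wc
        totals[c] = sum(wc.values())
    return log_prior, word_counts, totals, set(vocab)

def predict_nb(model, tokens):
    log_prior, word_counts, totals, vocab = model
    v_size = len(vocab)
    def score(c):
        # Laplace (add-one) smoothing; words outside the vocabulary are ignored.
        return log_prior[c] + sum(
            math.log((word_counts[c][w] + 1) / (totals[c] + v_size))
            for w in tokens if w in vocab)
    return max(log_prior, key=score)

# Hypothetical toy corpus; documents 0 and 2 are the assumed representative seeds.
docs = [
    "stock market shares rise",
    "market investors buy shares",
    "football team wins match",
    "team scores goal in match",
]
docs_tokens = [tokenize(d) for d in docs]
vocab = sorted({w for toks in docs_tokens for w in toks})
vectors = [tf_vector(toks, vocab) for toks in docs_tokens]

labels = kmeans(vectors, seed_indices=[0, 2])    # automatically generated training labels
model = train_nb(docs_tokens, labels, vocab)     # supervised classifier on those labels
prediction = predict_nb(model, tokenize("investors buy stock"))
```

In this sketch the cluster indices produced by `kmeans` play the role of the automatically generated training labels, so no document is labeled by hand. How the representative seed documents are selected is not specified in the abstract and is left as an input here.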
Patra, R. (2021). Automated document categorization model. In Studies in Computational Intelligence (Vol. 907, pp. 19–36). Springer. https://doi.org/10.1007/978-3-030-50641-4_2