This work presents a method for automatically generating a training dataset for text document summarization using feature extraction. The goal is to design a dataset that helps a system summarize documents much as a human would. A document summary is a text produced from one or more source texts that conveys the important information in them. The proposed system consists of three stages: preprocessing, feature extraction, and generation of the training dataset. To implement the system, 50 test documents from DUC2002 are used. Each document is cleaned with preprocessing techniques: sentence segmentation, tokenization, stop-word removal, and word stemming. Eight important features are extracted for each sentence and converted into attributes of the training dataset. A high-quality training dataset is needed to achieve good document summarization, and the proposed system aims to generate a well-defined training dataset that is sufficiently large and noise-free for this task. The training dataset uses a set of common features that can be applied to all subtasks of data mining. A preliminary subjective evaluation shows that the training dataset is effective and efficient, and that the performance of the system is promising.
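The pipeline described above (preprocessing, then per-sentence feature extraction, then one dataset row per sentence) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the eight features, so the three shown here (`position`, `length`, `tf_score`) are illustrative assumptions, as are the tiny stop-word list, the toy suffix stripper standing in for a real stemmer, and all function names.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (assumption, not from the paper).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "for", "that", "it"}


def split_sentences(text):
    """Naive sentence segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", sentence.lower())


def crude_stem(word):
    """Toy suffix stripper standing in for a real stemmer such as Porter's."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(sentence):
    """Tokenize, drop stop words, and stem the remaining tokens."""
    return [crude_stem(t) for t in tokenize(sentence) if t not in STOP_WORDS]


def build_rows(text):
    """Return one attribute dictionary per sentence (illustrative features only)."""
    sentences = split_sentences(text)
    processed = [preprocess(s) for s in sentences]
    # Document-wide frequency of each stemmed term.
    doc_freq = Counter(t for tokens in processed for t in tokens)
    n = len(sentences)
    rows = []
    for i, tokens in enumerate(processed):
        rows.append({
            "sentence": sentences[i],
            # Earlier sentences score higher (assumed feature).
            "position": 1.0 - i / n,
            # Number of content tokens after preprocessing (assumed feature).
            "length": len(tokens),
            # Mean document frequency of the sentence's terms (assumed feature).
            "tf_score": sum(doc_freq[t] for t in tokens) / len(tokens) if tokens else 0.0,
        })
    return rows


if __name__ == "__main__":
    for row in build_rows("Cats are chasing mice. The mice escaped!"):
        print(row)
```

Each dictionary corresponds to one row of a training dataset, with the keys acting as attributes. A real system would replace the toy stemmer and stop-word list with established resources and would add a label column indicating whether the sentence belongs in the summary.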
CITATION STYLE
Hannah, E., & Mukherjee, S. (2014). An efficient training dataset generation method for extractive text summarization. In Advances in Intelligent Systems and Computing (Vol. 236, pp. 955–963). Springer Verlag. https://doi.org/10.1007/978-81-322-1602-5_101