Due to the permanently growing amount of textual data, automatic methods for organizing the data are needed. Automatic text classification is one of these methods. It assigns documents to a set of classes based on the textual content of the document. Normally, the set of classes is hierarchically structured, but today's classification approaches ignore hierarchical structures, thereby losing valuable human knowledge. This thesis exploits the hierarchical organization of classes to improve accuracy and reduce computational complexity. Classification methods from machine learning, namely BoosTexter and the newly introduced Centroid-Boosting algorithm, are used for learning hierarchies. Two problems that arise in doing so are considered in this thesis: error propagation from higher-level nodes, and comparing decisions between independently trained leaf nodes. Experiments are performed on the Reuters 21578, the Reuters Corpus Volume 1 and the Ohsumed data sets, which are well known in the literature. Rocchio and Support Vector Machines, which are state-of-the-art algorithms in the field of text classification, serve as baseline classifiers. Algorithms are compared by applying statistical significance tests. Results show that, depending on the structure of a hierarchy, accuracy improves and computational complexity decreases due to hierarchical classification. Also, the introduced model for comparing leaf nodes yields an increase in performance.
CITATION STYLE
Granitzer, M. (2003). Hierarchical text classification using methods from machine learning. Master’s Thesis, Graz University of Technology. Retrieved from http://know-center.tugraz.at/wp-content/uploads/2010/12/2004_Dip_MGranitzer1.pdf