Authorship attribution can be treated as a style-based text categorization problem. This paper presents an empirical study of style-based poetry categorization using the bag-of-words representation on 406 English poems on the same theme by five poets of the World War I era. We investigated the impact of stop-word removal, stemming, and feature selection on the categorization performance of the Support Vector Machine and the Naïve Bayes Classifier. We found that both models achieve their best performance when stop-word removal and stemming are not applied to the training datasets, and that the performance of the Naïve Bayes Classifier improves when feature selection is applied. We also compared the best categorization performance of the bag-of-words representation with that of a stylometric representation built on lexical features, such as function words and high-frequency words, and found that the bag-of-words representation outperforms the stylometric representation.
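As a rough illustration of the kind of pipeline the abstract describes, the sketch below builds a bag-of-words representation and fits a multinomial Naïve Bayes classifier in plain Python. It is not the authors' implementation: the function names, the tiny stop-word list, the add-one smoothing, and the four toy "poems" with the labels `poet_a` and `poet_b` are all assumptions made up for this example, and stemming, feature selection, and the Support Vector Machine are left out.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "of", "and", "in", "to"}


def bag_of_words(text, remove_stop_words=False):
    """Lower-case the text, split it into word tokens, and count them."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return Counter(tokens)


def train_naive_bayes(docs, labels, remove_stop_words=False):
    """Fit a multinomial Naive Bayes model (add-one smoothing is assumed)."""
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        bow = bag_of_words(text, remove_stop_words)
        word_counts[label].update(bow)
        vocab.update(bow)
    return {
        "priors": {c: math.log(n / len(labels)) for c, n in class_doc_counts.items()},
        "word_counts": word_counts,
        "totals": {c: sum(word_counts[c].values()) for c in class_doc_counts},
        "vocab": vocab,
        "remove_stop_words": remove_stop_words,
    }


def predict(model, text):
    """Return the label with the highest log-posterior score."""
    bow = bag_of_words(text, model["remove_stop_words"])
    vocab_size = len(model["vocab"])
    scores = {}
    for label, log_prior in model["priors"].items():
        score = log_prior
        for word, count in bow.items():
            if word not in model["vocab"]:
                continue  # words unseen in training are ignored
            likelihood = (model["word_counts"][label][word] + 1) / (
                model["totals"][label] + vocab_size
            )
            score += count * math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)


# Toy data invented for this sketch; not taken from the paper's corpus.
TOY_DOCS = [
    "the mud and the wire and the guns",
    "the guns in the mud of the trench",
    "a lark sang in the quiet morning",
    "the morning lark and a quiet field",
]
TOY_LABELS = ["poet_a", "poet_a", "poet_b", "poet_b"]

model_with_stop_words = train_naive_bayes(TOY_DOCS, TOY_LABELS)
model_without_stop_words = train_naive_bayes(TOY_DOCS, TOY_LABELS, remove_stop_words=True)
```

Calling `predict(model_with_stop_words, "guns and wire in the mud")` scores the line against each toy author. Training one model with and one without stop-word removal mirrors the comparison reported in the abstract, although a four-document toy corpus says nothing about which setting performs better on real poems.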
CITATION STYLE
Gallagher, C., & Li, Y. (2019). Text categorization for authorship attribution in English poetry. In Advances in Intelligent Systems and Computing (Vol. 858, pp. 249–261). Springer Verlag. https://doi.org/10.1007/978-3-030-01174-1_19