Web discussion forums typically contain posts that fall into different categories such as question, solution, feedback, and spam. Automatic identification of these categories can aid information retrieval that is tailored to specific user requirements. A number of supervised methods have attempted to solve this problem; however, these depend on the availability of abundant training data. The few existing unsupervised and semi-supervised approaches either focus on identifying only one or two categories or do not discuss category-specific performance. In contrast, this work proposes methods for identifying multiple categories and also analyzes category-specific performance. These methods are based on sequence models (specifically, hidden Markov models) that can model language for each category using both probabilistic word and part-of-speech information, along with minimal manually specified features. The unsupervised version initializes the models using clustering, whereas the semi-supervised version uses a few manually labeled forum posts. Empirical evaluations demonstrate that these methods are more accurate than previous ones.
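The idea of modeling each category with its own hidden Markov model can be illustrated with a minimal sketch. This is not the authors' implementation: the two categories, the two-state models, the part-of-speech tags used as observations, and every probability below are hypothetical toy values chosen by hand, and the emission floor is a crude stand-in for proper smoothing. In the paper's setting the models would instead be initialized by clustering or from a few labeled posts. The sketch only shows the general scheme of scoring a post under each category's model with the forward algorithm and picking the most likely category.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of log values."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_log_likelihood(obs, start, trans, emit, floor=1e-6):
    """Log P(obs) under a discrete HMM, computed with the forward algorithm.

    start[s]    : initial probability of state s
    trans[p][s] : probability of moving from state p to state s
    emit[s]     : dict mapping a symbol to its emission probability in state s
    floor       : assumed probability for unseen symbols (crude smoothing)
    """
    n = len(start)

    def log_emit(state, symbol):
        return math.log(emit[state].get(symbol, floor))

    alpha = [math.log(start[s]) + log_emit(s, obs[0]) for s in range(n)]
    for symbol in obs[1:]:
        alpha = [
            log_sum_exp([alpha[p] + math.log(trans[p][s]) for p in range(n)])
            + log_emit(s, symbol)
            for s in range(n)
        ]
    return log_sum_exp(alpha)


def classify(obs, models):
    """Return the category whose HMM assigns obs the highest log-likelihood."""
    return max(models, key=lambda c: forward_log_likelihood(obs, *models[c]))


# Hypothetical toy models: (start, transition, emission) per category.
MODELS = {
    "question": (
        [0.8, 0.2],
        [[0.6, 0.4], [0.3, 0.7]],
        [{"WRB": 0.5, "MD": 0.3, "PRP": 0.2}, {"VB": 0.4, "NN": 0.4, "?": 0.2}],
    ),
    "solution": (
        [0.7, 0.3],
        [[0.5, 0.5], [0.4, 0.6]],
        [{"VB": 0.6, "PRP": 0.3, "MD": 0.1}, {"DT": 0.3, "NN": 0.6, ".": 0.1}],
    ),
}

# Posts represented as part-of-speech tag sequences (hypothetical examples).
question_post = ["WRB", "MD", "PRP", "VB", "NN", "?"]
solution_post = ["VB", "DT", "NN", "."]

print(classify(question_post, MODELS))
print(classify(solution_post, MODELS))
```

Working in log space avoids numerical underflow on longer posts, which matters because the likelihood of a sequence shrinks multiplicatively with its length.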
CITATION STYLE
Perumal, K., & Hirst, G. (2016). Semi-supervised and unsupervised categorization of posts in Web discussion forums using part-of-speech information and minimal features. In Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, WASSA 2016 at the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2016 (pp. 100–108). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w16-0417