Classifying blog posts by topic is useful for applications such as search and marketing. However, topic classification is time-consuming and error-prone, especially in an open domain such as the blogosphere. The state of the art relies on supervised methods that require considerable training effort and use the whole corpus vocabulary as features, which demands considerable memory. We show an effective alternative in which distant supervision is used to obtain training data: we use Wikipedia articles labelled with Freebase domains. We address the memory requirements by using only named entities as features. We test our classifier on a sample of blog posts and report up to 0.69 accuracy for multi-class labelling and 0.9 for binary classification.
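The idea of training on distantly labelled articles while keeping only named entities as features can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the capitalized-phrase regex is a crude stand-in for a real named-entity recognizer, the multinomial Naive Bayes classifier is a swapped-in choice, and the names `extract_entities`, `train`, `classify`, and `TOY_ARTICLES` are hypothetical. The toy articles merely play the role of Wikipedia articles labelled with Freebase domains.

```python
import math
import re
from collections import Counter, defaultdict

# Crude stand-in for a named-entity recognizer (assumption): runs of
# capitalized words. It also picks up sentence-initial words; a real
# system would use a proper NER tool instead.
ENTITY_RE = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*")


def extract_entities(text):
    """Return candidate named entities found in `text`, in order."""
    return ENTITY_RE.findall(text)


def train(labelled_articles):
    """Build entity counts per domain from (text, domain) pairs."""
    entity_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, domain in labelled_articles:
        doc_counts[domain] += 1
        entity_counts[domain].update(extract_entities(text))
    vocab = {e for counts in entity_counts.values() for e in counts}
    return entity_counts, doc_counts, vocab


def classify(text, model):
    """Pick the domain with the highest Naive Bayes log-score."""
    entity_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    # Entities never seen in training carry no evidence, so drop them.
    entities = [e for e in extract_entities(text) if e in vocab]
    best_domain, best_score = None, -math.inf
    for domain in sorted(doc_counts):
        counts = entity_counts[domain]
        denom = sum(counts.values()) + len(vocab)  # add-one smoothing
        score = math.log(doc_counts[domain] / total_docs)
        for e in entities:
            score += math.log((counts[e] + 1) / denom)
        if score > best_score:
            best_domain, best_score = domain, score
    return best_domain


# Hypothetical toy data standing in for distantly labelled articles.
TOY_ARTICLES = [
    ("Lionel Messi joined Barcelona and later won the World Cup.", "sports"),
    ("Serena Williams won Wimbledon several times.", "sports"),
    ("Miles Davis recorded Kind of Blue for Columbia Records.", "music"),
    ("Aretha Franklin signed with Atlantic Records.", "music"),
]

model = train(TOY_ARTICLES)
print(classify("I watched Lionel Messi play at Barcelona last night.", model))
```

Because the feature space contains only entity strings rather than every word in the corpus, the per-domain count tables stay small, which is the memory argument the abstract makes.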
Husby, S. D., & Barbosa, D. (2012). Topic Classification of Blog Posts Using Distant Supervision. Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, 28–36. Retrieved from http://aclweb.org/anthology/W/W12/W12-0604.pdf