Generating Cross-Domain Text Classification Corpora from Social Media Comments

Abstract

In natural language processing (NLP), cross-domain text classification problems such as cross-topic, cross-genre, or cross-language authorship attribution are characterized by different contexts for training and testing data. That is, learning algorithms trained on the specific properties of the training data have to make predictions on test data with substantially different properties. To date, the corpora used for analyses of cross-domain problems have been limited in size and variation, which reduces the expressive power and generalizability of the proposed solutions. In this paper, we present a methodological framework and toolset for dynamically creating cross-domain datasets from millions of Reddit comments. We show that different types of cross-domain datasets, such as cross-topic or cross-lingual corpora, can be constructed, and we demonstrate a wide variety of use cases, including previously infeasible analyses like cross-lingual authorship attribution on original, non-translated texts. Using state-of-the-art authorship attribution methods, we show the potential of a cross-topic corpus generated by our framework compared to the corpora used in related approaches, and thereby enable research that was previously limited by corpus availability.
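
To make the cross-topic setting concrete, the following is a minimal sketch and not the authors' toolset. It assumes each comment is a dictionary with the hypothetical keys `author`, `subreddit`, and `body`, and it treats a subreddit as a topic. The function name `cross_topic_split` and the sample data are invented for illustration.

```python
from collections import defaultdict


def cross_topic_split(comments, train_topic, test_topic, min_per_side=1):
    """Return (train, test) lists of (author, text) pairs.

    Only authors with at least `min_per_side` comments in BOTH topics are
    kept, so every author in the test set is also seen during training.
    The comment keys used here are assumptions, not a documented schema.
    """
    by_author = defaultdict(lambda: {train_topic: [], test_topic: []})
    for c in comments:
        if c["subreddit"] in (train_topic, test_topic):
            by_author[c["author"]][c["subreddit"]].append(c["body"])

    train, test = [], []
    for author, topics in sorted(by_author.items()):
        enough = min(len(topics[train_topic]), len(topics[test_topic]))
        if enough >= min_per_side:
            train += [(author, text) for text in topics[train_topic]]
            test += [(author, text) for text in topics[test_topic]]
    return train, test


# Hypothetical sample data: only "alice" wrote in both topics.
comments = [
    {"author": "alice", "subreddit": "politics", "body": "a1"},
    {"author": "alice", "subreddit": "gaming", "body": "a2"},
    {"author": "bob", "subreddit": "politics", "body": "b1"},
    {"author": "carol", "subreddit": "gaming", "body": "c1"},
    {"author": "alice", "subreddit": "movies", "body": "a3"},
]
train, test = cross_topic_split(comments, "politics", "gaming")
```

In this sketch, training texts come only from the first topic and test texts only from the second, which mirrors the train/test context mismatch the abstract describes. A cross-lingual variant could follow the same pattern by grouping on a language label instead of a subreddit.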

Cite

Murauer, B., & Specht, G. (2019). Generating Cross-Domain Text Classification Corpora from Social Media Comments. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11696 LNCS, pp. 114–125). Springer Verlag. https://doi.org/10.1007/978-3-030-28577-7_7
