In this work, we present a publicly available offensive language dataset (DeTox-dataset) containing 10,278 annotated German social media comments collected in the first half of 2021. With twelve annotation categories applied by six annotators, it is far more comprehensive than other datasets and goes beyond hate speech detection alone. In particular, the labels also cover toxicity, criminal relevance, and the types of discrimination expressed in comments. Furthermore, about half of the comments come from coherent parts of conversations, which makes it possible to take each comment's context into account and to carry out conversation analyses in order to study the contagion of offensive language within conversations. The dataset is available in our GitHub repository: https://github.com/hdaSprachtechnologie/detox
Demus, C., Pitz, J., Schütz, M., Probol, N., Siegel, M., & Labudde, D. (2022). DeTox: A Comprehensive Dataset for German Offensive Language and Conversation Analysis. In WOAH 2022 - 6th Workshop on Online Abuse and Harms, Proceedings of the Workshop (pp. 143–153). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.woah-1.14