DeTox: A Comprehensive Dataset for German Offensive Language and Conversation Analysis

Abstract

In this work, we present a publicly available offensive language dataset (the DeTox dataset) containing 10,278 annotated German social media comments collected in the first half of 2021. With twelve annotation categories annotated by six annotators, it is far more comprehensive than other datasets and goes beyond hate speech detection alone. In particular, the labels also cover toxicity, criminal relevance, and the types of discrimination expressed in comments. Furthermore, about half of the comments come from coherent parts of conversations, which makes it possible to take each comment's context into account and to analyze conversations in order to study the contagion of offensive language. The dataset is available in our GitHub repository: https://github.com/hdaSprachtechnologie/detox

Cite

Demus, C., Pitz, J., Schütz, M., Probol, N., Siegel, M., & Labudde, D. (2022). DeTox: A Comprehensive Dataset for German Offensive Language and Conversation Analysis. In WOAH 2022 - 6th Workshop on Online Abuse and Harms, Proceedings of the Workshop (pp. 143–153). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.woah-1.14
