Introducing CAD: the Contextual Abuse Dataset


Abstract

Online abuse can inflict harm on users and communities, making online spaces unsafe and toxic. Progress in automatically detecting and classifying abusive content is often held back by the lack of high-quality, detailed datasets. We introduce a new dataset of primarily English Reddit entries which addresses several limitations of prior work. It (1) contains six conceptually distinct primary categories as well as secondary categories, (2) has labels annotated in the context of the conversation thread, (3) contains rationales, and (4) uses an expert-driven group-adjudication process for high-quality annotations. We report several baseline models to benchmark the work of future researchers. The annotated dataset, annotation guidelines, models and code are freely available.
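Since the abstract describes a hierarchical labelling scheme (primary categories with optional secondary labels), a minimal sketch of how one might load the released annotations and inspect that hierarchy is shown below. The file name (cad_v1.tsv) and column names (annotation_Primary, annotation_Secondary) are assumptions for illustration only; consult the released data and documentation for the actual schema.

```python
# Minimal sketch: load a hypothetical tab-separated export of the CAD
# annotations and count primary/secondary labels. Column and file names
# are assumed, not taken from the paper's released artefacts.
import csv
from collections import Counter

primary_counts = Counter()
secondary_counts = Counter()

with open("cad_v1.tsv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        # Each entry carries one of six primary categories; some categories
        # are further refined by a secondary label.
        primary_counts[row["annotation_Primary"]] += 1
        secondary = row.get("annotation_Secondary")
        if secondary:
            secondary_counts[secondary] += 1

print("Primary label distribution:", primary_counts.most_common())
print("Secondary label distribution:", secondary_counts.most_common(10))
```

A quick pass like this is a reasonable first step before training baselines, since the class distribution of abusive categories is typically highly skewed and informs choices such as stratified splits or class weighting.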

Citation (APA)

Vidgen, B., Nguyen, D., Margetts, H., Rossini, P., & Tromble, R. (2021). Introducing CAD: the Contextual Abuse Dataset. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2021) (pp. 2289–2303). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.182
