A large human-labeled corpus for online harassment research

Abstract

A fundamental part of conducting cross-disciplinary web science research is having useful, high-quality datasets that provide value to studies across disciplines. In this paper, we introduce a large, hand-coded corpus of online harassment data. A team of researchers collaboratively developed a codebook using grounded theory and labeled 35,000 tweets. The resulting dataset has roughly 15% positive harassment examples and 85% negative examples. These data are useful for training machine learning models, identifying textual and linguistic features of online harassment, and studying the nature of harassing comments and the culture of trolling. Copyright is held by the owner/author(s).
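
The split reported in the abstract implies a clear class imbalance, which matters when the corpus is used to train a classifier. The sketch below works through the arithmetic, assuming the rounded figures from the abstract (35,000 tweets, 15% positive) rather than the exact label counts, which the abstract does not give. The function names and label strings are hypothetical, and the inverse-frequency weighting shown is one common heuristic, not a method described by the authors.

```python
def class_counts(total, positive_fraction):
    """Approximate positive/negative counts from a total and a positive share."""
    positive = round(total * positive_fraction)
    return positive, total - positive


def balanced_class_weights(positive, negative):
    """Inverse-frequency weights: total / (number_of_classes * class_count)."""
    total = positive + negative
    return {
        "harassment": total / (2 * positive),
        "not_harassment": total / (2 * negative),
    }


# Rounded figures taken from the abstract; the exact counts are an assumption.
positive, negative = class_counts(35_000, 0.15)
weights = balanced_class_weights(positive, negative)

print(positive, negative)  # approximate number of tweets per class
print(weights)             # the minority class receives the larger weight
```

With these assumed figures, the harassment class is outnumbered by more than five to one, so an unweighted model could reach about 85% accuracy by always predicting the negative class.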

Citation (APA)

Golbeck, J., Ashktorab, Z., Banjo, R. O., Berlinger, A., Bhagwan, S., Buntain, C., … Wu, D. M. (2017). A large human-labeled corpus for online harassment research. In WebSci 2017 - Proceedings of the 2017 ACM Web Science Conference (pp. 229–233). Association for Computing Machinery, Inc. https://doi.org/10.1145/3091478.3091509
