A large human-labeled corpus for online harassment research

Jennifer Golbeck; Zahra Ashktorab; Rashad O. Banjo; Alexandra Berlinger; Siddharth Bhagwan; Cody Buntain; Paul Cheakalos; Alicia A. Geller; Quint Gergory; Rajesh Kumar Gnanasekaran; Raja Rajan Gunasekaran; Kelly M. Hoffman; Jenny Hottle; Vichita Jienjitlert; Shivika Khare; Ryan Lau; Marianna J. Martindale; Shalmali Naik; Heather L. Nixon; Piyush Ramachandran; Kristine M. Rogers; Lisa Rogers; Meghna Sardana Sarin; Gaurav Shahane; Jayanee Thanki; Priyanka Vengataraman; Zijian Wan; Derek Michael Wu

Conference Proceedings

WebSci 2017 - Proceedings of the 2017 ACM Web Science Conference (2017) 229-233

DOI: 10.1145/3091478.3091509

158 Citations

141 Readers

Abstract

A fundamental part of conducting cross-disciplinary web science research is having useful, high-quality datasets that provide value to studies across disciplines. In this paper, we introduce a large, hand-coded corpus of online harassment data. A team of researchers collaboratively developed a codebook using grounded theory and labeled 35,000 tweets. Our resulting dataset has roughly 15% positive harassment examples and 85% negative examples. These data are useful for training machine learning models, identifying textual and linguistic features of online harassment, and studying the nature of harassing comments and the culture of trolling. Copyright is held by the owner/author(s).

Cite

APA

Golbeck, J., Ashktorab, Z., Banjo, R. O., Berlinger, A., Bhagwan, S., Buntain, C., … Wu, D. M. (2017). A large human-labeled corpus for online harassment research. In WebSci 2017 - Proceedings of the 2017 ACM Web Science Conference (pp. 229–233). Association for Computing Machinery, Inc. https://doi.org/10.1145/3091478.3091509
