Abstract
A fundamental part of conducting cross-disciplinary web science research is having useful, high-quality datasets that provide value to studies across disciplines. In this paper, we introduce a large, hand-coded corpus of online harassment data. A team of researchers collaboratively developed a codebook using grounded theory and labeled 35,000 tweets. The resulting dataset has roughly 15% positive harassment examples and 85% negative examples. This data is useful for training machine learning models, for identifying textual and linguistic features of online harassment, and for studying the nature of harassing comments and the culture of trolling.
Cite
Golbeck, J., Ashktorab, Z., Banjo, R. O., Berlinger, A., Bhagwan, S., Buntain, C., … Wu, D. M. (2017). A large human-labeled corpus for online harassment research. In WebSci 2017 - Proceedings of the 2017 ACM Web Science Conference (pp. 229–233). Association for Computing Machinery, Inc. https://doi.org/10.1145/3091478.3091509