Social networks sometimes become a medium for threats, insults, and other forms of cyberbullying. Because a huge number of people use online social networks, protecting users from anti-social behavior is an important task. A major part of that task is the automated detection of toxic comments containing threats, insults, obscenities, and the like. Bag-of-words and bag-of-symbols statistics are the typical features for toxic comment detection. This article studies, for the first time, how syntactic dependencies in sentences affect the quality of toxic comment detection in social networks. The syntactic dependencies considered are relationships involving proper nouns, personal pronouns, possessive pronouns, etc. Twenty syntactic features of sentences were tested in total. The paper shows that three additional features significantly improve the quality of toxic comment detection: the number of dependencies with proper nouns in the singular, the number of dependencies that contain bad words, and the number of dependencies between personal pronouns and bad words. The experiments are based on data from the Kaggle competition "Toxic Comment Classification Challenge". For our experiments, the original dataset of 159751 comments was reduced to 106590 comments because of problems with automated (human-free) extraction of the syntactic features. Because the dataset is unbalanced, we use the mean of the error rates for each type of misclassification as the quality metric. A decision tree is used as the classifier. The decision trees were built with two splitting rules: the Gini index and the deviance criterion.
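The quality metric mentioned above can be illustrated with a minimal sketch. It assumes a binary setting (1 = toxic, 0 = non-toxic) and reads "mean of the error rates for each type of misclassification" as the average of the false-negative rate and the false-positive rate; the abstract does not give the exact formula, so the function name and this interpretation are assumptions rather than the authors' definition.

```python
def mean_misclassification_rate(y_true, y_pred):
    """Average of the false-negative and false-positive rates.

    Assumes binary labels: 1 = toxic, 0 = non-toxic (an illustrative
    convention, not taken from the paper).
    """
    pairs = list(zip(y_true, y_pred))
    n_pos = sum(1 for t, _ in pairs if t == 1)
    n_neg = sum(1 for t, _ in pairs if t == 0)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present in y_true")

    # Toxic comments that were predicted as non-toxic.
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Non-toxic comments that were predicted as toxic.
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)

    fnr = false_neg / n_pos
    fpr = false_pos / n_neg
    return (fnr + fpr) / 2


# A classifier that labels everything non-toxic on an unbalanced sample:
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10
print(mean_misclassification_rate(y_true, y_pred))
```

On this sample, plain accuracy would be 80% even though every toxic comment is missed. Averaging the two error rates weights both kinds of mistake equally regardless of class sizes, which is why such a metric suits an unbalanced dataset.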
Shtovba, S., Shtovba, O., & Petrychko, M. (2019). Detection of social network toxic comments with usage of syntactic dependencies in the sentences. In CEUR Workshop Proceedings (Vol. 2353, pp. 313–323). CEUR-WS. https://doi.org/10.32782/cmis/2353-25