Abstract
In this work we compare several data-driven approaches to identifying an author's gender in texts written with or without gender imitation. The corpus was gathered specifically for this task through crowdsourcing. The best models are a convolutional neural network with morphological data as input (F1-measure: 88% ± 3) for texts without imitation, and a gradient boosting model with a vector of character n-gram frequencies as input (F1-measure: 64% ± 3) for texts with gender imitation. We also discuss a method for filtering the crowdsourced corpus using a limited reference sample of texts to increase the accuracy of the results.
Cite
Sboev, A., Moloshnikov, I., Gudovskikh, D., & Rybka, R. (2017). A comparison of Data Driven models of solving the task of gender identification of author in Russian language texts for cases without and with the gender deception. In Journal of Physics: Conference Series (Vol. 937). Institute of Physics Publishing. https://doi.org/10.1088/1742-6596/937/1/012046