Distributed language representation for authorship attribution

Abstract

Distributed language representation (deep learning) has been applied successfully in different applications in natural language processing. Using this model, we propose and implement two new authorship attribution classifiers. In this approach, a vector-space representation can be generated for each author or disputed text according to words and their nearby context. To determine the authorship of a disputed text, the cosine similarity between vector representations can be applied. The proposed strategies can be adapted without difficulty to different languages (such as English and Italian) or genres (essays, political speeches, and newspaper articles). Evaluations using the k-nearest neighbors (k-NN) and based on four test collections (the Federalist Papers, the State of the Union addresses, the Glasgow Herald, and La Stampa newspapers) indicate that the distributed language representation performs well, sometimes providing better effectiveness than state-of-the-art methods such as k-NN, nearest shrunken centroids, chi-square, Delta, latent Dirichlet allocation, or multi-layer perceptron classifier.
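The attribution step described above can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: each document is reduced to the mean of its word vectors (here hand-made 3-dimensional vectors; in the paper these come from a distributed language model trained on the corpus), and the disputed text is assigned to the author whose profile vector has the highest cosine similarity. The vocabulary, author names, and vector values are all hypothetical.

```python
import math

# Hypothetical toy word vectors (assumption: 3 dimensions for brevity;
# real distributed representations typically have hundreds).
WORD_VECS = {
    "liberty": (0.9, 0.1, 0.0),
    "union":   (0.8, 0.2, 0.1),
    "press":   (0.1, 0.9, 0.2),
    "editor":  (0.0, 0.8, 0.3),
}

def doc_vector(words):
    """Average the vectors of known words into one document vector."""
    vecs = [WORD_VECS[w] for w in words if w in WORD_VECS]
    return tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(3))

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def attribute(disputed_words, author_profiles):
    """Assign the disputed text to the most cosine-similar author profile."""
    d = doc_vector(disputed_words)
    return max(author_profiles,
               key=lambda a: cosine(d, doc_vector(author_profiles[a])))

profiles = {
    "Author A": ["liberty", "union"],
    "Author B": ["press", "editor"],
}
print(attribute(["union", "liberty"], profiles))  # → Author A
```

A k-NN variant, as evaluated in the article, would instead compare the disputed vector against many individual training texts and take a majority vote over the k nearest ones.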

APA

Kocher, M., & Savoy, J. (2018). Distributed language representation for authorship attribution. Digital Scholarship in the Humanities, 33(2), 425–441. https://doi.org/10.1093/llc/fqx046
