Not all character N-grams are created equal: A study in authorship attribution

Abstract

Character n-grams have been identified as the most successful feature in both single-domain and cross-domain Authorship Attribution (AA), but the reasons for their discriminative value were not fully understood. We identify subgroups of character n-grams that correspond to linguistic aspects commonly claimed to be covered by these features: morphosyntax, thematic content and style. We evaluate the predictiveness of each of these groups in two AA settings: a single-domain setting and a cross-domain setting where multiple topics are present. We demonstrate that character n-grams that capture information about affixes and punctuation account for almost all of the power of character n-grams as features. Our study contributes new insights into the use of n-grams for future AA work and other classification tasks.
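
As a rough illustration of what splitting character n-grams into subgroups can look like, the sketch below tags each n-gram in a text with a coarse category based on where it falls relative to word boundaries and punctuation. The function name `categorize_ngrams` and the category labels (`prefix`, `suffix`, `mid-word`, `whole-word`, `space`, `punct`) are assumptions made for this example; the abstract does not spell out the exact categories or rules used in the study, so this should not be read as the authors' scheme.

```python
import string
from collections import Counter


def categorize_ngrams(text, n=3):
    """Count character n-grams, each tagged with a coarse category.

    The categories are illustrative assumptions, not the paper's definitions.
    """
    counts = Counter()
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        if any(ch in string.punctuation for ch in gram):
            category = "punct"        # n-gram contains a punctuation mark
        elif " " in gram:
            category = "space"        # n-gram spans a word boundary
        else:
            # Look at the characters just outside the n-gram; treat the
            # text edges as if they were whitespace.
            before = text[i - 1] if i > 0 else " "
            after = text[i + n] if i + n < len(text) else " "
            starts_word = not before.isalnum()
            ends_word = not after.isalnum()
            if starts_word and ends_word:
                category = "whole-word"
            elif starts_word:
                category = "prefix"   # first n characters of a longer word
            elif ends_word:
                category = "suffix"   # last n characters of a longer word
            else:
                category = "mid-word"
        counts[(category, gram)] += 1
    return counts


counts = categorize_ngrams("the writer, writing")
affix_and_punct = {
    key: value
    for key, value in counts.items()
    if key[0] in {"prefix", "suffix", "punct"}
}
```

Keeping the category as part of the feature key makes it possible to train a classifier on one subgroup at a time (for example, only the affix-like and punctuation entries collected in `affix_and_punct`) and compare it against the full n-gram set.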

Cite

Sapkota, U., Bethard, S., Montes-Y-Gómez, M., & Solorio, T. (2015). Not all character N-grams are created equal: A study in authorship attribution. In NAACL HLT 2015 - 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference (pp. 93–102). Association for Computational Linguistics (ACL). https://doi.org/10.3115/v1/n15-1010
