Abstract
Character n-grams have been identified as the most successful feature in both single-domain and cross-domain Authorship Attribution (AA), but the reasons for their discriminative value have not been fully understood. We identify subgroups of character n-grams that correspond to linguistic aspects commonly claimed to be covered by these features: morphosyntax, thematic content and style. We evaluate the predictiveness of each of these groups in two AA settings: a single-domain setting and a cross-domain setting where multiple topics are present. We demonstrate that character n-grams that capture information about affixes and punctuation account for almost all of the power of character n-grams as features. Our study contributes new insights into the use of n-grams for future AA work and other classification tasks.
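The idea of splitting character n-grams into subgroups can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `categorize_ngrams`, the six category labels, and the boundary rule (any non-alphanumeric character, or the edge of the text, counts as a word boundary) are assumptions made for illustration only, and the abstract does not specify the exact categories used in the paper.

```python
import string
from collections import Counter

PUNCT = set(string.punctuation)


def is_boundary(ch):
    # Assumption: any non-alphanumeric character marks a word boundary.
    return not ch.isalnum()


def categorize_ngrams(text, n=3):
    """Count character n-grams, tagging each with a coarse category.

    The categories (punct, space, whole-word, prefix, suffix, mid-word)
    are illustrative and not taken from the paper.
    """
    counts = Counter()
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        # Treat the start and end of the text as word boundaries.
        before = text[i - 1] if i > 0 else " "
        after = text[i + n] if i + n < len(text) else " "
        if any(ch in PUNCT for ch in gram):
            cat = "punct"
        elif " " in gram:
            cat = "space"
        elif is_boundary(before) and is_boundary(after):
            cat = "whole-word"
        elif is_boundary(before):
            cat = "prefix"
        elif is_boundary(after):
            cat = "suffix"
        else:
            cat = "mid-word"
        counts[(cat, gram)] += 1
    return counts


if __name__ == "__main__":
    for (cat, gram), freq in sorted(categorize_ngrams("the cats, running").items()):
        print(f"{cat:10s} {gram!r} {freq}")
```

Keeping the category as part of the feature key means the same three characters can be counted separately as, say, a suffix and a mid-word n-gram, so a classifier could be trained on one subgroup at a time to compare their predictiveness.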
Cite
Sapkota, U., Bethard, S., Montes-Y-Gómez, M., & Solorio, T. (2015). Not all character N-grams are created equal: A study in authorship attribution. In NAACL HLT 2015 - 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference (pp. 93–102). Association for Computational Linguistics (ACL). https://doi.org/10.3115/v1/n15-1010