Entropy-based authorship search in large document collections

Abstract

The purpose of authorship search is to identify documents written by a particular author in large document collections. Standard search engines match documents to queries based on topic and are therefore not applicable to authorship search. In this paper we propose an approach to authorship search based on information theory: documents are ranked by the relative entropy of style markers, a measure inspired by the language models used in information retrieval. Our experiments on collections of newswire texts show that, with simple style markers and sufficient training data, documents by a particular author can be found accurately within large collections. Although effectiveness degrades as collection size increases, even with 500,000 documents nearly half of the top-ranked documents are correct matches. We have also found that the authorship search approach can be used for authorship attribution, and that it is much more scalable than state-of-the-art approaches in terms of collection size and number of candidate authors. © Springer-Verlag Berlin Heidelberg 2007.
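
To make the ranking idea concrete, here is a minimal sketch of ordering documents by the relative entropy (Kullback-Leibler divergence) between an author's style-marker distribution and each document's distribution. It is an illustration only, not the paper's implementation: the marker list `MARKERS`, the additive smoothing constant `alpha`, the whitespace tokenization, and all function names are assumptions introduced here, since the abstract does not specify them.

```python
import math
from collections import Counter

# Assumed style markers: a tiny illustrative list of function words.
MARKERS = ["the", "of", "and", "to", "in", "a", "that", "but"]

def marker_counts(text):
    """Count occurrences of each style marker (naive whitespace tokenization)."""
    tokens = text.lower().split()
    counts = Counter(t for t in tokens if t in MARKERS)
    return [counts[m] for m in MARKERS]

def smoothed_dist(counts, alpha=0.5):
    """Additive smoothing (an assumption) so every marker has non-zero probability."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]

def relative_entropy(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

def rank_by_author(author_text, documents):
    """Return document indices ordered from lowest to highest divergence."""
    p = smoothed_dist(marker_counts(author_text))
    scores = [relative_entropy(p, smoothed_dist(marker_counts(d)))
              for d in documents]
    return sorted(range(len(documents)), key=lambda i: scores[i])
```

A lower divergence means the document's use of the markers is closer to the author's known writing, so the first indices returned by `rank_by_author` are the strongest candidate matches. In practice the author model would be built from a substantial body of training text, consistent with the abstract's remark that sufficient training data is needed.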

Cite

Zhao, Y., & Zobel, J. (2007). Entropy-based authorship search in large document collections. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4425 LNCS, pp. 381–392). Springer Verlag. https://doi.org/10.1007/978-3-540-71496-5_35
