A Framework for Authorial Clustering of Shorter Texts in Latent Semantic Spaces

Rafi Trad; Myra Spiliopoulou

Conference Proceedings, Open Access

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2021), vol. 12695 LNCS, pp. 301–312

DOI: 10.1007/978-3-030-74251-5_24

Abstract

Authorial clustering involves grouping documents written by the same author or team of authors without any prior positive examples of an author’s writing style or thematic preferences. For authorial clustering of shorter texts (paragraph-length texts that are typically shorter than conventional documents), the document representation is particularly important. We propose a high-level framework that uses a compact data representation in a latent feature space derived with non-parametric topic modeling. Authorial clusters are then identified in two scenarios: (a) fully unsupervised and (b) semi-supervised, where a small number of shorter texts are known to belong to the same author (must-link constraints) or to different authors (cannot-link constraints). We report on experiments with 120 collections in three languages and two genres and show that the topic-based latent feature space provides a promising level of performance while reducing the dimensionality by a factor of 1500 compared to the state of the art. We also demonstrate that even a small amount of knowledge about constraints on authorial cluster membership leads to promising improvements on this difficult task.
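The must-link and cannot-link constraints mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the function names, the toy indices, and the choice of union-find are assumptions made purely for illustration. The sketch merges texts connected by must-link pairs into groups and counts how many constraints a given cluster labeling breaks.

```python
from collections import defaultdict


def must_link_groups(n_texts, must_link):
    """Merge texts connected by must-link pairs (transitive closure via union-find)."""
    parent = list(range(n_texts))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in must_link:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    groups = defaultdict(list)
    for i in range(n_texts):
        groups[find(i)].append(i)
    return sorted(groups.values())


def count_violations(labels, must_link, cannot_link):
    """Count the constraints that a cluster labeling breaks."""
    broken_ml = sum(labels[a] != labels[b] for a, b in must_link)
    broken_cl = sum(labels[a] == labels[b] for a, b in cannot_link)
    return broken_ml + broken_cl


# Hypothetical toy data: six short texts, indexed 0..5.
must_link = [(0, 1), (1, 2)]   # pairs assumed to share an author
cannot_link = [(2, 3)]         # pairs assumed to have different authors
groups = must_link_groups(6, must_link)
labels = [0, 0, 0, 1, 1, 2]    # a candidate clustering of the six texts
violations = count_violations(labels, must_link, cannot_link)
```

A semi-supervised clustering procedure could use such a violation count as a penalty or as a validity check; how the paper itself incorporates the constraints is not stated in the abstract.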

Citation (APA)

Trad, R., & Spiliopoulou, M. (2021). A Framework for Authorial Clustering of Shorter Texts in Latent Semantic Spaces. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 12695 LNCS, pp. 301–312). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-74251-5_24
