Abstract
Large language models (LLMs) have replaced the metaphorical monkeys in the “infinite monkeys” thought experiment with machines that mirror human writing. With LLMs being used to generate content at an unprecedented scale, concerns over their misuse and the saturation of the content space with artificially generated material are growing. We foresee a point in the future where a vast majority of all the possible text in a given language will have already been generated, leading to a “Plagiarism Singularity”. In this paper, we predict how far we are from this singularity by estimating the volume of content that must be generated to reach it. We use an LLM to calculate the probability distribution of sentences in the English language collected from a large dataset. We then estimate the minimum number of sentences that must be generated to cover different percentiles of the probability mass of the set of all sentences, assuming they follow the calculated distribution, by treating the problem as an instance of the coupon collector's problem. We find that breaching the standard 20% plagiarism limit would need only around 10^30 sentences to be generated, which we estimate will happen approximately 40 years from now.
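The coupon-collector framing in the abstract can be illustrated with a small sketch. This is not the authors' procedure: it assumes a known, finite list of sentence probabilities that sum to 1 and independent draws from that distribution, and it uses the expected covered probability mass after n draws, the sum over sentences of p_i * (1 - (1 - p_i)^n), as the coverage measure. The function names and the toy Zipf-like distribution are hypothetical and chosen only for illustration.

```python
import math


def expected_covered_mass(probs, n):
    """Expected probability mass of the distinct items seen after n i.i.d. draws.

    Assumes every p satisfies 0 < p < 1 (illustrative assumption).
    """
    # Item i appears at least once in n draws with probability 1 - (1 - p_i)^n.
    # expm1/log1p keep this accurate when p_i is very small.
    return sum(p * -math.expm1(n * math.log1p(-p)) for p in probs)


def min_draws_for_coverage(probs, target):
    """Smallest n whose expected covered mass reaches `target` (0 <= target < 1).

    Assumes `probs` sums to 1, so the target is always reachable.
    """
    if not 0.0 <= target < 1.0:
        raise ValueError("target must be in [0, 1)")
    if target == 0.0:
        return 0
    # Expected coverage grows with n, so bracket the answer by doubling,
    # then narrow it down with a binary search.
    lo, hi = 0, 1
    while expected_covered_mass(probs, hi) < target:
        lo, hi = hi, hi * 2
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if expected_covered_mass(probs, mid) >= target:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    # Toy Zipf-like distribution over 1000 hypothetical "sentences".
    weights = [1.0 / rank for rank in range(1, 1001)]
    total = sum(weights)
    probs = [w / total for w in weights]
    for target in (0.2, 0.5, 0.9):
        print(target, min_draws_for_coverage(probs, target))
```

Under a skewed distribution like the toy one above, low coverage targets are reached after very few draws because a handful of high-probability items carry much of the mass, while high targets require sampling far into the tail.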
Cite
Ranga, S., Mao, R., Cambria, E., & Chattopadhyay, A. (2025). The Plagiarism Singularity Conjecture. In Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies: Long Papers, NAACL-HLT 2025 (Vol. 1, pp. 10245–10255). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2025.naacl-long.514