As electronic documents become more popular, people want to find documents that are complete or partial duplicates of one another. In this paper, we propose a near-duplicate text detection framework that uses signatures to save space and query time. We also propose a novel signature selection algorithm that uses the collection frequency of q-grams. We compare our algorithm with Winnowing, one of the state-of-the-art signature selection algorithms, and show that it achieves much better accuracy at lower time and space cost. We perform extensive experiments to verify these conclusions. © 2013 Springer-Verlag.
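To make the idea of frequency-based signature selection concrete, the following is a minimal sketch, not the algorithm from the paper. The abstract does not state the selection rule, so the sketch assumes that a document's signatures are its k q-grams with the lowest collection frequency, and that candidate pairs are scored by Jaccard overlap of their signature sets. The function names and the parameters `q` and `k` are hypothetical.

```python
from collections import Counter


def qgrams(text, q=3):
    """Return the set of distinct character q-grams in text."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def collection_frequency(docs, q=3):
    """Count every occurrence of each q-gram across the whole collection."""
    freq = Counter()
    for doc in docs:
        freq.update(doc[i:i + q] for i in range(len(doc) - q + 1))
    return freq


def select_signatures(text, freq, k=4, q=3):
    """Assumed rule: keep the k rarest q-grams of text.

    Ties are broken lexicographically so the result is deterministic.
    """
    ranked = sorted(qgrams(text, q), key=lambda g: (freq.get(g, 0), g))
    return set(ranked[:k])


def signature_overlap(a, b):
    """Jaccard similarity of two signature sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    docs = [
        "the quick brown fox jumps",
        "the quick brown fox leaps",
        "an unrelated sentence here",
    ]
    freq = collection_frequency(docs)
    sigs = [select_signatures(d, freq) for d in docs]
    print(signature_overlap(sigs[0], sigs[1]))
    print(signature_overlap(sigs[0], sigs[2]))
```

Storing only k signatures per document, rather than every q-gram, is what saves space and query time in a scheme of this kind. Whether preferring rare q-grams matches the paper's actual selection criterion is an assumption of this sketch.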
Sun, Y., Qin, J., & Wang, W. (2013). Near duplicate text detection using frequency-biased signatures. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8180 LNCS, pp. 277–291). https://doi.org/10.1007/978-3-642-41230-1_24