Pivot-based similarity wide-joins fostering near-duplicate detection


Abstract

Monitoring systems that aim to improve decision making in emergency scenarios currently benefit from crowdsourced information. The main issue with this kind of data is that the gathered reports quickly become too similar to one another. Such overly similar reports, called near-duplicates, add no valuable knowledge to assist crisis control committees in their decision-making tasks. Current approaches to detecting near-duplicates are usually based on two-phase processing: the first phase relies on similarity queries or clustering techniques, and the second, most computationally costly phase refines the result of the first. Aiming to reduce that cost and to improve near-duplicate detection, we developed a framework based on the similarity wide-join database operator. This paper extends the wide-join definition so that it overcomes its restrictions, and provides an efficient pivot-based algorithm that speeds up the entire process while retrieving the most similar elements in a single pass. We also investigate alternatives and propose efficient algorithms for choosing the pivots. Experiments using real datasets show that our framework is up to three orders of magnitude faster than competing techniques in the literature, while also improving the quality of the result by about 35%.
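As an illustration of the general idea behind pivot-based filtering, the following is a minimal sketch and not the authors' algorithm. It assumes a metric distance (Euclidean here), a hand-picked pivot set, and a reading of the wide-join as a range join that keeps only the k most similar pairs; the function names `pivot_table` and `pivot_wide_join` and all parameters are hypothetical. Distances from every element to each pivot are precomputed, and the triangle inequality then gives a lower bound on each pair's distance, so pairs whose bound exceeds the radius are discarded without computing the actual distance.

```python
import math


def euclidean(a, b):
    """Euclidean distance, used here as an example metric."""
    return math.dist(a, b)


def pivot_table(points, pivots, dist=euclidean):
    """Precompute the distance from every point to every pivot."""
    return [[dist(p, v) for v in pivots] for p in points]


def pivot_wide_join(left, right, pivots, radius, k, dist=euclidean):
    """Return up to k closest (i, j, distance) pairs within `radius`,
    plus the number of full distance computations performed.

    Assumption: this is a simplified reading of a wide-join, not the
    operator as defined in the paper.
    """
    left_tab = pivot_table(left, pivots, dist)
    right_tab = pivot_table(right, pivots, dist)
    pairs = []
    computed = 0
    for i, a in enumerate(left):
        for j, b in enumerate(right):
            # Triangle inequality: |d(a, p) - d(b, p)| <= d(a, b) for any pivot p,
            # so the largest such gap is a lower bound on d(a, b).
            lower = max(abs(x - y) for x, y in zip(left_tab[i], right_tab[j]))
            if lower > radius:
                continue  # pruned without computing d(a, b)
            computed += 1
            d = dist(a, b)
            if d <= radius:
                pairs.append((i, j, d))
    pairs.sort(key=lambda t: t[2])  # most similar pairs first
    return pairs[:k], computed
```

With `left = [(0, 0), (5, 5)]`, `right = [(0, 1), (10, 10), (5, 4)]`, a single pivot at `(0, 0)`, and `radius=1.5`, only two of the six candidate pairs need a full distance computation in this sketch. How well pruning works depends on where the pivots sit relative to the data, which is why pivot selection matters.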

Cite

Carvalho, L. O., Santos, L. F. D., Traina, A. J. M., & Traina, C. (2017). Pivot-based similarity wide-joins fostering near-duplicate detection. In Lecture Notes in Business Information Processing (Vol. 291, pp. 81–104). Springer Verlag. https://doi.org/10.1007/978-3-319-62386-3_4
