Efficient parallel set-similarity joins using MapReduce

Abstract

In this paper we study how to efficiently perform set-similarity joins in parallel using the popular MapReduce framework. We propose a 3-stage approach for end-to-end set-similarity joins. We take as input a set of records and output a set of joined records based on a set-similarity condition. We efficiently partition the data across nodes in order to balance the workload and minimize the need for replication. We study both self-join and R-S join cases, and show how to carefully control the amount of data kept in main memory on each node. We also propose solutions for the case where, even if we use the most fine-grained partitioning, the data still does not fit in the main memory of a node. We report results from extensive experiments on real datasets, synthetically increased in size, to evaluate the speedup and scaleup properties of the proposed algorithms using Hadoop. © 2010 ACM.
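To make the join semantics concrete, the following is a minimal single-machine sketch of a set-similarity self-join. It is not the paper's 3-stage MapReduce algorithm. It assumes Jaccard similarity over whitespace-separated word tokens and a threshold of 0.5; the abstract specifies none of these, and the names `jaccard` and `naive_self_join` are hypothetical. It compares every pair of records, which is the quadratic baseline that partitioning across nodes is meant to avoid.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


def naive_self_join(records, threshold):
    """Return (id1, id2, similarity) for every record pair at or above threshold.

    `records` maps a record id to its text. Tokenization by lowercased
    whitespace splitting is an assumption made for illustration only.
    """
    token_sets = {rid: set(text.lower().split()) for rid, text in records.items()}
    result = []
    # Compare all unordered pairs of record ids: O(n^2) comparisons.
    for r1, r2 in combinations(sorted(token_sets), 2):
        sim = jaccard(token_sets[r1], token_sets[r2])
        if sim >= threshold:
            result.append((r1, r2, sim))
    return result


records = {
    1: "efficient parallel set similarity joins",
    2: "parallel set similarity joins using mapreduce",
    3: "query optimization in relational databases",
}
pairs = naive_self_join(records, 0.5)
```

In this toy input, records 1 and 2 share 4 of 7 distinct tokens, so they are the only pair that meets the threshold. An R-S join would compare records across two collections instead of within one.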

Cite

Vernica, R., Carey, M. J., & Li, C. (2010). Efficient parallel set-similarity joins using MapReduce. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 495–506). https://doi.org/10.1145/1807167.1807222
