Processing theta-joins using MapReduce

Abstract

Joins are essential for many data analysis tasks, but are not supported directly by the MapReduce paradigm. While there has been progress on equi-joins, implementation of join algorithms in MapReduce in general is not sufficiently understood. We study the problem of how to map arbitrary join conditions to Map and Reduce functions, i.e., a parallel infrastructure that controls data flow based on key-equality only. Our proposed join model simplifies creation of and reasoning about joins in MapReduce. Using this model, we derive a surprisingly simple randomized algorithm, called 1-Bucket-Theta, for implementing arbitrary joins (theta-joins) in a single MapReduce job. This algorithm only requires minimal statistics (input cardinality) and we provide evidence that for a variety of join problems, it is either close to optimal or the best possible option. For some of the problems where 1-Bucket-Theta is not the best choice, we show how to achieve better performance by exploiting additional input statistics. All algorithms can be made 'memory-aware', and they do not require any modifications to the MapReduce environment. Experiments show the effectiveness of our approach. © 2011 ACM.
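
The abstract names 1-Bucket-Theta but does not spell out its steps, so the following is only a minimal single-process sketch of the general idea, under stated assumptions. It assumes the cross product of the two inputs is viewed as a matrix that is cut into a uniform grid of cells, one cell per reducer. Each tuple of the first input is sent to every cell in one randomly chosen row band, and each tuple of the second input to every cell in one randomly chosen column band, so every pair of tuples meets in exactly one cell, where the join condition is checked. The function name `one_bucket_theta`, the parameters `grid_rows` and `grid_cols`, and the uniform grid itself are illustrative choices, not taken from the paper. In a real job the grid shape would presumably be derived from the input cardinalities and the number of reducers.

```python
import random
from collections import defaultdict


def one_bucket_theta(S, T, theta, grid_rows, grid_cols, seed=0):
    """Simulate a randomized grid-based theta-join in one map/reduce round.

    S, T       -- input lists of tuples (any Python values)
    theta      -- join condition, a function (s, t) -> bool
    grid_rows,
    grid_cols  -- hypothetical grid shape; one reducer per grid cell
    """
    rng = random.Random(seed)

    # "Shuffle" storage: reducer key (row, col) -> (S tuples, T tuples).
    buckets = defaultdict(lambda: ([], []))

    # Map phase for S: pick a random row band, replicate to all its cells.
    for s in S:
        row = rng.randrange(grid_rows)
        for col in range(grid_cols):
            buckets[(row, col)][0].append(s)

    # Map phase for T: pick a random column band, replicate to all its cells.
    for t in T:
        col = rng.randrange(grid_cols)
        for row in range(grid_rows):
            buckets[(row, col)][1].append(t)

    # Reduce phase: each cell joins only the tuples it received.
    # A pair (s, t) meets only in cell (row of s, col of t), so no
    # result is produced twice and none is missed.
    result = []
    for s_part, t_part in buckets.values():
        for s in s_part:
            for t in t_part:
                if theta(s, t):
                    result.append((s, t))
    return result


if __name__ == "__main__":
    S = list(range(6))
    T = list(range(6))
    # Inequality join: a condition that key equality alone cannot route.
    pairs = one_bucket_theta(S, T, lambda s, t: s < t, grid_rows=2, grid_cols=3)
    print(len(pairs))
```

The routing never looks at tuple contents, only at a random number, which is why the same scheme works for any join condition and needs no statistics beyond what is used to size the grid. The price is replication: each tuple of the first input is copied `grid_cols` times and each tuple of the second input `grid_rows` times.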

Okcan, A., & Riedewald, M. (2011). Processing theta-joins using MapReduce. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 949–960). Association for Computing Machinery. https://doi.org/10.1145/1989323.1989423
