Processing theta-joins using MapReduce

Alper Okcan; Mirek Riedewald

Conference Proceedings

Processing theta-joins using MapReduce

Proceedings of the ACM SIGMOD International Conference on Management of Data (2011) 949-960

DOI: 10.1145/1989323.1989423

209Citations

79Readers

Get full text

Abstract

Joins are essential for many data analysis tasks, but are not supported directly by the MapReduce paradigm. While there has been progress on equi-joins, implementation of join algorithms in MapReduce in general is not sufficiently understood. We study the problem of how to map arbitrary join conditions to Map and Reduce functions, i.e., a parallel infrastructure that controls data flow based on key-equality only. Our proposed join model simplifies creation of and reasoning about joins in MapReduce. Using this model, we derive a surprisingly simple randomized algorithm, called 1-Bucket-Theta, for implementing arbitrary joins (theta-joins) in a single MapReduce job. This algorithm only requires minimal statistics (input cardinality) and we provide evidence that for a variety of join problems, it is either close to optimal or the best possible option. For some of the problems where 1-Bucket-Theta is not the best choice, we show how to achieve better performance by exploiting additional input statistics. All algorithms can be made 'memory-aware', and they do not require any modifications to the MapReduce environment. Experiments show the effectiveness of our approach. © 2011 ACM.

Author supplied keywords

Cite

CITATION STYLE

APA

Okcan, A., & Riedewald, M. (2011). Processing theta-joins using MapReduce. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 949–960). Association for Computing Machinery. https://doi.org/10.1145/1989323.1989423

Processing theta-joins using MapReduce

Abstract

Author supplied keywords

Cite

Register to see more suggestions