Processing Theta-Joins using MapReduce ∗ Categories and Subject Descriptors

Alper Okcan

Journal Article

Processing Theta-Joins using MapReduce ∗ Categories and Subject Descriptors

Okcan A

Processing (2011) 949

ISSN: 07308078

N/ACitations

67Readers

Abstract

Joins are essential for many data analysis tasks, but are not supported directly by the MapReduce paradigm. While there has been progress on equi-joins, implementation of join algorithms in MapReduce in general is not sufficiently understood. We study the problem of how to map arbitrary join conditions to Map and Reduce functions, i.e., a parallel infrastructure that controls data flow based on key-equality only. Our proposed join model simplifies creation of and reasoning about joins in MapReduce. Using this model, we derive a surprisingly simple randomized algorithm, called 1-Bucket-Theta, for implementing arbitrary joins (theta-joins) in a single MapReduce job. This algorithm only requires minimal statistics (input cardinality) and we provide evidence that for a variety of join problems, it is either close to optimal or the best possible option. For some of the problems where 1-Bucket-Theta is not the best choice, we show how to achieve better performance by exploiting additional input statistics. All algorithms can be made 'memory-aware', and they do not require any modifications to the MapReduce environment. Experiments show the effectiveness of our approach.

Author supplied keywords

Cite

CITATION STYLE

APA

Okcan, A. (2011). Processing Theta-Joins using MapReduce ∗ Categories and Subject Descriptors. Processing, 949. Retrieved from http://portal.acm.org/citation.cfm?doid=1989323.1989423

Processing Theta-Joins using MapReduce ∗ Categories and Subject Descriptors

Abstract

Author supplied keywords

Cite

Register to see more suggestions