Processing Theta-Joins using MapReduce ∗ Categories and Subject Descriptors

  • Okcan A
ISSN: 07308078
N/ACitations
Citations of this article
67Readers
Mendeley users who have this article in their library.

Abstract

Joins are essential for many data analysis tasks, but are not supported directly by the MapReduce paradigm. While there has been progress on equi-joins, implementation of join algorithms in MapReduce in general is not sufficiently understood. We study the problem of how to map arbitrary join conditions to Map and Reduce functions, i.e., a parallel infrastructure that controls data flow based on key-equality only. Our proposed join model simplifies creation of and reasoning about joins in MapReduce. Using this model, we derive a surprisingly simple randomized algorithm, called 1-Bucket-Theta, for implementing arbitrary joins (theta-joins) in a single MapReduce job. This algorithm only requires minimal statistics (input cardinality) and we provide evidence that for a variety of join problems, it is either close to optimal or the best possible option. For some of the problems where 1-Bucket-Theta is not the best choice, we show how to achieve better performance by exploiting additional input statistics. All algorithms can be made 'memory-aware', and they do not require any modifications to the MapReduce environment. Experiments show the effectiveness of our approach.

Cite

CITATION STYLE

APA

Okcan, A. (2011). Processing Theta-Joins using MapReduce ∗ Categories and Subject Descriptors. Processing, 949. Retrieved from http://portal.acm.org/citation.cfm?doid=1989323.1989423

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free