MapReduce Join Across Geo-Distributed Data Centers

Giuseppe Di Modica; Orazio Tomarchio

Conference Proceedings

Communications in Computer and Information Science, vol. 1054, pp. 18–31 (2019)

DOI: 10.1007/978-3-030-27355-2_2

Abstract

MapReduce is without doubt the parallel computation paradigm that has best interpreted and served the need, expressed in every field, to run fast and accurate analyses on Big Data. Its strength is the ability to exploit the computing power of a cluster of resources by distributing the load across multiple computing units, and to scale with the number of those units. Many data analysis algorithms are now available in MapReduce form: data sorting, data indexing, word counting, and relational joins, to name just a few. These algorithms have been observed to work well in computing contexts where the computing units (nodes) are connected by high-performance network links (on the order of gigabits per second). Unfortunately, when MapReduce runs on nodes that are geographically distant from each other, performance degrades dramatically. In such scenarios, the cost of moving data among nodes connected via geographic links offsets the benefit of parallelization. This paper discusses the issues of running MapReduce joins in a geo-distributed computing context and proposes to boost the performance of the join algorithm by leveraging a hierarchical computing approach.

Citation (APA)

Di Modica, G., & Tomarchio, O. (2019). MapReduce Join Across Geo-Distributed Data Centers. In Communications in Computer and Information Science (Vol. 1054, pp. 18–31). Springer. https://doi.org/10.1007/978-3-030-27355-2_2
