MapReduce is a brilliant distributed computing strategy for processing massive-scale data. For iterative applications, however, general MapReduce must re-initialize the runtime environment and reload static data in every iteration, which wastes a great deal of CPU time and I/O bandwidth. This paper presents MapCombine, a lightweight solution that improves the efficiency of iterative MapReduce. The main contributions of MapCombine are as follows: (1) To avoid re-initializing the runtime environment, a controller component is plugged into the general MapReduce model to schedule the iterations; (2) To process data without reloading the static subset, we modify the general MapReduce model around the combine phase to cache the fixed data and 4e the workload before processing; (3) To make communication between the controller and the combiners flexible while accounting for fault tolerance and downtime recovery, we add an interaction layer to the MapReduce implementation architecture. We also compare the performance of MapCombine and Mahout on four clustering algorithms and find that MapCombine provides an average speedup ratio of 1.14. © Springer-Verlag Berlin Heidelberg 2012.
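The idea behind contributions (1) and (2) can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the function names (`map_combine`, `reduce_partials`, `run_controller`), the 1-D k-means workload, and the in-memory list standing in for cached static data are all assumptions made for illustration. The point is only that a controller loop can drive every iteration over data that was loaded once, while the combine step shrinks each partition to small per-centroid partial sums before the reduce step.

```python
def map_combine(cached_points, centroids):
    """Assign each cached point to its nearest centroid and combine locally
    into per-centroid (sum, count) partials."""
    partial = {}
    for x in cached_points:
        k = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        s, n = partial.get(k, (0.0, 0))
        partial[k] = (s + x, n + 1)
    return partial


def reduce_partials(partials, centroids):
    """Merge the partials from all partitions and compute new centroids.
    A centroid that received no points keeps its previous value."""
    totals = {}
    for partial in partials:
        for k, (s, n) in partial.items():
            ts, tn = totals.get(k, (0.0, 0))
            totals[k] = (ts + s, tn + n)
    return [
        totals[k][0] / totals[k][1] if k in totals else c
        for k, c in enumerate(centroids)
    ]


def run_controller(partitions, centroids, max_iters=50, tol=1e-9):
    """Hypothetical controller: loads the static data once, then schedules
    iterations until the centroids stop moving or max_iters is reached."""
    cache = [list(p) for p in partitions]  # static data, loaded a single time
    iteration = 0
    for iteration in range(1, max_iters + 1):
        partials = [map_combine(part, centroids) for part in cache]
        updated = reduce_partials(partials, centroids)
        shift = max(abs(a - b) for a, b in zip(updated, centroids))
        centroids = updated  # only the small, changing state moves between iterations
        if shift <= tol:
            break
    return centroids, iteration


if __name__ == "__main__":
    final, iters = run_controller(
        [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]], [0.0, 5.0]
    )
    print(final, iters)
```

In this sketch only the centroid list changes between iterations. The partitions stay in `cache`, which is the property the abstract attributes to caching the fixed data around the combine phase.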
Xu, W., Gong, X., & Li, X. (2013). MapCombine: A lightweight solution to improve the efficiency of iterative MapReduce. Communications in Computer and Information Science, 332, 444–456. https://doi.org/10.1007/978-3-642-34447-3_40