MapReduce is a distributed programming model for large-scale data processing. Hadoop, an open-source implementation of the MapReduce programming model, has been widely used because of its good scalability and fault tolerance. However, by default the split size equals the Hadoop Distributed File System (HDFS) block size, so the number of map tasks in a job increases linearly with the number of blocks. When the input is large, the time spent managing splits and initializing map tasks is considerable. In this paper, we propose a scheme, Block Aggregation MapReduce (BAMR), which automatically increases the split size according to the input size in order to reduce the number of map tasks. With this scheme, the time spent managing splits and initializing map tasks is shortened. Experiments show that BAMR reduces the execution time significantly.
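The relationship between split size and map-task count described above can be illustrated with a small sketch. The abstract does not state the rule BAMR uses to pick the split size, so the function `aggregated_split_size`, its `max_map_tasks` parameter, and the 128 MiB block size below are hypothetical choices made only for illustration: the sketch groups whole blocks into one split until the task count falls under an assumed cap.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed HDFS block size (128 MiB), not taken from the paper


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def num_map_tasks(input_size: int, split_size: int) -> int:
    """One map task per split, so the count is the number of splits."""
    return ceil_div(input_size, split_size)


def aggregated_split_size(input_size: int,
                          block_size: int = BLOCK_SIZE,
                          max_map_tasks: int = 1000) -> int:
    """Hypothetical aggregation rule (NOT the paper's exact method):
    pack the smallest whole number of blocks into each split such that
    the number of map tasks does not exceed max_map_tasks."""
    blocks = ceil_div(input_size, block_size)
    blocks_per_split = max(1, ceil_div(blocks, max_map_tasks))
    return blocks_per_split * block_size


if __name__ == "__main__":
    one_tib = 1024 ** 4
    default_tasks = num_map_tasks(one_tib, BLOCK_SIZE)
    split = aggregated_split_size(one_tib)
    print("default map tasks:   ", default_tasks)
    print("blocks per split:    ", split // BLOCK_SIZE)
    print("aggregated map tasks:", num_map_tasks(one_tib, split))
```

With the default split size, the task count grows linearly with the number of blocks; under this assumed rule a small input keeps one block per split, while a large input is given larger splits and therefore fewer map tasks to manage and initialize.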
Li, J., Ai, L., & Ding, D. (2014). MapReduce performance optimization based on block aggregation. Advances in Intelligent Systems and Computing, 255, 853–861. https://doi.org/10.1007/978-81-322-1759-6_97