Hadoop is a distributed data processing platform that supports the MapReduce parallel computing framework. In practice there is often a need to accelerate Hadoop under particular workloads, such as Hive jobs. By writing the current time to the logs at carefully selected points, we traced the workflow of a typical MapReduce job generated by Hive and collected timing statistics for every phase of the job. Using different data volumes, we compared the proportion of time spent in each phase and located Hadoop's bottleneck points. We make two major optimization recommendations: (1) for big jobs that produce a large volume of intermediate results, focus on using a combiner and on optimizing network and disk I/O; (2) for short jobs, focus on optimizing the map function and disk I/O.
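The first recommendation, pre-aggregating intermediate results with a combiner so that less data crosses the shuffle, can be illustrated with a minimal word-count sketch. Plain Python stands in for Hadoop's Mapper/Combiner classes here (in a real job one would call `Job.setCombinerClass`); all function names below are illustrative, not part of any Hadoop API:

```python
from collections import defaultdict

def map_phase(lines):
    # Map step: emit a (word, 1) pair for every token.
    # This is the raw intermediate output of one mapper.
    return [(word, 1) for line in lines for word in line.split()]

def combine(pairs):
    # Combiner step: locally sum counts per key on the mapper
    # side, before the pairs are shuffled to reducers.
    sums = defaultdict(int)
    for key, value in pairs:
        sums[key] += value
    return sorted(sums.items())

lines = ["a b a", "b b c"]
raw = map_phase(lines)        # 6 intermediate records
combined = combine(raw)       # only 3 records cross the network
```

The combiner shrinks six intermediate records to three here; on jobs with heavily repeated keys the reduction, and hence the saving in shuffle network traffic and spill disk I/O, is far larger.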
CITATION STYLE
Li, J., Shi, S., & Wang, H. (2016). Optimization analysis of hadoop. In Communications in Computer and Information Science (Vol. 623, pp. 520–532). Springer Verlag. https://doi.org/10.1007/978-981-10-2053-7_46