Hadoop is a distributed data processing platform that supports the MapReduce parallel computing framework. In practice there is often a need to accelerate Hadoop under particular workloads, such as Hive jobs. By writing the current time to the logs at carefully selected points, we traced the workflow of a typical MapReduce job generated by Hive and collected timing statistics for every phase of the job. Using different data volumes, we compared the proportion of time spent in each phase and located Hadoop's bottleneck points. We make two major optimization recommendations: (1) for big jobs that produce a large volume of intermediate results, focus on using a combiner and on optimizing network and disk I/O; (2) for short jobs, focus on optimizing the map function and disk I/O.
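The first recommendation, pre-aggregating intermediate results with a combiner so that less data crosses the shuffle, can be illustrated with a minimal word-count sketch. Plain Python stands in for Hadoop's Mapper/Combiner classes here (in a real job one would call `Job.setCombinerClass`); all function names below are illustrative, not part of any Hadoop API:

```python
from collections import defaultdict

def map_phase(lines):
    # Map step: emit a (word, 1) pair for every token.
    # This is the raw intermediate output of one mapper.
    return [(word, 1) for line in lines for word in line.split()]

def combine(pairs):
    # Combiner step: locally sum counts per key on the mapper
    # side, before the pairs are shuffled to reducers.
    sums = defaultdict(int)
    for key, value in pairs:
        sums[key] += value
    return sorted(sums.items())

lines = ["a b a", "b b c"]
raw = map_phase(lines)        # 6 intermediate records
combined = combine(raw)       # only 3 records cross the network
```

The combiner shrinks six intermediate records to three here; on jobs with heavily repeated keys the reduction, and hence the saving in shuffle network traffic and spill disk I/O, is far larger.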
CITATION STYLE
Li, J., Shi, S., & Wang, H. (2016). Optimization analysis of hadoop. In Communications in Computer and Information Science (Vol. 623, pp. 520–532). Springer Verlag. https://doi.org/10.1007/978-981-10-2053-7_46