Impact of Small Files on Hadoop Performance: Literature Survey and Open Points

  • El-Sayed, T.
  • Badawy, M.
  • El-Sayed, A.

Abstract

Hadoop is an open-source framework written in Java and used for big data processing. It consists of two main components: the Hadoop Distributed File System (HDFS) and MapReduce. HDFS stores the data, while MapReduce distributes and processes application tasks in a distributed fashion. Recently, several researchers have employed Hadoop for processing big data. The results indicate that Hadoop performs well with large files (files larger than the DataNode block size). Nevertheless, Hadoop performance degrades with small files, i.e., files smaller than its block size. This is because small files consume the memory of both the DataNode and the NameNode and increase the execution time of applications (that is, they decrease MapReduce performance). In this paper, the small-file problem in Hadoop is defined, and the existing approaches to solving it are classified and discussed. In addition, some open points that must be considered when designing a better approach to improve Hadoop performance on small files are presented.
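One widely used class of approaches the small-file literature discusses is file consolidation: packing many small files into a single large container so the NameNode tracks one file (and a few blocks) instead of thousands. The sketch below is not drawn from the surveyed paper; it is a minimal, hedged illustration using Hadoop's standard SequenceFile API to merge the contents of a directory of small files, keyed by file name. The paths (/data/small-files, /data/packed.seq) are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inputDir = new Path("/data/small-files");   // hypothetical input directory of small files
        Path packed   = new Path("/data/packed.seq");    // hypothetical output SequenceFile

        // One SequenceFile holds all small files; the NameNode now stores
        // metadata for a single large file instead of one entry per small file.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(packed),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {

            for (FileStatus status : fs.listStatus(inputDir)) {
                if (!status.isFile()) {
                    continue;
                }
                byte[] content = new byte[(int) status.getLen()];
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    IOUtils.readFully(in, content, 0, content.length);
                }
                // Key = original file name, value = raw bytes of the small file.
                writer.append(new Text(status.getPath().getName()),
                              new BytesWritable(content));
            }
        }
    }
}
```

A MapReduce job can then read the packed file with SequenceFileInputFormat, so map tasks operate on large splits rather than launching one task per small file. This is only one of several approach families (archiving, consolidation, caching, etc.) that the survey classifies.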

Cite

APA

El-Sayed, T., Badawy, M., & El-Sayed, A. (2019). Impact of Small Files on Hadoop Performance: Literature Survey and Open Points. Menoufia Journal of Electronic Engineering Research, 28(1), 109–120. https://doi.org/10.21608/mjeer.2019.62728
