Impact of Small Files on Hadoop Performance: Literature Survey and Open Points

  • El-Sayed, T.
  • Badawy, M.
  • El-Sayed, A.

Abstract

Hadoop is an open-source framework written in Java and used for big data processing. It consists of two main components: the Hadoop Distributed File System (HDFS) and MapReduce. HDFS stores the data, while MapReduce distributes and processes application tasks in a distributed fashion. Recently, several researchers have employed Hadoop for processing big data. The results indicate that Hadoop performs well with large files (files larger than the DataNode block size). Nevertheless, Hadoop performance degrades with small files, i.e., files smaller than its block size. This is because small files consume the memory of both the DataNode and the NameNode and increase the execution time of applications (that is, they decrease MapReduce performance). In this paper, the small-file problem in Hadoop is defined, and the existing approaches to solving it are classified and discussed. In addition, some open points that must be considered when designing a better approach to improve Hadoop performance on small files are presented.
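One widely used class of approaches the small-file literature discusses is file consolidation: packing many small files into a single large container so the NameNode tracks one file (and a few blocks) instead of thousands. The sketch below is not drawn from the surveyed paper; it is a minimal, hedged illustration using Hadoop's standard SequenceFile API to merge the contents of a directory of small files, keyed by file name. The paths (/data/small-files, /data/packed.seq) are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inputDir = new Path("/data/small-files");   // hypothetical input directory of small files
        Path packed   = new Path("/data/packed.seq");    // hypothetical output SequenceFile

        // One SequenceFile holds all small files; the NameNode now stores
        // metadata for a single large file instead of one entry per small file.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(packed),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {

            for (FileStatus status : fs.listStatus(inputDir)) {
                if (!status.isFile()) {
                    continue;
                }
                byte[] content = new byte[(int) status.getLen()];
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    IOUtils.readFully(in, content, 0, content.length);
                }
                // Key = original file name, value = raw bytes of the small file.
                writer.append(new Text(status.getPath().getName()),
                              new BytesWritable(content));
            }
        }
    }
}
```

A MapReduce job can then read the packed file with SequenceFileInputFormat, so map tasks operate on large splits rather than launching one task per small file. This is only one of several approach families (archiving, consolidation, caching, etc.) that the survey classifies.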

Cite

APA

El-Sayed, T., Badawy, M., & El-Sayed, A. (2019). Impact of Small Files on Hadoop Performance: Literature Survey and Open Points. Menoufia Journal of Electronic Engineering Research, 28(1), 109–120. https://doi.org/10.21608/mjeer.2019.62728
