HDFS and MapReduce

  • Vohra D
Abstract

Apache Hadoop is a distributed framework for storing and processing large quantities of data. Taking each term in that statement in turn: "distributed" means that Hadoop runs across several (tens, hundreds, or even thousands of) nodes in a cluster. "Storing and processing" means that Hadoop uses two different frameworks: the Hadoop Distributed Filesystem (HDFS) for storage and MapReduce for processing, as illustrated in Figure 2-1. What makes Hadoop different from other distributed frameworks is that it can store large quantities of data and process them in parallel at a rate much faster than most other frameworks. "Large quantities" means on the order of hundreds or thousands of terabytes (TB) up to tens or hundreds of petabytes (PB). Yahoo email is hosted on Hadoop, and each Yahoo email account is provided with 1TB (1,000GB) of storage; most online storage providers provision much less in comparison. For example, Google online storage provides a free starter allocation of only 15GB. Yahoo processes 1.42TB per minute. This chapter has the following sections:

  • Hadoop distributed filesystem
  • MapReduce framework
  • Setting the environment
  • Hadoop cluster modes
  • Running a MapReduce job with the MR1 framework
  • Running MR1 in standalone mode

Figure 2-1. Apache Hadoop components
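The division of labor the abstract describes, with map tasks emitting key-value pairs that are grouped by key and aggregated by reduce tasks, can be sketched outside Hadoop entirely. The following is a minimal, illustrative simulation of the map, shuffle, and reduce phases for word counting in plain Python; the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical, not Hadoop APIs, and a real Hadoop job would distribute these phases across cluster nodes:

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (word, 1) pair for every word in every input record.
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as Hadoop does
    # between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: sum the grouped counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

records = ["hadoop stores data", "hadoop processes data"]
counts = reduce_phase(shuffle(map_phase(records)))
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

The same three-phase shape carries over to the MR1 word-count job run later in the chapter; only the execution moves from a single process to HDFS-backed tasks scheduled across the cluster.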

Citation (APA):
Vohra, D. (2016). HDFS and MapReduce. In Practical Hadoop Ecosystem (pp. 163–205). Apress. https://doi.org/10.1007/978-1-4842-2199-0_2
