FlumeJava: Easy, Efficient Data-Parallel Pipelines

Craig Chambers, Ashish Raniwala, Frances Perry, Stephen Adams, Robert R. Henry, Robert Bradshaw, Nathan Weizenbaum
Google, Inc.
{chambers,raniwala,fjp,sra,rrh,robertwb,nweiz}@google.com

Abstract

MapReduce and similar systems significantly ease the task of writing data-parallel code. However, many real-world computations require a pipeline of MapReduces, and programming and managing such pipelines can be difficult. We present FlumeJava, a Java library that makes it easy to develop, test, and run efficient data-parallel pipelines. At the core of the FlumeJava library are a couple of classes that represent immutable parallel collections, each supporting a modest number of operations for processing them in parallel. Parallel collections and their operations present a simple, high-level, uniform abstraction over different data representations and execution strategies. To enable parallel operations to run efficiently, FlumeJava defers their evaluation, instead internally constructing an execution plan dataflow graph. When the final results of the parallel operations are eventually needed, FlumeJava first optimizes the execution plan, and then executes the optimized operations on appropriate underlying primitives (e.g., MapReduces). The combination of high-level abstractions for parallel data and computation, deferred evaluation and optimization, and efficient parallel primitives yields an easy-to-use system that approaches the efficiency of hand-optimized pipelines. FlumeJava is in active use by hundreds of pipeline developers within Google.

Categories and Subject Descriptors D.1.3 [Concurrent Programming]: Parallel Programming

General Terms Algorithms, Languages, Performance

Keywords data-parallel programming, MapReduce, Java

1. Introduction

Building programs to process massive amounts of data in parallel can be very hard. MapReduce [6–8] greatly eased this task for data-parallel computations.
It presented a simple abstraction to users for how to think about their computation, and it managed many of the difficult low-level tasks, such as distributing and coordinating the parallel work across many machines, and coping robustly with failures of machines, networks, and data. It has been used very successfully in practice by many developers. MapReduce's success in this domain inspired the development of a number of related systems, including Hadoop [2], LINQ/Dryad [20], and Pig [3].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
PLDI'10, June 5–10, 2010, Toronto, Ontario, Canada
Copyright © 2010 ACM 978-1-4503-0019-3/10/06... $10.00

MapReduce works well for computations that can be broken down into a map step, a shuffle step, and a reduce step, but for many real-world computations, a chain of MapReduce stages is required. Such data-parallel pipelines require additional coordination code to chain together the separate MapReduce stages, and require additional work to manage the creation and later deletion of the intermediate results between pipeline stages. The logical computation can become obscured by all these low-level coordination details, making it difficult for new developers to understand the computation. Moreover, the division of the pipeline into particular stages becomes "baked in" to the code and difficult to change later if the logical computation needs to evolve.

In this paper we present FlumeJava, a new system that aims to support the development of data-parallel pipelines. FlumeJava is a Java library centered around a few classes that represent parallel collections.
Parallel collections support a modest number of parallel operations which are composed to implement data-parallel computations. An entire pipeline, or even multiple pipelines, can be implemented in a single Java program using the FlumeJava abstractions; there is no need to break up the logical computation into separate programs for each stage.

FlumeJava's parallel collections abstract away the details of how data is represented, including whether the data is represented as an in-memory data structure, as one or more files, or as an external storage service such as a MySql database or a Bigtable [5]. Similarly, FlumeJava's parallel operations abstract away their implementation strategy, such as whether an operation is implemented as a local sequential loop, or as a remote parallel MapReduce invocation, or (in the future) as a query on a database or as a streaming computation. These abstractions enable an entire pipeline to be initially developed and tested on small in-memory test data, running in a single process, and debugged using standard Java IDEs and debuggers, and then run completely unchanged over large production data. They also confer a degree of adaptability of the logical FlumeJava computations as new data storage mechanisms and execution services are developed.

To achieve good performance, FlumeJava internally implements parallel operations using deferred evaluation. The invocation of a parallel operation does not actually run the operation, but instead simply records the operation and its arguments in an internal execution plan graph structure. Once the execution plan for the whole computation has been constructed, FlumeJava optimizes the execution plan, for example fusing chains of parallel operations together into a small number of MapReduce operations. FlumeJava then runs the optimized execution plan.
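The idea of deferring work and fusing a chain of element-wise operations can be illustrated with a small, single-process Java sketch. Everything in it is an assumption made for illustration: the class DeferredCollection and its methods of, mapEach, recordedOps, and run are hypothetical names, not FlumeJava's API, and the sketch fuses operations simply by composing functions as they are recorded rather than by building and optimizing a dataflow graph.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration of deferred evaluation; NOT FlumeJava's API.
class DeferredCollection<S, T> {
    private final List<S> source;        // input data, untouched until run()
    private final Function<S, T> fused;  // all recorded operations, composed
    private final int opCount;           // number of operations recorded so far

    private DeferredCollection(List<S> source, Function<S, T> fused, int opCount) {
        this.source = source;
        this.fused = fused;
        this.opCount = opCount;
    }

    /** Wraps an in-memory list; no operations are recorded yet. */
    static <S> DeferredCollection<S, S> of(List<S> source) {
        return new DeferredCollection<S, S>(source, Function.<S>identity(), 0);
    }

    /** Records an element-wise operation without running it. */
    <R> DeferredCollection<S, R> mapEach(Function<T, R> fn) {
        return new DeferredCollection<S, R>(source, fused.andThen(fn), opCount + 1);
    }

    int recordedOps() {
        return opCount;
    }

    /** Executes the whole recorded chain in a single pass over the source. */
    List<T> run() {
        List<T> out = new ArrayList<>(source.size());
        for (S element : source) {
            out.add(fused.apply(element));
        }
        return out;
    }
}
```

Calling mapEach twice on a three-element list and then run() visits each element once and applies both functions in that one pass; no user function executes before run() is called.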
When running the execution plan, FlumeJava chooses which strategy to use to implement each operation (e.g., local sequential loop vs. remote parallel MapReduce, based in part on the size of the data being processed), places remote computations near the data they operate on, and performs independent operations in parallel. FlumeJava also manages the creation and clean-up of any intermediate files needed within the computation. The optimized execution plan is typically several times faster than a MapReduce pipeline with the same logical structure, and approaches the performance achievable by an experienced MapReduce programmer writing a hand-optimized chain of MapReduces, but with significantly less effort. The FlumeJava program is also easier to understand and change than the hand-optimized chain of MapReduces.

As of March 2010, FlumeJava has been in use at Google for nearly a year, with 175 different users in the last month and many pipelines running in production. Anecdotal reports are that users find FlumeJava significantly easier to work with than MapReduce.

Our main contributions are the following:

• We have developed a Java library, based on a small set of composable primitives, that is both expressive and convenient.
• We show how this API can be automatically transformed into an efficient execution plan, using deferred evaluation and optimizations such as fusion.
• We have developed a run-time system for executing optimized plans that selects either local or parallel execution automatically and which manages many of the low-level details of running a pipeline.
• We demonstrate through benchmarking that our system is effective at transforming logical computations into efficient programs.
• Our system is in active use by many developers, and has processed petabytes of data.

The next section of this paper gives some background on MapReduce. Section 3 presents the FlumeJava library from the user's point of view. Section 4 describes the FlumeJava optimizer, and Section 5 describes the FlumeJava executor. Section 6 assesses our work, using both usage statistics and benchmark performance results. Section 7 compares our work to related systems. Section 8 concludes.

2. Background on MapReduce

FlumeJava builds on the concepts and abstractions for data-parallel programming introduced by MapReduce. A MapReduce has three phases:

1. The Map phase starts by reading a collection of values or key/value pairs from an input source, such as a text file, binary record-oriented file, Bigtable, or MySql database. Large data sets are often represented by multiple, even thousands, of files (called shards), and multiple file shards can be read as a single logical input source. The Map phase then invokes a user-defined function, the Mapper, on each element, independently and in parallel. For each input element, the user-defined function emits zero or more key/value pairs, which are the outputs of the Map phase. Most MapReduces have a single (possibly sharded) input source and a single Mapper, but in general a single MapReduce can have multiple input sources and associated Mappers.

2. The Shuffle phase takes the key/value pairs emitted by the Mappers and groups together all the key/value pairs with the same key. It then outputs each distinct key and a stream of all the values with that key to the next phase.

3. The Reduce phase takes the key-grouped data emitted by the Shuffle phase and invokes a user-defined function, the Reducer, on each distinct key-and-values group, independently and in parallel. Each Reducer invocation is passed a key and an iterator over all the values associated with that key, and emits zero or more replacement values to associate with the input key. Oftentimes, the Reducer performs some kind of aggregation over all the values with a given key. For other MapReduces, the Reducer is just the identity function. The key/value pairs emitted from all the Reducer calls are then written to an output sink, e.g., a sharded file, Bigtable, or database.
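As a rough, single-process illustration of the three phases above, the following Java sketch runs a word count through a Map step, a Shuffle step, and a Reduce step in memory. The class MiniMapReduce and its Mapper and Reducer interfaces are hypothetical names invented for this example; they are not the API of Google's MapReduce or of FlumeJava, and the sketch ignores sharding, distribution across machines, and fault tolerance.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;

// Hypothetical in-memory illustration; NOT the MapReduce or FlumeJava API.
class MiniMapReduce {
    /** Mapper: emits zero or more key/value pairs for one input element. */
    interface Mapper<I, K, V> {
        void map(I input, BiConsumer<K, V> emit);
    }

    /** Reducer: receives a key and all of its values, returns replacement values. */
    interface Reducer<K, V, R> {
        List<R> reduce(K key, Iterable<V> values);
    }

    static <I, K extends Comparable<K>, V, R> Map<K, List<R>> run(
            List<I> inputs, Mapper<I, K, V> mapper, Reducer<K, V, R> reducer) {
        // Map + Shuffle: invoke the Mapper on each element and group the
        // emitted values by key (a TreeMap keeps the keys in sorted order).
        Map<K, List<V>> groups = new TreeMap<>();
        for (I input : inputs) {
            mapper.map(input,
                (k, v) -> groups.computeIfAbsent(k, unused -> new ArrayList<>()).add(v));
        }
        // Reduce: invoke the Reducer once per distinct key-and-values group.
        Map<K, List<R>> output = new TreeMap<>();
        for (Map.Entry<K, List<V>> group : groups.entrySet()) {
            output.put(group.getKey(), reducer.reduce(group.getKey(), group.getValue()));
        }
        return output;
    }

    /** Word count: the Mapper emits (word, 1); the Reducer sums the ones. */
    static Map<String, List<Integer>> wordCount(List<String> lines) {
        return run(lines,
            (String line, BiConsumer<String, Integer> emit) -> {
                for (String word : line.split("\\s+")) {
                    if (!word.isEmpty()) {
                        emit.accept(word, 1);
                    }
                }
            },
            (String word, Iterable<Integer> counts) -> {
                int sum = 0;
                for (int c : counts) {
                    sum += c;
                }
                return Collections.singletonList(sum);
            });
    }
}
```

In this sketch the Reducer is an aggregation; replacing it with a function that returns its input values unchanged would correspond to the identity Reducer mentioned above.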
For Reducers that first combine all the values with a given key using an associative, commutative operation, a separate user-defined Combiner function can be specified to perform partial combining of values associated with a given key during the Map phase. Each Map worker will keep a cache of key/value pairs that have been emitted from the Mapper, and strive to combine locally as much as possible before sending the combined key/value pairs on to the Shuffle phase. The Reducer will typically complete the combining step, combining values from different Map workers.

By default, the Shuffle phase sends each key-and-values group to a deterministically but randomly chosen Reduce worker machine; this choice determines which output file shard will hold that key's results. Alternatively, a user-defined Sharder function can be specified that selects which Reduce worker machine should receive the group for a given key. A user-defined Sharder can be used to aid in load balancing. It also can be used to sort the output keys into Reduce "buckets," with all the keys of the i-th Reduce worker being ordered before all the keys of the (i+1)-st Reduce worker. Since each Reduce worker processes keys in lexicographic order, this kind of Sharder can be used to produce sorted output.

Many physical machines can be used in parallel in each of these three phases.

MapReduce automatically handles the low-level issues of selecting appropriate parallel worker machines, distributing to them the program to run, managing the temporary storage and flow of intermediate data between the three phases, and synchronizing the overall sequencing of the phases. MapReduce also automatically copes with transient failures of machines, networks, and software, which can be a huge and common challenge for distributed programs run over hundreds of machines.

The core of MapReduce is implemented in C++, but libraries exist that allow MapReduce to be invoked from other languages.
For example, a Java version of MapReduce is implemented as a JNI veneer on top of the C++ version of MapReduce.

MapReduce provides a framework into which parallel computations are mapped. The Map phase supports embarrassingly parallel, element-wise computations. The Shuffle and Reduce phases support cross-element computations, such as aggregations and grouping. The art of programming using MapReduce mainly involves mapping the logical parallel computation into these basic operations. Many computations can be expressed as a MapReduce, but many others require a sequence or graph of MapReduces. As the complexity of the logical computation grows, the challenge of mapping it into a physical sequence of MapReduces increases. Higher-level concepts such as "count the number of occurrences" or "join tables by key" must be hand-compiled into lower-level MapReduce operations. In addition, the user takes on the additional burdens of writing a driver program to invoke the MapReduces in the proper sequence, managing the creation and deletion of intermediate files holding the data passed between MapReduces, and handling failures across MapReduces.

3. The FlumeJava Library

In this section we present the interface to the FlumeJava library, as seen by the FlumeJava user. The FlumeJava library aims to offer constructs that are close to those found in the user's logical