FlumeJava: easy, efficient data-parallel pipelines
MapReduce and similar systems significantly ease the task of writing data-parallel code. However, many real-world computations require a pipeline of MapReduces, and programming and managing such pipelines can be difficult. We present FlumeJava, a Java library that makes it easy to develop, test, and run efficient data-parallel pipelines. At the core of the FlumeJava library are a couple of classes that represent immutable parallel collections, each supporting a modest number of operations for processing them in parallel. Parallel collections and their operations present a simple, high-level, uniform abstraction over different data representations and execution strategies. To enable parallel operations to run efficiently, FlumeJava defers their evaluation, instead internally constructing an execution plan dataflow graph. When the final results of the parallel operations are eventually needed, FlumeJava first optimizes the execution plan, and then executes the optimized operations on appropriate underlying primitives (e.g., MapReduces). The combination of high-level abstractions for parallel data and computation, deferred evaluation and optimization, and efficient parallel primitives yields an easy-to-use system that approaches the efficiency of hand-optimized pipelines. FlumeJava is in active use by hundreds of pipeline developers within Google.
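The idea of deferred evaluation can be illustrated with a small, single-machine toy. The sketch below is not the FlumeJava API: the class DeferredList and its methods of, map, plannedStages, and run are hypothetical names chosen for illustration. It assumes that an element-wise map is the only operation and that the only optimization is fusing consecutive maps into one function, which stands in for the much richer plan optimization described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/**
 * Toy immutable collection with deferred evaluation.
 * Hypothetical illustration only; this is NOT the FlumeJava API.
 */
final class DeferredList<T> {
    private final List<?> source;                        // original input elements
    private final List<Function<Object, Object>> plan;   // recorded, not yet executed, stages

    private DeferredList(List<?> source, List<Function<Object, Object>> plan) {
        this.source = source;
        this.plan = plan;
    }

    /** Wraps a copy of the input; no stages are recorded yet. */
    static <T> DeferredList<T> of(List<T> items) {
        return new DeferredList<>(new ArrayList<>(items), new ArrayList<>());
    }

    /** Records a map stage and returns a new collection; fn is not called here. */
    @SuppressWarnings("unchecked")
    <R> DeferredList<R> map(Function<? super T, ? extends R> fn) {
        List<Function<Object, Object>> next = new ArrayList<>(plan);
        next.add(x -> fn.apply((T) x));
        return new DeferredList<>(source, next);
    }

    /** Number of stages currently recorded in the plan. */
    int plannedStages() {
        return plan.size();
    }

    /** "Optimizes" the plan by fusing all stages into one function, then executes it. */
    @SuppressWarnings("unchecked")
    List<T> run() {
        Function<Object, Object> fused = Function.identity();
        for (Function<Object, Object> stage : plan) {
            fused = fused.andThen(stage);
        }
        List<T> out = new ArrayList<>();
        for (Object item : source) {
            out.add((T) fused.apply(item));   // single pass over the data
        }
        return out;
    }

    public static void main(String[] args) {
        DeferredList<Integer> lengths = DeferredList
                .of(Arrays.asList("a", "bb", "ccc"))
                .map(String::length)    // recorded only
                .map(n -> n * 10);      // recorded only
        System.out.println(lengths.plannedStages() + " stages planned");
        System.out.println(lengths.run());   // work happens here
    }
}
```

Because map only appends to the plan, the user-supplied functions are not called until run() is invoked, at which point the two recorded stages are applied in a single pass over the input rather than in two separate passes with an intermediate collection.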