The Power of Nested Parallelism in Big Data Processing: Hitting Three Flies with One Slap

Abstract

Many common data analysis tasks, such as performing hyperparameter optimization, processing a partitioned graph, and treating a matrix as a vector of vectors, offer natural opportunities for nested-parallel operations, i.e., launching parallel operations from inside other parallel operations. However, state-of-the-art dataflow engines, such as Spark and Flink, do not support nested parallelism. Users must implement workarounds, which cause orders-of-magnitude slowdowns for their tasks on top of the added implementation effort. We present Matryoshka, a system that enables dataflow engines to support nested parallelism, even in the presence of control flow statements at inner nesting levels. Matryoshka achieves this via a novel two-phase flattening process, which translates nested-parallel programs to flat-parallel programs that can efficiently run on existing dataflow engines. The first phase introduces novel nesting primitives into the code; these allow the second phase, at runtime, to apply dynamic optimizations based on intermediate data characteristics. We validate our system using several common data analysis tasks, such as PageRank and K-means.
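
To make the idea of flattening concrete, the following is a minimal sketch in plain Python, not Matryoshka's actual primitives or API. It assumes a toy task (a per-group mean) and hypothetical helper names (`nested_means`, `flatten`, `flat_means`). The nested version runs an inner aggregation inside an outer loop over groups; the flat version tags every element with its group key and performs a single keyed aggregation, which is the kind of operation a flat dataflow engine can parallelize.

```python
from collections import defaultdict


def nested_means(groups):
    """Nested style: an outer pass over groups, with an inner reduction per group."""
    return {key: sum(values) / len(values) for key, values in groups.items()}


def flatten(groups):
    """Turn the nested collection into one flat collection of (key, value) pairs."""
    return [(key, value) for key, values in groups.items() for value in values]


def flat_means(pairs):
    """Flat style: one keyed aggregation over the tagged pairs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Hypothetical toy input: two groups of numbers.
groups = {"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0]}

# Both formulations compute the same per-group result.
assert nested_means(groups) == flat_means(flatten(groups))
```

The sketch only shows why a nested computation can be expressed as a flat one over tagged data; it does not model control flow at inner nesting levels or the runtime optimizations described in the abstract.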

Citation (APA)

Gévay, G. E., Quiané-Ruiz, J. A., & Markl, V. (2021). The Power of Nested Parallelism in Big Data Processing: Hitting Three Flies with One Slap. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 605–618). Association for Computing Machinery. https://doi.org/10.1145/3448016.3457287
