Predicate Pushdown for Data Science Pipelines

Cong Yan; Yin Lin; Yeye He

Journal Article (Open Access)

Proceedings of the ACM on Management of Data (2023) 1(2) 1-28

DOI: 10.1145/3589281

Abstract

Predicate pushdown is a widely adopted query optimization. Existing systems and prior work mostly use pattern-matching rules to decide when a predicate can be pushed through certain operators such as join or groupby. However, data science pipelines are hard to optimize this way because they rely heavily on non-relational operators and user-defined functions (UDFs) that existing rules fail to cover. In this paper, we present MagicPush, which decides predicate pushdown using a search-verification approach. MagicPush searches for candidate predicates on the pipeline input, which often differ from the predicate being pushed down, and verifies with full correctness guarantees that the pushdown does not change the pipeline output. Our evaluation on TPC-H queries and 200 real-world pipelines sampled from GitHub Notebooks shows that MagicPush substantially outperforms a strong baseline that uses a union of rules from prior work: it discovers new pushdown opportunities and better optimizes 42 real-world pipelines, with up to 99% reduction in running time, while finding every pushdown opportunity found by the baseline in the remaining cases.
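
To make the search-verification idea concrete, here is a minimal Python sketch. It is not MagicPush and is not taken from the paper: the toy pipeline, the predicates, and the helper `agrees_on_sample` are all hypothetical names chosen for illustration. It also only compares outputs on one sample input, which is far weaker than the full correctness guarantee the abstract describes. What it does show is that a candidate predicate on the input (a string-prefix test on `date`) can differ from the predicate on the output (`year == 2023`), and that an overly strict candidate is rejected.

```python
# Illustrative sketch only; every name below is hypothetical and not part of MagicPush.

def pipeline(rows):
    """Toy pipeline: drop non-positive quantities, then derive `year` and `total`."""
    return [
        {"year": int(r["date"][:4]), "total": r["price"] * r["qty"]}
        for r in rows
        if r["qty"] > 0
    ]

def output_predicate(row):
    # Filter applied to the pipeline output.
    return row["year"] == 2023

def candidate_input_predicate(row):
    # Candidate filter on the pipeline input; note it is not the same expression.
    return row["date"].startswith("2023")

def too_strict_predicate(row):
    # A wrong candidate: it discards input rows that the output filter would keep.
    return row["date"].startswith("2023-05")

def agrees_on_sample(rows, input_predicate):
    """Compare filtered output with and without the input-side filter.

    This is a check on one sample input, not a proof of equivalence.
    """
    baseline = [r for r in pipeline(rows) if output_predicate(r)]
    filtered_input = [r for r in rows if input_predicate(r)]
    pushed = [r for r in pipeline(filtered_input) if output_predicate(r)]
    return baseline == pushed

SAMPLE = [
    {"date": "2022-11-30", "price": 5.0, "qty": 2},
    {"date": "2023-01-15", "price": 3.0, "qty": 4},
    {"date": "2023-05-02", "price": 7.5, "qty": 1},
    {"date": "2023-07-09", "price": 2.0, "qty": 0},
]

print(agrees_on_sample(SAMPLE, candidate_input_predicate))  # True
print(agrees_on_sample(SAMPLE, too_strict_predicate))       # False
```

Agreement on a sample can only rule a candidate out; accepting one safely requires a verification step that holds for all inputs, which is the part the paper contributes.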

Cite (APA)

Yan, C., Lin, Y., & He, Y. (2023). Predicate Pushdown for Data Science Pipelines. Proceedings of the ACM on Management of Data, 1(2), 1–28. https://doi.org/10.1145/3589281
