Assisted design of data science pipelines

Sergey Redyuk; Zoi Kaoudi; Sebastian Schelter; Volker Markl

Journal Article (Open Access)

VLDB Journal (2024) 33(4) 1129-1153

DOI: 10.1007/s00778-024-00835-2

Abstract

When designing data science (DS) pipelines, end-users can be overwhelmed by the large and growing set of available data preprocessing and modeling techniques. Intelligent discovery assistants (IDAs) and automated machine learning (AutoML) solutions aim to assist end-users by (semi-)automating the process. However, they are expensive to compute and have limited applicability to a wide range of real-world use cases and application domains. This is due to (a) their need to execute thousands of pipelines to find the optimal one; (b) their limited support of DS tasks, e.g., supervised classification or regression only, and a small, static set of available data preprocessing and ML algorithms; and (c) their restriction to quantifiable evaluation processes and metrics, e.g., tenfold cross-validation using the ROC AUC score for classification. To overcome these limitations, we propose a human-in-the-loop approach for the assisted design of data science pipelines using previously executed pipelines. Based on a user query, i.e., data and a DS task, our framework outputs a ranked list of pipeline candidates, which the user can execute or modify in real time. To recommend pipelines, it first identifies relevant datasets and pipelines using efficient similarity search. It then ranks the candidate pipelines using multi-objective sorting and takes user interactions into account to improve suggestions over time. In our experimental evaluation, the proposed framework significantly outperforms the state-of-the-art IDA tool and achieves predictive performance similar to that of state-of-the-art long-running AutoML solutions while being real-time, generic to any evaluation process and DS task, and extensible to new operators.
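The two recommendation steps named in the abstract (similarity search over previously executed pipelines, then multi-objective sorting of the candidates) can be illustrated with a minimal sketch. This is not the authors' implementation: the `PipelineRun` record, its fields, the two objectives (score and runtime), the brute-force Euclidean search, and the plain non-dominated (Pareto) sorting are all assumptions chosen for illustration only.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class PipelineRun:
    """A previously executed pipeline (hypothetical record, not from the paper)."""
    name: str
    dataset_features: tuple[float, ...]  # assumed meta-features of its dataset
    score: float      # assumed objective 1: higher is better
    runtime_s: float  # assumed objective 2: lower is better


def nearest_runs(query, runs, k):
    """Return the k runs whose datasets are closest to the query.

    Brute-force Euclidean search stands in for an efficient similarity index.
    """
    return sorted(runs, key=lambda r: dist(query, r.dataset_features))[:k]


def dominates(a, b):
    """True if a is no worse than b on both objectives and better on one."""
    no_worse = a.score >= b.score and a.runtime_s <= b.runtime_s
    better = a.score > b.score or a.runtime_s < b.runtime_s
    return no_worse and better


def pareto_rank(runs):
    """Order runs by successive non-dominated fronts (best front first)."""
    remaining, ranked = list(runs), []
    while remaining:
        front = [r for r in remaining
                 if not any(dominates(o, r) for o in remaining)]
        # Within a front no run dominates another; break ties by score.
        ranked.extend(sorted(front, key=lambda r: -r.score))
        remaining = [r for r in remaining if r not in front]
    return ranked


def recommend(query, runs, k=3):
    """Names of the k most similar runs, ranked by Pareto front."""
    return [r.name for r in pareto_rank(nearest_runs(query, runs, k))]
```

Calling `recommend(query_features, past_runs, k=3)` returns pipeline names in ranked order. A user-feedback step, which the abstract mentions but does not detail, is deliberately left out of this sketch.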

Cite

Redyuk, S., Kaoudi, Z., Schelter, S., & Markl, V. (2024). Assisted design of data science pipelines. VLDB Journal, 33(4), 1129–1153. https://doi.org/10.1007/s00778-024-00835-2
