Assisted design of data science pipelines

Abstract

When designing data science (DS) pipelines, end-users can be overwhelmed by the large and growing set of available data preprocessing and modeling techniques. Intelligent discovery assistants (IDAs) and automated machine learning (AutoML) solutions aim to support end-users by (semi-)automating the process. However, they are expensive to compute and have limited applicability to a wide range of real-world use cases and application domains. This is due to (a) their need to execute thousands of pipelines to find the optimal one; (b) their limited support of DS tasks, e.g., supervised classification or regression only, and a small, static set of available data preprocessing and ML algorithms; and (c) their restriction to quantifiable evaluation processes and metrics, e.g., tenfold cross-validation using the ROC AUC score for classification. To overcome these limitations, we propose a human-in-the-loop approach for the assisted design of data science pipelines using previously executed pipelines. Based on a user query, i.e., data and a DS task, our framework outputs a ranked list of pipeline candidates, from which the user can choose one to execute or modify in real time. To recommend pipelines, it first identifies relevant datasets and pipelines using efficient similarity search. It then ranks the candidate pipelines using multi-objective sorting and takes user interactions into account to improve suggestions over time. In our experimental evaluation, the proposed framework significantly outperforms the state-of-the-art IDA tool and achieves predictive performance similar to that of state-of-the-art long-running AutoML solutions, while being real-time, generic with respect to evaluation processes and DS tasks, and extensible to new operators.
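The two recommendation steps named in the abstract (similarity search over previously executed pipelines, followed by multi-objective sorting) can be illustrated with a small Python sketch. This is not the authors' implementation: the `PipelineRun` record, the numeric dataset feature vectors, the Euclidean distance, the two objectives (score and runtime), and the simple Pareto-front ranking are all assumptions chosen for illustration, and the user-feedback step is left out.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineRun:
    """Hypothetical record of one previously executed pipeline."""
    name: str
    dataset_features: tuple  # assumed numeric description of the dataset it ran on
    score: float             # assumed objective 1: higher is better
    runtime_s: float         # assumed objective 2: lower is better


def nearest_runs(query_features, runs, k):
    """Step 1 (assumed form): keep the k runs whose datasets are closest to the query."""
    return sorted(runs, key=lambda r: math.dist(query_features, r.dataset_features))[:k]


def dominates(a, b):
    """a dominates b if it is no worse on both objectives and strictly better on one."""
    no_worse = a.score >= b.score and a.runtime_s <= b.runtime_s
    strictly_better = a.score > b.score or a.runtime_s < b.runtime_s
    return no_worse and strictly_better


def pareto_fronts(runs):
    """Step 2 (assumed form): peel off successive non-dominated fronts."""
    remaining = list(runs)
    fronts = []
    while remaining:
        front = [r for r in remaining if not any(dominates(o, r) for o in remaining)]
        fronts.append(front)
        remaining = [r for r in remaining if r not in front]
    return fronts


def recommend(query_features, runs, k=3):
    """Return pipeline names ranked front by front among the k nearest runs."""
    candidates = nearest_runs(query_features, runs, k)
    return [r.name for front in pareto_fronts(candidates) for r in front]


# Made-up history of executed pipelines, for illustration only.
history = [
    PipelineRun("A", (1.0, 1.0), score=0.90, runtime_s=10.0),
    PipelineRun("B", (1.1, 0.9), score=0.85, runtime_s=5.0),
    PipelineRun("C", (0.9, 1.0), score=0.80, runtime_s=20.0),
    PipelineRun("D", (9.0, 9.0), score=0.99, runtime_s=1.0),
]

ranking = recommend((1.0, 1.0), history, k=3)
print(ranking)  # ['A', 'B', 'C']
```

In this toy history, pipeline D has the best score and runtime but was executed on a very different dataset, so the similarity step filters it out before ranking. A and B do not dominate each other and share the first front, while C is dominated by A and is ranked last.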

Redyuk, S., Kaoudi, Z., Schelter, S., & Markl, V. (2024). Assisted design of data science pipelines. VLDB Journal, 33(4), 1129–1153. https://doi.org/10.1007/s00778-024-00835-2
