SAPIENTML: Synthesizing Machine Learning Pipelines by Learning from Human-Written Solutions

Ripon K. Saha; Akira Ura; Sonal Mahajan; Chenguang Zhu; Linyi Li; Yang Hu; Hiroaki Yoshida; Sarfraz Khurshid; Mukul R. Prasad

Proceedings - International Conference on Software Engineering (2022), Vol. 2022-May, pp. 1932-1944

DOI: 10.1145/3510003.3510226

Abstract

Automatic machine learning, or AutoML, holds the promise of truly democratizing the use of machine learning (ML) by substantially automating the work of data scientists. However, the huge combinatorial search space of candidate pipelines means that current AutoML techniques generate sub-optimal pipelines, or none at all, especially on large, complex datasets. In this work we propose an AutoML technique, SapientML, that can learn from a corpus of existing datasets and their human-written pipelines and efficiently generate a high-quality pipeline for a predictive task on a new dataset. To combat the search space explosion of AutoML, SapientML employs a novel divide-and-conquer strategy, realized as a three-stage program synthesis approach that reasons on successively smaller search spaces. The first stage uses meta-learning to predict a set of plausible ML components to constitute a pipeline. In the second stage, this set is refined into a small pool of viable concrete pipelines using a pipeline dataflow model derived from the corpus. In the third stage, dynamically evaluating these few pipelines yields the best solution. We instantiate SapientML as part of a fully automated tool-chain that creates a cleaned, labeled learning corpus by mining Kaggle, learns from it, and uses the learned models to synthesize pipelines for new predictive tasks. We have created a training corpus of 1,094 pipelines spanning 170 datasets, and evaluated SapientML on a set of 41 benchmark datasets, including 10 new, large, real-world datasets from Kaggle, against 3 state-of-the-art AutoML tools and 4 baselines. Our evaluation shows that SapientML produces the best or comparable accuracy on 27 of the benchmarks, while the second-best tool fails to even produce a pipeline on 9 of the instances. This difference is amplified on the 10 most challenging benchmarks, where SapientML wins on 9 instances, with the other tools failing to produce pipelines on 4 or more benchmarks.
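The three-stage narrowing described in the abstract can be illustrated with a toy sketch. This is not SapientML's implementation: the component names, the hand-written applicability rules standing in for the learned meta-models, the fixed `DATAFLOW_ORDER`, and the caller-supplied `evaluate` function are all assumptions made for illustration only.

```python
# Toy illustration of a three-stage, successively narrowing search.
# All names below are hypothetical stand-ins, not SapientML's API.

# Stage 1 stand-in: simple rules over dataset meta-features. The paper
# uses meta-learning from a corpus here; these rules are an assumption.
PREPROCESSOR_RULES = {
    "impute_missing": lambda meta: meta["missing_ratio"] > 0.0,
    "encode_categorical": lambda meta: meta["n_categorical"] > 0,
    "scale_numeric": lambda meta: meta["n_numeric"] > 0,
}

# Stage 2 stand-in: an assumed fixed ordering of components. The paper
# derives a pipeline dataflow model from the corpus instead.
DATAFLOW_ORDER = ["impute_missing", "encode_categorical", "scale_numeric"]


def predict_components(meta, model_scores, top_k=2):
    """Stage 1: pick plausible preprocessors and the top-k ranked models."""
    preprocessors = [
        name for name, applies in PREPROCESSOR_RULES.items() if applies(meta)
    ]
    models = sorted(model_scores, key=model_scores.get, reverse=True)[:top_k]
    return preprocessors, models


def instantiate_pipelines(preprocessors, models):
    """Stage 2: order the components and emit one concrete pipeline per model."""
    ordered = [step for step in DATAFLOW_ORDER if step in preprocessors]
    return [ordered + [model] for model in models]


def select_best(pipelines, evaluate):
    """Stage 3: run each of the few candidates and keep the best scorer."""
    return max(pipelines, key=evaluate)


# Usage with made-up meta-features and made-up validation scores.
meta = {"missing_ratio": 0.1, "n_categorical": 0, "n_numeric": 5}
model_scores = {"random_forest": 0.7, "logistic_regression": 0.5, "svm": 0.2}
validation = {"random_forest": 0.81, "logistic_regression": 0.86}

preprocessors, models = predict_components(meta, model_scores)
candidates = instantiate_pipelines(preprocessors, models)
best = select_best(candidates, lambda pipeline: validation[pipeline[-1]])
print(best)
```

The point of the sketch is only the shape of the search: stage 1 prunes the component space without running anything, stage 2 turns the surviving components into a handful of concrete pipelines, and only stage 3 pays the cost of execution.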

Cite (APA)

Saha, R. K., Ura, A., Mahajan, S., Zhu, C., Li, L., Hu, Y., … Prasad, M. R. (2022). SAPIENTML: Synthesizing Machine Learning Pipelines by Learning from Human-Written Solutions. In Proceedings - International Conference on Software Engineering (Vol. 2022-May, pp. 1932–1944). IEEE Computer Society. https://doi.org/10.1145/3510003.3510226
