The paradigm of AutoML has created an opportunity to enable ML for the masses. Emerging industrial-scale cloud AutoML platforms aim to automate the end-to-end ML workflow. While many works have looked into automated feature engineering, model selection, or hyper-parameter search in AutoML, little work has studied a crucial step that serves as an entry point to this workflow: ML feature type inference. The semantic gap between attribute types (e.g., strings, numbers) in databases/files and ML feature types (e.g., Numeric, Categorical) necessitates type inference. In this work, we formalize and standardize this task by creating the first-ever benchmark labeled dataset, which we use to objectively evaluate existing AutoML tools. Our dataset has 9921 examples and a 9-class label vocabulary. Our labeled data also offers an alternative to existing rule-based or syntax-based approaches for automating this task: use ML itself to predict feature types. We collate a benchmark suite of 30 classification and regression tasks to assess the importance of type inference for downstream models. Empirical comparison on our labeled data shows that an ML-based approach delivers an average lift of 14% (and up to 38%) in accuracy for identifying feature types compared to prominent industrial tools. Our downstream benchmark suite reveals that the ML-based approach outperforms existing industrial-strength tools for 47 out of 60 downstream models. We release our labeled dataset, models, and downstream benchmarks in a public repository with a leaderboard.
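To make the "use ML itself to predict feature types" idea concrete, the following is a minimal sketch of casting feature type inference as multi-class classification over raw columns. The featurization (column-name hints plus simple descriptive statistics), the random-forest model, and the label names shown are illustrative assumptions for this sketch, not the paper's released pipeline or its exact 9-class vocabulary.

```python
# Sketch: feature type inference as multi-class classification over raw columns.
# Features and labels below are illustrative assumptions, not the released artifacts.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def featurize_column(name, values):
    """Turn a raw column (name + sampled string values) into a numeric feature vector."""
    n = len(values)
    distinct = len(set(values))
    numeric_frac = sum(v.replace(".", "", 1).lstrip("-").isdigit() for v in values) / n
    mean_len = sum(len(v) for v in values) / n
    return np.array([
        distinct / n,     # fraction of distinct values
        numeric_frac,     # fraction of values parseable as numbers
        mean_len,         # average string length
        float("date" in name.lower() or "time" in name.lower()),  # column-name hint
    ])

# Toy labeled columns: (column name, sampled values, feature type label).
train_cols = [
    ("age",       ["23", "45", "31", "52"],                            "Numeric"),
    ("zipcode",   ["92093", "10001", "92093", "30301"],                "Categorical"),
    ("signup_ts", ["2020-01-03", "2021-07-19", "2019-12-01", "2020-05-05"], "Datetime"),
    ("comment",   ["great product", "meh", "would buy again", "broke"],     "Sentence"),
]
X = np.vstack([featurize_column(name, vals) for name, vals, _ in train_cols])
y = [label for _, _, label in train_cols]

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Inference on a new raw column.
print(clf.predict([featurize_column("price", ["10.5", "3.99", "7.25", "12.0"])]))
```

In practice the classifier would be trained on a large labeled corpus of real columns (such as the released benchmark dataset) rather than the toy rows above, and richer signals (e.g., sample values themselves and name n-grams) could replace the hand-picked statistics.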
CITATION
Shah, V., Lacanlale, J., Kumar, P., Yang, K., & Kumar, A. (2021). Towards Benchmarking Feature Type Inference for AutoML Platforms. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 1584–1596). Association for Computing Machinery. https://doi.org/10.1145/3448016.3457274