Machine Learning (ML) has been growing in popularity across multiple areas and groups at CERN, with applications including fast simulation, tracking, and anomaly detection, among many others. We describe a new service available at CERN, based on Kubeflow, that manages the full ML lifecycle: data preparation and interactive analysis, large-scale distributed model training, and model serving. We cover specific features available for hyper-parameter tuning and model metadata management, as well as infrastructure details for integrating accelerators and external resources. We also present results and a cost evaluation from scaling out a popular ML use case on public cloud resources, achieving close to linear scaling when using a large number of GPUs.
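To make the phrase "close to linear scaling" concrete, the sketch below shows one common way to quantify it: speedup relative to a baseline run, divided by the ideal speedup given the increase in GPU count. The function name `scaling_efficiency` and the timings are hypothetical illustrations, not the paper's method or measurements.

```python
def scaling_efficiency(base_gpus, base_time, gpus, time):
    """Return (speedup, efficiency) of a run relative to a baseline run.

    speedup    = base_time / time
    efficiency = speedup / (gpus / base_gpus); 1.0 means perfectly linear scaling.
    """
    speedup = base_time / time
    ideal_speedup = gpus / base_gpus
    return speedup, speedup / ideal_speedup


if __name__ == "__main__":
    # Hypothetical seconds-per-epoch values, NOT results from the paper.
    runs = {1: 1000.0, 8: 130.0, 64: 17.0}
    for gpus, seconds in runs.items():
        speedup, efficiency = scaling_efficiency(1, runs[1], gpus, seconds)
        print(f"{gpus:>3} GPUs: speedup {speedup:.1f}x, efficiency {efficiency:.0%}")
```

An efficiency that stays near 1.0 as the GPU count grows is what "close to linear" usually means in practice; a cost evaluation can then weigh that efficiency against the price per GPU-hour.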
Golubovic, D., & Rocha, R. (2021). Training and Serving ML workloads with Kubeflow at CERN. EPJ Web of Conferences, 251, 02067. https://doi.org/10.1051/epjconf/202125102067