Abstract
Online inference is becoming a key service product for many businesses, deployed on cloud platforms to meet customer demand. Despite their revenue-generating capability, these services must operate under tight Quality-of-Service (QoS) and cost budget constraints. This paper introduces KAIROS, a novel runtime framework that maximizes query throughput while meeting a QoS target and a cost budget. KAIROS designs and implements novel techniques to build a pool of heterogeneous compute hardware without online exploration overhead and to distribute inference queries optimally across it at runtime. Our evaluation using industry-grade machine learning (ML) models shows that KAIROS yields up to 2x the throughput of an optimal homogeneous solution and outperforms state-of-the-art schemes by up to 70%, even though the competing schemes are implemented advantageously, with their exploration overheads ignored.
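To make the setting concrete, the following is a minimal Python sketch of the general problem the abstract describes: building a pool of heterogeneous instance types under a cost budget and splitting queries across it while meeting a latency (QoS) target. The instance names, the throughput/latency/cost figures, and the greedy throughput-per-dollar heuristic are all illustrative assumptions for this sketch; they are not KAIROS's actual pool-construction or query-distribution techniques.

# Illustrative sketch only; not the algorithm from the paper.
from dataclasses import dataclass

@dataclass
class InstanceType:
    name: str              # hypothetical instance type name
    qps: float             # sustainable queries/sec per instance (assumed)
    p99_latency_ms: float  # tail latency per query (assumed)
    cost_per_hour: float   # rental cost (assumed)

# Hypothetical catalog of cloud offerings, for illustration only.
CATALOG = [
    InstanceType("gpu-large", qps=400.0, p99_latency_ms=18.0, cost_per_hour=3.0),
    InstanceType("gpu-small", qps=120.0, p99_latency_ms=32.0, cost_per_hour=1.0),
    InstanceType("cpu-only",  qps=30.0,  p99_latency_ms=95.0, cost_per_hour=0.2),
]

def build_pool(budget_per_hour: float, qos_ms: float) -> dict[str, int]:
    """Greedily fill the budget with the instance types that offer the
    most throughput per dollar, keeping only types whose tail latency
    satisfies the QoS target."""
    feasible = [t for t in CATALOG if t.p99_latency_ms <= qos_ms]
    feasible.sort(key=lambda t: t.qps / t.cost_per_hour, reverse=True)
    pool: dict[str, int] = {}
    remaining = budget_per_hour
    for t in feasible:
        count = int(remaining // t.cost_per_hour)
        if count > 0:
            pool[t.name] = count
            remaining -= count * t.cost_per_hour
    return pool

def route_weights(pool: dict[str, int]) -> dict[str, float]:
    """Split incoming queries across instance types in proportion to
    each type's aggregate capacity in the pool."""
    by_name = {t.name: t for t in CATALOG}
    caps = {name: n * by_name[name].qps for name, n in pool.items()}
    total = sum(caps.values())
    return {name: cap / total for name, cap in caps.items()}

if __name__ == "__main__":
    pool = build_pool(budget_per_hour=10.0, qos_ms=50.0)
    print("pool:", pool)                  # {'gpu-large': 3, 'gpu-small': 1}
    print("routing:", route_weights(pool))

With these assumed numbers, the sketch selects a mixed pool (three gpu-large plus one gpu-small under the $10/hour budget) rather than a single instance type, which illustrates why a heterogeneous mix can deliver more throughput than the best homogeneous configuration at the same cost.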
Citation
Li, B., Samsi, S., Gadepally, V., & Tiwari, D. (2023). KAIROS: Building cost-efficient machine learning inference systems with heterogeneous cloud resources. In Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing (HPDC '23) (pp. 3–16). Association for Computing Machinery. https://doi.org/10.1145/3588195.3592997