Abstract
Stragglers are exceptionally slow tasks within a job that delay its completion. Stragglers, which are uncommon within a single job, are pervasive in datacenters with many jobs. We present Hound, a statistical machine learning framework that infers the causes of stragglers from traces of datacenter-scale jobs. Hound is designed to achieve several objectives: datacenter-scale diagnosis, unbiased inference, interpretable models, and computational efficiency. We demonstrate Hound's capabilities for a production trace from Google's warehouse-scale datacenters and two Spark traces from Amazon EC2 clusters.
Cite
CITATION STYLE
Zheng, P., & Lee, B. C. (2019). Hound. ACM SIGMETRICS Performance Evaluation Review, 46(1), 59–61. https://doi.org/10.1145/3292040.3219641
Register to see more suggestions
Mendeley helps you to discover research relevant for your work.