Sleuth: A Trace-Based Root Cause Analysis System for Large-Scale Microservices with Graph Neural Networks

Yu Gan; Guiyang Liu; Xin Zhang; Qi Zhou; Jiesheng Wu; Jiangwei Jiang

Conference Proceedings (Open Access)

International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS (2024) 4 324-337

DOI: 10.1145/3623278.3624758

12 Citations

16 Readers

Abstract

Cloud microservices are being scaled up due to the rising demand for new features and the convenience of cloud-native technologies. However, the growing scale of microservices complicates the remote procedure call (RPC) dependency graph, exacerbates the tail-of-scale effect, and makes many of the empirical rules for detecting the root cause of end-to-end performance issues unreliable. Additionally, existing open-source microservice benchmarks are too small to evaluate performance debugging algorithms at production scale, with hundreds or even thousands of services and RPCs. To address these challenges, we present Sleuth, a trace-based root cause analysis (RCA) system for large-scale microservices using unsupervised graph learning. Sleuth uses a graph neural network to capture the causal impact of each span in a trace, and clusters traces with a trace distance metric to reduce the number of traces required for root cause localization. A pre-trained Sleuth model can be transferred to different microservice applications without any retraining or with few-shot fine-tuning. To quantitatively evaluate the performance and scalability of Sleuth, we propose a method to generate microservice benchmarks comparable to production scale. Experiments on existing benchmark suites and synthetic large-scale microservices indicate that Sleuth significantly outperforms prior work in detection accuracy, performance, and adaptability on large-scale deployments.
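
The trace clustering idea mentioned in the abstract can be illustrated with a small sketch. The abstract does not define Sleuth's trace distance metric, so this example swaps in a plain Jaccard distance over caller-callee RPC edges and a greedy threshold-based grouping. The trace representation, the function names `jaccard_distance` and `cluster_traces`, the threshold value, and the service names are all assumptions made for illustration, not the authors' method.

```python
# Illustrative sketch only: Jaccard distance stands in for Sleuth's
# (unspecified) trace distance metric. Each trace is modeled, as an
# assumption, by the set of (caller, callee) RPC edges it contains.

def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two sets of RPC edges."""
    union = a | b
    if not union:
        return 0.0  # two empty traces are treated as identical
    return 1.0 - len(a & b) / len(union)


def cluster_traces(traces, threshold=0.5):
    """Greedy clustering of traces by distance to a representative.

    A trace joins the first cluster whose representative (the first
    member) is within `threshold`; otherwise it starts a new cluster.
    """
    clusters = []
    for trace_id, edges in traces.items():
        for cluster in clusters:
            representative = traces[cluster[0]]
            if jaccard_distance(edges, representative) <= threshold:
                cluster.append(trace_id)
                break
        else:
            clusters.append([trace_id])
    return clusters


# Hypothetical traces over made-up service names.
traces = {
    "t1": {("frontend", "cart"), ("cart", "db")},
    "t2": {("frontend", "cart"), ("cart", "db"), ("cart", "cache")},
    "t3": {("frontend", "search"), ("search", "index")},
}

clusters = cluster_traces(traces)
# Only one representative per cluster would need further analysis.
representatives = [cluster[0] for cluster in clusters]
```

In this toy input, `t1` and `t2` share two of three edges and fall into the same cluster, while `t3` shares none and forms its own, so a downstream analysis would inspect two representative traces instead of three.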

Cite

Gan, Y., Liu, G., Zhang, X., Zhou, Q., Wu, J., & Jiang, J. (2024). Sleuth: A Trace-Based Root Cause Analysis System for Large-Scale Microservices with Graph Neural Networks. In International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS (Vol. 4, pp. 324–337). Association for Computing Machinery. https://doi.org/10.1145/3623278.3624758
