Abstract
Context: Identifying the possible root causes of observed failures is crucial in microservice applications, as is explaining how those root failures propagated across the microservices forming an application. Doing so helps pick countermeasures that prevent the observed failures from happening again, e.g., by introducing circuit breakers or bulkheads that keep root failures from propagating and causing the observed ones.
Objective: This paper aims to enable explaining observed failures in microservice applications, either by searching for all possible cascading failures or by focusing only on those starting from a known root cause.
Method: We propose a log-based root cause analysis technique, which declaratively determines the cascading failures that possibly caused an observed failure. We also make the technique usable in practice by introducing a logging methodology for instrumenting applications to log their failures and service interactions, and by enabling the analysis of such logs through yRCA, a prototype implementation of the technique.
Results: The practical usability of the technique is assessed by means of a case study and controlled experiments. The case study shows that instrumenting a third-party application to produce the logs needed by the technique requires little effort, and that the technique is effective in explaining injected failures. The controlled experiments further assess the technique's effectiveness and performance in explaining failures obtained with an existing chaos testbed.
Conclusion: The proposed technique can help identify the cascading failures that possibly caused an observed failure in a microservice application. It can be used to determine all possible cascading failures, or to explain how cascading failures propagated from a known root cause (e.g., one identified with another existing root cause analyser).
Soldani, J., Forti, S., Roveroni, L., & Brogi, A. (2025). Explaining Microservices’ Cascading Failures From Their Logs. Software - Practice and Experience, 55(5), 809–828. https://doi.org/10.1002/spe.3400