Localizing Failure Root Causes in a Microservice through Causality Inference

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Published
  • Yuan Meng
  • Shenglin Zhang
  • Yongqian Sun
  • Ruru Zhang
  • Zhilong Hu
  • Yiyin Zhang
  • Chenyang Jia
  • Zhaogang Wang
  • Dan Pei
Publication date: 6/10/2020
Host publication: 2020 IEEE/ACM 28th International Symposium on Quality of Service (IWQoS)
Publisher: IEEE
Pages: 1-10
Number of pages: 10
ISBN (electronic): 9781728168876
ISBN (print): 9781728168883
Original language: English

Abstract

An increasing number of Internet applications are adopting the microservice architecture because of its flexibility and clear logic. The stability of each microservice is therefore vital to these applications' quality of service. Accurate failure root cause localization helps operators recover quickly from microservice failures and mitigate losses. Although cross-microservice failure root cause localization has been well studied, how to localize failure root causes within a single microservice, so that failures of that microservice can be mitigated quickly, has not yet been studied. In this work, we propose a framework, MicroCause, to accurately localize the root cause monitoring indicators in a microservice. MicroCause combines a simple yet effective path condition time series (PCTS) algorithm, which accurately captures the sequential relationship of time series data, with a novel temporal cause oriented random walk (TCORW) method that integrates the causal relationship, temporal order, and priority information of monitoring data. We evaluate MicroCause on 86 real-world failure tickets collected from a top-tier global online shopping service. Our experiments show that the top-5 accuracy (AC@5) of MicroCause for intra-microservice failure root cause localization is 98.7%, which is 33.4% higher than the best baseline method.
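
The abstract reports results as top-5 accuracy (AC@5) but does not define the metric. As a rough illustration only, the Python sketch below computes a common simplified form of top-k accuracy: the fraction of failure cases whose labelled root cause indicator appears among the first k entries of the ranked output. The function name `accuracy_at_k`, the single-root-cause-per-case assumption, and the indicator names in the usage note are all hypothetical; the paper's exact definition may differ.

```python
from typing import Sequence


def accuracy_at_k(
    rankings: Sequence[Sequence[str]],
    truths: Sequence[str],
    k: int = 5,
) -> float:
    """Simplified top-k accuracy (assumed form, not taken from the paper).

    rankings[i] is the ranked list of candidate root cause indicators
    for failure case i, most suspicious first. truths[i] is the single
    labelled root cause indicator for that case (an assumption: one
    label per case).
    """
    if len(rankings) != len(truths):
        raise ValueError("rankings and truths must have the same length")
    if not rankings:
        raise ValueError("at least one failure case is required")
    if k < 1:
        raise ValueError("k must be at least 1")

    # A case counts as a hit when its labelled indicator is in the top k.
    hits = sum(
        1 for ranked, truth in zip(rankings, truths) if truth in ranked[:k]
    )
    return hits / len(rankings)
```

For example, with two hypothetical cases ranked as `["cpu", "mem"]` and `["disk", "net", "cpu"]` and labelled root causes `"mem"` and `"cpu"`, calling `accuracy_at_k(rankings, truths, k=2)` counts only the first case as a hit.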