Localizing Failure Root Causes in a Microservice through Causality Inference

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN > Conference contribution/Paper > peer-review

Published

Standard

Localizing Failure Root Causes in a Microservice through Causality Inference. / Meng, Yuan; Zhang, Shenglin; Sun, Yongqian et al.
2020 IEEE/ACM 28th International Symposium on Quality of Service (IWQoS). IEEE, 2020. p. 1-10.

Harvard

Meng, Y, Zhang, S, Sun, Y, Zhang, R, Hu, Z, Zhang, Y, Jia, C, Wang, Z & Pei, D 2020, Localizing Failure Root Causes in a Microservice through Causality Inference. in 2020 IEEE/ACM 28th International Symposium on Quality of Service (IWQoS). IEEE, pp. 1-10. https://doi.org/10.1109/IWQoS49365.2020.9213058

APA

Meng, Y., Zhang, S., Sun, Y., Zhang, R., Hu, Z., Zhang, Y., Jia, C., Wang, Z., & Pei, D. (2020). Localizing Failure Root Causes in a Microservice through Causality Inference. In 2020 IEEE/ACM 28th International Symposium on Quality of Service (IWQoS) (pp. 1-10). IEEE. https://doi.org/10.1109/IWQoS49365.2020.9213058

Vancouver

Meng Y, Zhang S, Sun Y, Zhang R, Hu Z, Zhang Y et al. Localizing Failure Root Causes in a Microservice through Causality Inference. In 2020 IEEE/ACM 28th International Symposium on Quality of Service (IWQoS). IEEE. 2020. p. 1-10. Epub 2020 Jun 15. doi: 10.1109/IWQoS49365.2020.9213058

Author

Meng, Yuan ; Zhang, Shenglin ; Sun, Yongqian et al. / Localizing Failure Root Causes in a Microservice through Causality Inference. 2020 IEEE/ACM 28th International Symposium on Quality of Service (IWQoS). IEEE, 2020. pp. 1-10

Bibtex

@inproceedings{882650427914403abfd01ca2da569192,
title = "Localizing Failure Root Causes in a Microservice through Causality Inference",
abstract = "An increasing number of Internet applications are applying microservice architecture due to its flexibility and clear logic. The stability of microservice is thus vitally important for these applications' quality of service. Accurate failure root cause localization can help operators quickly recover microservice failures and mitigate loss. Although cross-microservice failure root cause localization has been well studied, how to localize failure root causes in a microservice so as to quickly mitigate this microservice has not yet been studied. In this work, we propose a framework, MicroCause, to accurately localize the root cause monitoring indicators in a microservice. MicroCause combines a simple yet effective path condition time series (PCTS) algorithm which accurately captures the sequential relationship of time series data, and a novel temporal cause oriented random walk (TCORW) method integrating the causal relationship, temporal order, and priority information of monitoring data. We evaluate MicroCause based on 86 real-world failure tickets collected from a top tier global online shopping service. Our experiments show that the top 5 accuracy (AC@5) of MicroCause for intra-microservice failure root cause localization is 98.7%, which is greatly higher (by 33.4%) than the best baseline method.",
author = "Yuan Meng and Shenglin Zhang and Yongqian Sun and Ruru Zhang and Zhilong Hu and Yiyin Zhang and Chenyang Jia and Zhaogang Wang and Dan Pei",
year = "2020",
month = oct,
day = "6",
doi = "10.1109/IWQoS49365.2020.9213058",
language = "English",
isbn = "9781728168883",
pages = "1--10",
booktitle = "2020 IEEE/ACM 28th International Symposium on Quality of Service (IWQoS)",
publisher = "IEEE",
}

RIS

TY  - GEN
T1  - Localizing Failure Root Causes in a Microservice through Causality Inference
AU  - Meng, Yuan
AU  - Zhang, Shenglin
AU  - Sun, Yongqian
AU  - Zhang, Ruru
AU  - Hu, Zhilong
AU  - Zhang, Yiyin
AU  - Jia, Chenyang
AU  - Wang, Zhaogang
AU  - Pei, Dan
PY  - 2020/10/6
Y1  - 2020/10/6
AB  - An increasing number of Internet applications are applying microservice architecture due to its flexibility and clear logic. The stability of microservice is thus vitally important for these applications' quality of service. Accurate failure root cause localization can help operators quickly recover microservice failures and mitigate loss. Although cross-microservice failure root cause localization has been well studied, how to localize failure root causes in a microservice so as to quickly mitigate this microservice has not yet been studied. In this work, we propose a framework, MicroCause, to accurately localize the root cause monitoring indicators in a microservice. MicroCause combines a simple yet effective path condition time series (PCTS) algorithm which accurately captures the sequential relationship of time series data, and a novel temporal cause oriented random walk (TCORW) method integrating the causal relationship, temporal order, and priority information of monitoring data. We evaluate MicroCause based on 86 real-world failure tickets collected from a top tier global online shopping service. Our experiments show that the top 5 accuracy (AC@5) of MicroCause for intra-microservice failure root cause localization is 98.7%, which is greatly higher (by 33.4%) than the best baseline method.
U2  - 10.1109/IWQoS49365.2020.9213058
DO  - 10.1109/IWQoS49365.2020.9213058
M3  - Conference contribution/Paper
SN  - 9781728168883
SP  - 1
EP  - 10
BT  - 2020 IEEE/ACM 28th International Symposium on Quality of Service (IWQoS)
PB  - IEEE
ER  -