Electronic data

  • camera_ready_paper_ID_297

    Rights statement: ©2021 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

    Accepted author manuscript, 809 KB, PDF document

    Available under license: CC BY-NC: Creative Commons Attribution-NonCommercial 4.0 International License

Failure Diagnosis for Cluster Systems using Partial Correlations

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Published

Standard

Failure Diagnosis for Cluster Systems using Partial Correlations. / Chuah, Edward; Jhumka, Arshad; Alt, Samantha et al.
2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom). IEEE, 2021. p. 1091-1101.

Harvard

Chuah, E, Jhumka, A, Alt, S, Evans, RT & Suri, N 2021, Failure Diagnosis for Cluster Systems using Partial Correlations. in 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom). IEEE, pp. 1091-1101, The 19th IEEE International Symposium on Parallel and Distributed Processing with Applications, New York, United States, 1/10/21. https://doi.org/10.1109/ISPA-BDCloud-SocialCom-SustainCom52081.2021.00151

APA

Chuah, E., Jhumka, A., Alt, S., Evans, R. T., & Suri, N. (2021). Failure Diagnosis for Cluster Systems using Partial Correlations. In 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom) (pp. 1091-1101). IEEE. https://doi.org/10.1109/ISPA-BDCloud-SocialCom-SustainCom52081.2021.00151

Vancouver

Chuah E, Jhumka A, Alt S, Evans RT, Suri N. Failure Diagnosis for Cluster Systems using Partial Correlations. In 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom). IEEE. 2021. p. 1091-1101. Epub 2021 Oct 30. doi: 10.1109/ISPA-BDCloud-SocialCom-SustainCom52081.2021.00151

Author

Chuah, Edward ; Jhumka, Arshad ; Alt, Samantha et al. / Failure Diagnosis for Cluster Systems using Partial Correlations. 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom). IEEE, 2021. pp. 1091-1101

Bibtex

@inproceedings{d20361c062734f4387e910c914881e4a,
title = "Failure Diagnosis for Cluster Systems using Partial Correlations",
abstract = "Failures have expensive implications in HPC (High-Performance Computing) systems. Consequently, effective diagnosis of system failures is desired to help improve system reliability from both a remedial and preventive perspective. As HPC systems conduct extensive logging of resource usage and system events, parsing this data is an oft-advocated basis for failure diagnosis. However, the high levels of concurrency that exist in HPC systems cause system events to frequently interleave in time and, as such, certain interactions appear or become indirect, which will be missed by current failure diagnostics techniques. To help uncover such indirect interactions, in this paper, we develop a novel approach that leverages the concept of partial correlation. The novel failure diagnostics workflow - called IFADE - extracts partial correlation of resource use counters and partial correlation of system errors. As part of our contributions, we (a) compare our diagnostics approach with current ones, (b) identify two previously unknown causes of system failures, validated by system designers and (c) provide insights into Lustre I/O and segmentation faults. IFADE has been put in the public domain to support system administrators in failure diagnosis.",
author = "Edward Chuah and Arshad Jhumka and Samantha Alt and Evans, {R. Todd} and Neeraj Suri",
note = "{\textcopyright}2021 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE. ; The 19th IEEE International Symposium on Parallel and Distributed Processing with Applications, ISPA ; Conference date: 01-10-2021 Through 03-10-2021",
year = "2021",
month = dec,
day = "22",
doi = "10.1109/ISPA-BDCloud-SocialCom-SustainCom52081.2021.00151",
language = "English",
isbn = "9781665411936",
pages = "1091--1101",
booktitle = "2021 IEEE Intl Conf on Parallel \& Distributed Processing with Applications, Big Data \& Cloud Computing, Sustainable Computing \& Communications, Social Computing \& Networking (ISPA/BDCloud/SocialCom/SustainCom)",
publisher = "IEEE",
url = "http://www.cloud-conf.net/ispa2021/",

}

RIS

TY - GEN

T1 - Failure Diagnosis for Cluster Systems using Partial Correlations

AU - Chuah, Edward

AU - Jhumka, Arshad

AU - Alt, Samantha

AU - Evans, R. Todd

AU - Suri, Neeraj

N1 - ©2021 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

PY - 2021/12/22

Y1 - 2021/12/22

N2 - Failures have expensive implications in HPC (High-Performance Computing) systems. Consequently, effective diagnosis of system failures is desired to help improve system reliability from both a remedial and preventive perspective. As HPC systems conduct extensive logging of resource usage and system events, parsing this data is an oft-advocated basis for failure diagnosis. However, the high levels of concurrency that exist in HPC systems cause system events to frequently interleave in time and, as such, certain interactions appear or become indirect, which will be missed by current failure diagnostics techniques. To help uncover such indirect interactions, in this paper, we develop a novel approach that leverages the concept of partial correlation. The novel failure diagnostics workflow - called IFADE - extracts partial correlation of resource use counters and partial correlation of system errors. As part of our contributions, we (a) compare our diagnostics approach with current ones, (b) identify two previously unknown causes of system failures, validated by system designers and (c) provide insights into Lustre I/O and segmentation faults. IFADE has been put in the public domain to support system administrators in failure diagnosis.

AB - Failures have expensive implications in HPC (High-Performance Computing) systems. Consequently, effective diagnosis of system failures is desired to help improve system reliability from both a remedial and preventive perspective. As HPC systems conduct extensive logging of resource usage and system events, parsing this data is an oft-advocated basis for failure diagnosis. However, the high levels of concurrency that exist in HPC systems cause system events to frequently interleave in time and, as such, certain interactions appear or become indirect, which will be missed by current failure diagnostics techniques. To help uncover such indirect interactions, in this paper, we develop a novel approach that leverages the concept of partial correlation. The novel failure diagnostics workflow - called IFADE - extracts partial correlation of resource use counters and partial correlation of system errors. As part of our contributions, we (a) compare our diagnostics approach with current ones, (b) identify two previously unknown causes of system failures, validated by system designers and (c) provide insights into Lustre I/O and segmentation faults. IFADE has been put in the public domain to support system administrators in failure diagnosis.

U2 - 10.1109/ISPA-BDCloud-SocialCom-SustainCom52081.2021.00151

DO - 10.1109/ISPA-BDCloud-SocialCom-SustainCom52081.2021.00151

M3 - Conference contribution/Paper

SN - 9781665411936

SP - 1091

EP - 1101

BT - 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom)

PB - IEEE

T2 - The 19th IEEE International Symposium on Parallel and Distributed Processing with Applications

Y2 - 1 October 2021 through 3 October 2021

ER -