Failure Diagnosis for Cluster Systems using Partial Correlations

Computing and Communications

Electronic data

camera_ready_paper_ID_297
Rights statement: ©2021 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
Accepted author manuscript, 809 KB, PDF document
Available under license: CC BY-NC: Creative Commons Attribution-NonCommercial 4.0 International License

Text available via DOI:

https://doi.org/10.1109/ISPA-BDCloud-SocialCom-SustainCom52081.2021.00151
Final published version

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Published

Edward Chuah
Arshad Jhumka
Samantha Alt
R. Todd Evans
Neeraj Suri

Publication date	22/12/2021
Host publication	2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom)
Publisher	IEEE
Pages	1091-1101
Number of pages	11
ISBN (electronic)	9781665435741
ISBN (print)	9781665411936
Original language	English
Event	The 19th IEEE International Symposium on Parallel and Distributed Processing with Applications, United States. Duration: 1/10/2021 → 3/10/2021. http://www.cloud-conf.net/ispa2021/

Conference

Conference	The 19th IEEE International Symposium on Parallel and Distributed Processing with Applications
Abbreviated title	ISPA
Country/Territory	United States
Period	1/10/21 → 3/10/21
Internet address	http://www.cloud-conf.net/ispa2021/

Abstract

Failures have expensive implications in HPC (High-Performance Computing) systems. Consequently, effective diagnosis of system failures is desired to help improve system reliability from both a remedial and a preventive perspective. As HPC systems conduct extensive logging of resource usage and system events, parsing this data is an oft-advocated basis for failure diagnosis. However, the high levels of concurrency that exist in HPC systems cause system events to frequently interleave in time and, as such, certain interactions appear or become indirect and are missed by current failure diagnostics techniques. To help uncover such indirect interactions, in this paper, we develop a novel approach that leverages the concept of partial correlation. The novel failure diagnostics workflow - called IFADE - extracts the partial correlation of resource use counters and the partial correlation of system errors. As part of our contributions, we (a) compare our diagnostics approach with current ones, (b) identify two previously unknown causes of system failures, validated by system designers, and (c) provide insights into Lustre I/O and segmentation faults. IFADE has been released into the public domain to support system administrators in failure diagnosis.
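
As a rough illustration of the statistical idea the abstract refers to, the sketch below computes a first-order partial correlation: the correlation between two series after the linear effect of a third has been removed. It uses the standard textbook formula and is not taken from IFADE. The function names (`pearson`, `partial_correlation`) and the synthetic series `x`, `y`, `z` are assumptions made for this example only; they stand in for hypothetical resource use counters and do not come from the paper's data.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def partial_correlation(x, y, z):
    """Correlation of x and y with the linear effect of z removed."""
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Synthetic data (hypothetical): x and y are both driven by z plus
# independent noise, so they are related only through z.
rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(5000)]
x = [v + 0.3 * rng.gauss(0, 1) for v in z]
y = [v + 0.3 * rng.gauss(0, 1) for v in z]

raw = pearson(x, y)
partial = partial_correlation(x, y, z)
print(f"plain correlation of x and y:   {raw:.3f}")
print(f"partial correlation given z:    {partial:.3f}")
```

In this synthetic setup the plain correlation between `x` and `y` is high, while the partial correlation given `z` is close to zero, which is how controlling for a third series separates a direct relationship from one that exists only through an intermediary.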
