A Survey of Log-Correlation Tools for Failure Diagnosis and Prediction in Cluster Systems

Computing and Communications

Associated organisational unit

Security Lancaster

Text available via DOI:

https://doi.org/10.1109/ACCESS.2022.3231454
Final published version
Available under license: CC BY: Creative Commons Attribution 4.0 International License


Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Standard

A Survey of Log-Correlation Tools for Failure Diagnosis and Prediction in Cluster Systems. / Chuah, Edward; Jhumka, Arshad; Malek, Miroslaw et al.
In: IEEE Access, Vol. 10, 29.12.2022, p. 133487-133503.


Harvard

Chuah, E, Jhumka, A, Malek, M & Suri, N 2022, 'A Survey of Log-Correlation Tools for Failure Diagnosis and Prediction in Cluster Systems', IEEE Access, vol. 10, pp. 133487-133503. https://doi.org/10.1109/ACCESS.2022.3231454

APA

Chuah, E., Jhumka, A., Malek, M., & Suri, N. (2022). A Survey of Log-Correlation Tools for Failure Diagnosis and Prediction in Cluster Systems. IEEE Access, 10, 133487-133503. https://doi.org/10.1109/ACCESS.2022.3231454

Vancouver

Chuah E, Jhumka A, Malek M, Suri N. A Survey of Log-Correlation Tools for Failure Diagnosis and Prediction in Cluster Systems. IEEE Access. 2022 Dec 29;10:133487-133503. Epub 2022 Dec 21. doi: 10.1109/ACCESS.2022.3231454

Author

Chuah, Edward ; Jhumka, Arshad ; Malek, Miroslaw et al. / A Survey of Log-Correlation Tools for Failure Diagnosis and Prediction in Cluster Systems. In: IEEE Access. 2022 ; Vol. 10. pp. 133487-133503.

Bibtex

@article{7d921b46774246e4ab018f50ee13574e,

title = "A Survey of Log-Correlation Tools for Failure Diagnosis and Prediction in Cluster Systems",

abstract = "System logs are the first source of information available to system designers to analyze and troubleshoot their cluster systems. For example, High-Performance Computing (HPC) systems generate a large volume of heterogeneous data from multiple sub-systems, so the idea of using a single source of data to achieve a given goal, such as identification of failures, is losing its validity. System log-analysis tools assist system designers gain understanding into a large volume of system logs. They enable system designers to perform various analyses (e.g., diagnosing node failures or predicting node failures). Current system log-analysis tools vary significantly in their function and design. We conduct a systematic review of literature on system log-analysis tools and select 46 representative articles out of 3,758 initial articles. To the best of our knowledge, there is no work that studied the characteristics of log-correlation tools (LogCTs) with respect to four quality attributes including (a) spurious correlations, (b) correlation threshold settings, (c) outliers in the data and (d) missing data. In this paper, we (a) propose a quality model to evaluate LogCTs and (b) use this quality model to evaluate and recommend current LogCTs. Through our review, we (a) identify papers on LogCTs, (b) build a quality model consisting of the four quality attributes and (c) discuss several open challenges for future research. Our study highlights the advantages and limitations of existing LogCTs and identifies research opportunities that could facilitate better failure handling in large cluster systems.",

author = "Edward Chuah and Arshad Jhumka and Miroslaw Malek and Neeraj Suri",

year = "2022",

month = dec,

day = "29",

doi = "10.1109/ACCESS.2022.3231454",

language = "English",

volume = "10",

pages = "133487--133503",

journal = "IEEE Access",

issn = "2169-3536",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

RIS

TY - JOUR

T1 - A Survey of Log-Correlation Tools for Failure Diagnosis and Prediction in Cluster Systems

AU - Chuah, Edward

AU - Jhumka, Arshad

AU - Malek, Miroslaw

AU - Suri, Neeraj

PY - 2022/12/29

Y1 - 2022/12/29

N2 - System logs are the first source of information available to system designers to analyze and troubleshoot their cluster systems. For example, High-Performance Computing (HPC) systems generate a large volume of heterogeneous data from multiple sub-systems, so the idea of using a single source of data to achieve a given goal, such as identification of failures, is losing its validity. System log-analysis tools assist system designers gain understanding into a large volume of system logs. They enable system designers to perform various analyses (e.g., diagnosing node failures or predicting node failures). Current system log-analysis tools vary significantly in their function and design. We conduct a systematic review of literature on system log-analysis tools and select 46 representative articles out of 3,758 initial articles. To the best of our knowledge, there is no work that studied the characteristics of log-correlation tools (LogCTs) with respect to four quality attributes including (a) spurious correlations, (b) correlation threshold settings, (c) outliers in the data and (d) missing data. In this paper, we (a) propose a quality model to evaluate LogCTs and (b) use this quality model to evaluate and recommend current LogCTs. Through our review, we (a) identify papers on LogCTs, (b) build a quality model consisting of the four quality attributes and (c) discuss several open challenges for future research. Our study highlights the advantages and limitations of existing LogCTs and identifies research opportunities that could facilitate better failure handling in large cluster systems.

AB - System logs are the first source of information available to system designers to analyze and troubleshoot their cluster systems. For example, High-Performance Computing (HPC) systems generate a large volume of heterogeneous data from multiple sub-systems, so the idea of using a single source of data to achieve a given goal, such as identification of failures, is losing its validity. System log-analysis tools assist system designers gain understanding into a large volume of system logs. They enable system designers to perform various analyses (e.g., diagnosing node failures or predicting node failures). Current system log-analysis tools vary significantly in their function and design. We conduct a systematic review of literature on system log-analysis tools and select 46 representative articles out of 3,758 initial articles. To the best of our knowledge, there is no work that studied the characteristics of log-correlation tools (LogCTs) with respect to four quality attributes including (a) spurious correlations, (b) correlation threshold settings, (c) outliers in the data and (d) missing data. In this paper, we (a) propose a quality model to evaluate LogCTs and (b) use this quality model to evaluate and recommend current LogCTs. Through our review, we (a) identify papers on LogCTs, (b) build a quality model consisting of the four quality attributes and (c) discuss several open challenges for future research. Our study highlights the advantages and limitations of existing LogCTs and identifies research opportunities that could facilitate better failure handling in large cluster systems.

U2 - 10.1109/ACCESS.2022.3231454

DO - 10.1109/ACCESS.2022.3231454

M3 - Journal article

VL - 10

SP - 133487

EP - 133503

JO - IEEE Access

JF - IEEE Access

SN - 2169-3536

ER -
