Online diagnosis and recovery: On the choice and impact of tuning parameters

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Standard

Online diagnosis and recovery: On the choice and impact of tuning parameters. / Serafini, M.; Bondavalli, A.; Suri, Neeraj.
In: IEEE Transactions on Dependable and Secure Computing, Vol. 4, No. 4, 12.11.2007, p. 295-312.

Harvard

Serafini, M, Bondavalli, A & Suri, N 2007, 'Online diagnosis and recovery: On the choice and impact of tuning parameters', IEEE Transactions on Dependable and Secure Computing, vol. 4, no. 4, pp. 295-312. https://doi.org/10.1109/TDSC.2007.70210

APA

Serafini, M., Bondavalli, A., & Suri, N. (2007). Online diagnosis and recovery: On the choice and impact of tuning parameters. IEEE Transactions on Dependable and Secure Computing, 4(4), 295-312. https://doi.org/10.1109/TDSC.2007.70210

Vancouver

Serafini M, Bondavalli A, Suri N. Online diagnosis and recovery: On the choice and impact of tuning parameters. IEEE Transactions on Dependable and Secure Computing. 2007 Nov 12;4(4):295-312. doi: 10.1109/TDSC.2007.70210

Author

Serafini, M.; Bondavalli, A.; Suri, Neeraj. / Online diagnosis and recovery: On the choice and impact of tuning parameters. In: IEEE Transactions on Dependable and Secure Computing. 2007; Vol. 4, No. 4. pp. 295-312.

Bibtex

@article{2ed36d2681b8486f82fe1ebffac7b831,
title = "Online diagnosis and recovery: On the choice and impact of tuning parameters",
abstract = "A sequenced process of Fault Detection followed by the erroneous node's Isolation and system Reconfiguration (node exclusion or recovery), that is, the FDIR process, characterizes the sustained operations of a fault-tolerant system. For distributed systems utilizing message passing, a number of diagnostic (and associated FDIR) approaches, including our prior algorithms, exist in literature and practice. Invariably, the focus is on proving the completeness and correctness (all and only the faulty nodes are isolated) for the chosen fault model, without explicitly segregating permanent from transient faulty nodes. To capture diagnostic issues related to the persistence of errors (transient, intermittent, and permanent), we advocate the integration of count-and-threshold mechanisms into the FDIR framework. Targeting pragmatic system issues, we develop an adaptive online FDIR framework that handles a continuum of fault models and diagnostic protocols and comprehensively characterizes the role of various probabilistic parameters that, due to the count-and-threshold approach, influence the correctness and completeness of diagnosis and system reliability such as the fault detection frequency. The FDIR framework has been implemented on two prototypes for automotive and aerospace applications. The tuning of the protocol parameters at design time allows a significant improvement with respect to prior design choices. {\textcopyright} 2007 IEEE.",
keywords = "Error detection, Online diagnosis, Recovery, System reliability, Transient faults, Diagnostic protocols, System Reconfiguration, Transient faulty nodes, Algorithms, Distributed computer systems, Fault detection, Fault tolerant computer systems, Online systems, Computer aided diagnosis",
author = "M. Serafini and A. Bondavalli and Neeraj Suri",
year = "2007",
month = nov,
day = "12",
doi = "10.1109/TDSC.2007.70210",
language = "English",
volume = "4",
pages = "295--312",
journal = "IEEE Transactions on Dependable and Secure Computing",
issn = "1545-5971",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
number = "4",
}

RIS

TY  - JOUR
T1  - Online diagnosis and recovery
T2  - On the choice and impact of tuning parameters
AU  - Serafini, M.
AU  - Bondavalli, A.
AU  - Suri, Neeraj
PY  - 2007/11/12
Y1  - 2007/11/12
N2  - A sequenced process of Fault Detection followed by the erroneous node's Isolation and system Reconfiguration (node exclusion or recovery), that is, the FDIR process, characterizes the sustained operations of a fault-tolerant system. For distributed systems utilizing message passing, a number of diagnostic (and associated FDIR) approaches, including our prior algorithms, exist in literature and practice. Invariably, the focus is on proving the completeness and correctness (all and only the faulty nodes are isolated) for the chosen fault model, without explicitly segregating permanent from transient faulty nodes. To capture diagnostic issues related to the persistence of errors (transient, intermittent, and permanent), we advocate the integration of count-and-threshold mechanisms into the FDIR framework. Targeting pragmatic system issues, we develop an adaptive online FDIR framework that handles a continuum of fault models and diagnostic protocols and comprehensively characterizes the role of various probabilistic parameters that, due to the count-and-threshold approach, influence the correctness and completeness of diagnosis and system reliability such as the fault detection frequency. The FDIR framework has been implemented on two prototypes for automotive and aerospace applications. The tuning of the protocol parameters at design time allows a significant improvement with respect to prior design choices. © 2007 IEEE.
AB  - A sequenced process of Fault Detection followed by the erroneous node's Isolation and system Reconfiguration (node exclusion or recovery), that is, the FDIR process, characterizes the sustained operations of a fault-tolerant system. For distributed systems utilizing message passing, a number of diagnostic (and associated FDIR) approaches, including our prior algorithms, exist in literature and practice. Invariably, the focus is on proving the completeness and correctness (all and only the faulty nodes are isolated) for the chosen fault model, without explicitly segregating permanent from transient faulty nodes. To capture diagnostic issues related to the persistence of errors (transient, intermittent, and permanent), we advocate the integration of count-and-threshold mechanisms into the FDIR framework. Targeting pragmatic system issues, we develop an adaptive online FDIR framework that handles a continuum of fault models and diagnostic protocols and comprehensively characterizes the role of various probabilistic parameters that, due to the count-and-threshold approach, influence the correctness and completeness of diagnosis and system reliability such as the fault detection frequency. The FDIR framework has been implemented on two prototypes for automotive and aerospace applications. The tuning of the protocol parameters at design time allows a significant improvement with respect to prior design choices. © 2007 IEEE.
KW  - Error detection
KW  - Online diagnosis
KW  - Recovery
KW  - System reliability
KW  - Transient faults
KW  - Diagnostic protocols
KW  - System Reconfiguration
KW  - Transient faulty nodes
KW  - Algorithms
KW  - Distributed computer systems
KW  - Fault detection
KW  - Fault tolerant computer systems
KW  - Online systems
KW  - Computer aided diagnosis
U2  - 10.1109/TDSC.2007.70210
DO  - 10.1109/TDSC.2007.70210
M3  - Journal article
VL  - 4
SP  - 295
EP  - 312
JO  - IEEE Transactions on Dependable and Secure Computing
JF  - IEEE Transactions on Dependable and Secure Computing
SN  - 1545-5971
IS  - 4
ER  -