Electronic data

  • Emergent Failures

    Rights statement: ©2018 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

    Accepted author manuscript, 1.22 MB, PDF document

    Available under license: CC BY-NC: Creative Commons Attribution-NonCommercial 4.0 International License

Emergent Failures: Rethinking Cloud Reliability at Scale

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Standard

Emergent Failures: Rethinking Cloud Reliability at Scale. / Garraghan, Peter; Yang, Renyu; Wen, Zhenyu et al.
In: IEEE Cloud Computing, Vol. 5, No. 5, 18.10.2018, p. 12-21.

Harvard

Garraghan, P, Yang, R, Wen, Z, Romanovsky, A, Xu, J, Buyya, R & Ranjan, R 2018, 'Emergent Failures: Rethinking Cloud Reliability at Scale', IEEE Cloud Computing, vol. 5, no. 5, pp. 12-21. https://doi.org/10.1109/MCC.2018.053711662

APA

Garraghan, P., Yang, R., Wen, Z., Romanovsky, A., Xu, J., Buyya, R., & Ranjan, R. (2018). Emergent Failures: Rethinking Cloud Reliability at Scale. IEEE Cloud Computing, 5(5), 12-21. https://doi.org/10.1109/MCC.2018.053711662

Vancouver

Garraghan P, Yang R, Wen Z, Romanovsky A, Xu J, Buyya R et al. Emergent Failures: Rethinking Cloud Reliability at Scale. IEEE Cloud Computing. 2018 Oct 18;5(5):12-21. doi: 10.1109/MCC.2018.053711662

Author

Garraghan, Peter; Yang, Renyu; Wen, Zhenyu et al. / Emergent Failures: Rethinking Cloud Reliability at Scale. In: IEEE Cloud Computing. 2018; Vol. 5, No. 5. pp. 12-21.

Bibtex

@article{a790999654954f8e8905cfe79abaa7e3,
title = "Emergent Failures: Rethinking Cloud Reliability at Scale",
abstract = "Since the conception of cloud computing, ensuring its ability to provide highly reliable service has been of the utmost importance and criticality to the business objectives of providers and their customers. This has held true for every facet of the system, encompassing applications, resource management, the underlying computing infrastructure, and environmental cooling. Thus, the cloud-computing and dependability research communities have exerted considerable effort toward enhancing the reliability of system components against various software and hardware failures. However, as these systems have continued to grow in scale, with heterogeneity and complexity resulting in the manifestation of emergent behavior, so too have their respective failures. Recent studies of production cloud datacenters indicate the existence of complex failure manifestations that existing fault tolerance and recovery strategies are ill-equipped to effectively handle. These strategies can even be responsible for such failures. These emergent failures, frequently transient and identifiable only at runtime, represent a significant threat to designing reliable cloud systems. This article identifies the challenges of emergent failures in cloud datacenters at scale and their impact on system resource management, and discusses potential directions of further study for Internet of Things integration and holistic fault tolerance.",
author = "Peter Garraghan and Renyu Yang and Zhenyu Wen and Alexander Romanovsky and Jie Xu and Rajkumar Buyya and Rajiv Ranjan",
note = "{\textcopyright}2018 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.",
year = "2018",
month = oct,
day = "18",
doi = "10.1109/MCC.2018.053711662",
language = "English",
volume = "5",
pages = "12--21",
journal = "IEEE Cloud Computing",
issn = "2325-6095",
publisher = "IEEE",
number = "5",

}

RIS

TY - JOUR

T1 - Emergent Failures

T2 - Rethinking Cloud Reliability at Scale

AU - Garraghan, Peter

AU - Yang, Renyu

AU - Wen, Zhenyu

AU - Romanovsky, Alexander

AU - Xu, Jie

AU - Buyya, Rajkumar

AU - Ranjan, Rajiv

N1 - ©2018 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

PY - 2018/10/18

Y1 - 2018/10/18

N2 - Since the conception of cloud computing, ensuring its ability to provide highly reliable service has been of the utmost importance and criticality to the business objectives of providers and their customers. This has held true for every facet of the system, encompassing applications, resource management, the underlying computing infrastructure, and environmental cooling. Thus, the cloud-computing and dependability research communities have exerted considerable effort toward enhancing the reliability of system components against various software and hardware failures. However, as these systems have continued to grow in scale, with heterogeneity and complexity resulting in the manifestation of emergent behavior, so too have their respective failures. Recent studies of production cloud datacenters indicate the existence of complex failure manifestations that existing fault tolerance and recovery strategies are ill-equipped to effectively handle. These strategies can even be responsible for such failures. These emergent failures, frequently transient and identifiable only at runtime, represent a significant threat to designing reliable cloud systems. This article identifies the challenges of emergent failures in cloud datacenters at scale and their impact on system resource management, and discusses potential directions of further study for Internet of Things integration and holistic fault tolerance.

AB - Since the conception of cloud computing, ensuring its ability to provide highly reliable service has been of the utmost importance and criticality to the business objectives of providers and their customers. This has held true for every facet of the system, encompassing applications, resource management, the underlying computing infrastructure, and environmental cooling. Thus, the cloud-computing and dependability research communities have exerted considerable effort toward enhancing the reliability of system components against various software and hardware failures. However, as these systems have continued to grow in scale, with heterogeneity and complexity resulting in the manifestation of emergent behavior, so too have their respective failures. Recent studies of production cloud datacenters indicate the existence of complex failure manifestations that existing fault tolerance and recovery strategies are ill-equipped to effectively handle. These strategies can even be responsible for such failures. These emergent failures, frequently transient and identifiable only at runtime, represent a significant threat to designing reliable cloud systems. This article identifies the challenges of emergent failures in cloud datacenters at scale and their impact on system resource management, and discusses potential directions of further study for Internet of Things integration and holistic fault tolerance.

U2 - 10.1109/MCC.2018.053711662

DO - 10.1109/MCC.2018.053711662

M3 - Journal article

VL - 5

SP - 12

EP - 21

JO - IEEE Cloud Computing

JF - IEEE Cloud Computing

SN - 2325-6095

IS - 5

ER -