An empirical failure-analysis of a large-scale cloud computing environment

Associated organisational unit

Security Lancaster

Electronic data

Empirical Failure analysis
Rights statement: © 2014 IEEE. This is an author produced version of a paper published in 2014 IEEE 15th International Symposium on High-Assurance Systems Engineering, HASE 2014. Uploaded in accordance with the publisher's self-archiving policy. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.
Accepted author manuscript, 519 KB, PDF document
Available under license: CC BY: Creative Commons Attribution 4.0 International License

Text available via DOI:

https://doi.org/10.1109/HASE.2014.24
Final published version

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Published

Standard

An empirical failure-analysis of a large-scale cloud computing environment. / Garraghan, Peter; Townend, Paul; Xu, Jie.
2014 IEEE 15th International Symposium on High-Assurance Systems Engineering. IEEE, 2014. p. 113-120.

Harvard

Garraghan, P, Townend, P & Xu, J 2014, An empirical failure-analysis of a large-scale cloud computing environment. in 2014 IEEE 15th International Symposium on High-Assurance Systems Engineering. IEEE, pp. 113-120. https://doi.org/10.1109/HASE.2014.24

APA

Garraghan, P., Townend, P., & Xu, J. (2014). An empirical failure-analysis of a large-scale cloud computing environment. In 2014 IEEE 15th International Symposium on High-Assurance Systems Engineering (pp. 113-120). IEEE. https://doi.org/10.1109/HASE.2014.24

Vancouver

Garraghan P, Townend P, Xu J. An empirical failure-analysis of a large-scale cloud computing environment. In 2014 IEEE 15th International Symposium on High-Assurance Systems Engineering. IEEE. 2014. p. 113-120. doi: 10.1109/HASE.2014.24

Author

Garraghan, Peter ; Townend, Paul ; Xu, Jie. / An empirical failure-analysis of a large-scale cloud computing environment. 2014 IEEE 15th International Symposium on High-Assurance Systems Engineering. IEEE, 2014. pp. 113-120

Bibtex

@inproceedings{7b4958fed33e4136a829cdc4be417d47,

title = "An empirical failure-analysis of a large-scale cloud computing environment",

abstract = "Cloud computing research is in great need of statistical parameters derived from the analysis of real-world systems. One aspect of this is the failure characteristics of Cloud environments composed of workloads and servers; currently, few metrics are available that quantify failure and repair times of workloads and servers at large scale. Workload metrics in particular are critical for characterizing and modeling accurate workload behavior, enabling more realistic workload simulation and failure scenarios of systems. This paper presents the analysis of failure data of a large-scale production Cloud environment (consisting of over 12,500 servers), and includes a study of failure and repair times and characteristics for both Cloud workloads and servers. Our results show that failure characteristics for workloads and servers are highly variable and that production Cloud workloads can be accurately modeled by a Gamma distribution. Repair times range from 30 seconds to 4 days, and from 25 minutes to 8 days, for workloads and servers respectively.",

author = "Peter Garraghan and Paul Townend and Jie Xu",

note = "{\textcopyright} 2014 IEEE. This is an author produced version of a paper published in 2014 IEEE 15th International Symposium on High-Assurance Systems Engineering, HASE 2014. Uploaded in accordance with the publisher's self-archiving policy. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.",

year = "2014",

month = mar,

day = "6",

doi = "10.1109/HASE.2014.24",

language = "English",

pages = "113--120",

booktitle = "2014 IEEE 15th International Symposium on High-Assurance Systems Engineering",

publisher = "IEEE",

}

RIS

TY - CONF

T1 - An empirical failure-analysis of a large-scale cloud computing environment

AU - Garraghan, Peter

AU - Townend, Paul

AU - Xu, Jie

N1 - © 2014 IEEE. This is an author produced version of a paper published in 2014 IEEE 15th International Symposium on High-Assurance Systems Engineering, HASE 2014. Uploaded in accordance with the publisher's self-archiving policy. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.

PY - 2014/3/6

Y1 - 2014/3/6

N2 - Cloud computing research is in great need of statistical parameters derived from the analysis of real-world systems. One aspect of this is the failure characteristics of Cloud environments composed of workloads and servers; currently, few metrics are available that quantify failure and repair times of workloads and servers at large scale. Workload metrics in particular are critical for characterizing and modeling accurate workload behavior, enabling more realistic workload simulation and failure scenarios of systems. This paper presents the analysis of failure data of a large-scale production Cloud environment (consisting of over 12,500 servers), and includes a study of failure and repair times and characteristics for both Cloud workloads and servers. Our results show that failure characteristics for workloads and servers are highly variable and that production Cloud workloads can be accurately modeled by a Gamma distribution. Repair times range from 30 seconds to 4 days, and from 25 minutes to 8 days, for workloads and servers respectively.

U2 - 10.1109/HASE.2014.24

DO - 10.1109/HASE.2014.24

M3 - Conference contribution/Paper

SP - 113

EP - 120

BT - 2014 IEEE 15th International Symposium on High-Assurance Systems Engineering

PB - IEEE

ER -
