Reliable computing service in massive-scale systems through rapid low-cost failover

Computing and Communications

Associated organisational units

Electronic data

tsc-2016-camera-ready-v10
Rights statement: © 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.
Accepted author manuscript, 3.2 MB, PDF document
Available under license: CC BY: Creative Commons Attribution 4.0 International License

Text available via DOI:

https://doi.org/10.1109/TSC.2016.2544313
Final published version

Keywords

Failover, Cloud computing, Resource management, Reliability, Services

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Standard

Reliable computing service in massive-scale systems through rapid low-cost failover. / Yang, Renyu; Zhang, Yang; Garraghan, Peter et al.
In: IEEE Transactions on Services Computing, Vol. 10, No. 6, 11.2017, p. 969-983.

Harvard

Yang, R, Zhang, Y, Garraghan, P, Feng, Y, Ouyang, J, Xu, J, Zhang, Z & Li, C 2017, 'Reliable computing service in massive-scale systems through rapid low-cost failover', IEEE Transactions on Services Computing, vol. 10, no. 6, pp. 969-983. https://doi.org/10.1109/TSC.2016.2544313

APA

Yang, R., Zhang, Y., Garraghan, P., Feng, Y., Ouyang, J., Xu, J., Zhang, Z., & Li, C. (2017). Reliable computing service in massive-scale systems through rapid low-cost failover. IEEE Transactions on Services Computing, 10(6), 969-983. https://doi.org/10.1109/TSC.2016.2544313

Vancouver

Yang R, Zhang Y, Garraghan P, Feng Y, Ouyang J, Xu J et al. Reliable computing service in massive-scale systems through rapid low-cost failover. IEEE Transactions on Services Computing. 2017 Nov;10(6):969-983. Epub 2016 Mar 21. doi: 10.1109/TSC.2016.2544313

Author

Yang, Renyu ; Zhang, Yang ; Garraghan, Peter et al. / Reliable computing service in massive-scale systems through rapid low-cost failover. In: IEEE Transactions on Services Computing. 2017 ; Vol. 10, No. 6. pp. 969-983.

Bibtex

@article{9ac0e732a6004028a57aa9c058715a77,

title = "Reliable computing service in massive-scale systems through rapid low-cost failover",

abstract = "Large-scale distributed systems in Cloud datacenter are capable of provisioning service to consumers with diverse business requirements. Providers face pressure to provision uninterrupted reliable services while reducing operational costs due to significant software and hardware failures. A widely used means to achieve such a goal is using redundant system components to implement user-transparent failover, yet its effectiveness must be balanced carefully without incurring heavy overhead when deployed – an important practical consideration for complex large-scale systems. Failover techniques developed for Cloud systems often suffer serious limitations, including mandatory restart leading to poor cost-effectiveness, as well as solely focusing on crash failures, omitting other important types, e.g. timing failures and simultaneous failures. This paper addresses these limitations by presenting a new approach to user-transparent failover for massive-scale systems. The approach uses soft-state inference to achieve rapid failure recovery and avoid unnecessary restart, with minimal system resource overhead. It also copes with different failures, including correlated and simultaneous events. The proposed approach was implemented, deployed and evaluated within Fuxi system, the underlying resource management system used within Alibaba Cloud. Results demonstrate that our approach tolerates complex failure scenarios while incurring at worst 228.5 microsecond instance overhead with 1.71% additional CPU usage.",

keywords = "Failover, Cloud computing, Resource management, Reliability, Services",

author = "Renyu Yang and Yang Zhang and Peter Garraghan and Yihui Feng and Jin Ouyang and Jie Xu and Zhuo Zhang and Chao Li",

note = "{\textcopyright} 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.",

year = "2017",

month = nov,

doi = "10.1109/TSC.2016.2544313",

language = "English",

volume = "10",

pages = "969--983",

journal = "IEEE Transactions on Services Computing",

issn = "1939-1374",

publisher = "Institute of Electrical and Electronics Engineers",

number = "6",

}

RIS

TY - JOUR

T1 - Reliable computing service in massive-scale systems through rapid low-cost failover

AU - Yang, Renyu

AU - Zhang, Yang

AU - Garraghan, Peter

AU - Feng, Yihui

AU - Ouyang, Jin

AU - Xu, Jie

AU - Zhang, Zhuo

AU - Li, Chao

N1 - © 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.

PY - 2017/11

Y1 - 2017/11

N2 - Large-scale distributed systems in Cloud datacenter are capable of provisioning service to consumers with diverse business requirements. Providers face pressure to provision uninterrupted reliable services while reducing operational costs due to significant software and hardware failures. A widely used means to achieve such a goal is using redundant system components to implement user-transparent failover, yet its effectiveness must be balanced carefully without incurring heavy overhead when deployed – an important practical consideration for complex large-scale systems. Failover techniques developed for Cloud systems often suffer serious limitations, including mandatory restart leading to poor cost-effectiveness, as well as solely focusing on crash failures, omitting other important types, e.g. timing failures and simultaneous failures. This paper addresses these limitations by presenting a new approach to user-transparent failover for massive-scale systems. The approach uses soft-state inference to achieve rapid failure recovery and avoid unnecessary restart, with minimal system resource overhead. It also copes with different failures, including correlated and simultaneous events. The proposed approach was implemented, deployed and evaluated within Fuxi system, the underlying resource management system used within Alibaba Cloud. Results demonstrate that our approach tolerates complex failure scenarios while incurring at worst 228.5 microsecond instance overhead with 1.71% additional CPU usage.

AB - Large-scale distributed systems in Cloud datacenter are capable of provisioning service to consumers with diverse business requirements. Providers face pressure to provision uninterrupted reliable services while reducing operational costs due to significant software and hardware failures. A widely used means to achieve such a goal is using redundant system components to implement user-transparent failover, yet its effectiveness must be balanced carefully without incurring heavy overhead when deployed – an important practical consideration for complex large-scale systems. Failover techniques developed for Cloud systems often suffer serious limitations, including mandatory restart leading to poor cost-effectiveness, as well as solely focusing on crash failures, omitting other important types, e.g. timing failures and simultaneous failures. This paper addresses these limitations by presenting a new approach to user-transparent failover for massive-scale systems. The approach uses soft-state inference to achieve rapid failure recovery and avoid unnecessary restart, with minimal system resource overhead. It also copes with different failures, including correlated and simultaneous events. The proposed approach was implemented, deployed and evaluated within Fuxi system, the underlying resource management system used within Alibaba Cloud. Results demonstrate that our approach tolerates complex failure scenarios while incurring at worst 228.5 microsecond instance overhead with 1.71% additional CPU usage.

KW - Failover

KW - Cloud computing

KW - Resource management

KW - Reliability

KW - Services

U2 - 10.1109/TSC.2016.2544313

DO - 10.1109/TSC.2016.2544313

M3 - Journal article

VL - 10

SP - 969

EP - 983

JO - IEEE Transactions on Services Computing

JF - IEEE Transactions on Services Computing

SN - 1939-1374

IS - 6

ER -
