Electronic data

  • Reducing Late Timing Failure

    Accepted author manuscript, 193 KB, PDF document

    Available under license: CC BY: Creative Commons Attribution 4.0 International License

Reducing late-timing failure at scale: straggler root-cause analysis in cloud datacenters

Research output: Contribution to conference - Without ISBN/ISSN › Conference paper › peer-review

Published

Standard

Reducing late-timing failure at scale: straggler root-cause analysis in cloud datacenters. / Ouyang, Xue; Garraghan, Peter; Yang, Renyu et al.
2016. Paper presented at 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, Toulouse, France.

Harvard

Ouyang, X, Garraghan, P, Yang, R, Townend, P & Xu, J 2016, 'Reducing late-timing failure at scale: straggler root-cause analysis in cloud datacenters', Paper presented at 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, Toulouse, France, 28/06/16 - 1/07/16. <https://hal.archives-ouvertes.fr/hal-01316515>

APA

Ouyang, X., Garraghan, P., Yang, R., Townend, P., & Xu, J. (2016). Reducing late-timing failure at scale: straggler root-cause analysis in cloud datacenters. Paper presented at 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, Toulouse, France. https://hal.archives-ouvertes.fr/hal-01316515

Vancouver

Ouyang X, Garraghan P, Yang R, Townend P, Xu J. Reducing late-timing failure at scale: straggler root-cause analysis in cloud datacenters. 2016. Paper presented at 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, Toulouse, France.

Author

Ouyang, Xue ; Garraghan, Peter ; Yang, Renyu et al. / Reducing late-timing failure at scale : straggler root-cause analysis in cloud datacenters. Paper presented at 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, Toulouse, France. 2 p.

Bibtex

@conference{dcff2961b690466ead97a7c2b0b07eef,
title = "Reducing late-timing failure at scale: straggler root-cause analysis in cloud datacenters",
abstract = "Task stragglers hinder effective parallel job execution in Cloud datacenters, resulting in late-timing failures due to the violation of specified timing constraints. Straggler-tolerant methods such as speculative execution provide limited effectiveness due to (i) lack of precise straggler root-cause knowledge and (ii) straggler identification occurring too late within a job lifecycle. This paper proposes a method to ascertain underlying straggler root-causes by analyzing key parameters within large-scale distributed systems, and to determine the correlation between straggler occurrence and factors including resource contention, task concurrency, and server failures. Our preliminary study of a production Cloud datacenter indicates that the dominant straggler root-cause is high temporal resource contention. The result can assist in enhancing straggler prediction and mitigation for tolerating late-timing failures within large-scale distributed systems.",
author = "Xue Ouyang and Peter Garraghan and Renyu Yang and Paul Townend and Jie Xu",
year = "2016",
month = aug,
day = "18",
language = "English",
note = "46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2016 ; Conference date: 28-06-2016 Through 01-07-2016",
url = "https://dsn-2016.sciencesconf.org/",

}

RIS

TY - CONF

T1 - Reducing late-timing failure at scale

T2 - 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks

AU - Ouyang, Xue

AU - Garraghan, Peter

AU - Yang, Renyu

AU - Townend, Paul

AU - Xu, Jie

PY - 2016/8/18

Y1 - 2016/8/18

N2 - Task stragglers hinder effective parallel job execution in Cloud datacenters, resulting in late-timing failures due to the violation of specified timing constraints. Straggler-tolerant methods such as speculative execution provide limited effectiveness due to (i) lack of precise straggler root-cause knowledge and (ii) straggler identification occurring too late within a job lifecycle. This paper proposes a method to ascertain underlying straggler root-causes by analyzing key parameters within large-scale distributed systems, and to determine the correlation between straggler occurrence and factors including resource contention, task concurrency, and server failures. Our preliminary study of a production Cloud datacenter indicates that the dominant straggler root-cause is high temporal resource contention. The result can assist in enhancing straggler prediction and mitigation for tolerating late-timing failures within large-scale distributed systems.

AB - Task stragglers hinder effective parallel job execution in Cloud datacenters, resulting in late-timing failures due to the violation of specified timing constraints. Straggler-tolerant methods such as speculative execution provide limited effectiveness due to (i) lack of precise straggler root-cause knowledge and (ii) straggler identification occurring too late within a job lifecycle. This paper proposes a method to ascertain underlying straggler root-causes by analyzing key parameters within large-scale distributed systems, and to determine the correlation between straggler occurrence and factors including resource contention, task concurrency, and server failures. Our preliminary study of a production Cloud datacenter indicates that the dominant straggler root-cause is high temporal resource contention. The result can assist in enhancing straggler prediction and mitigation for tolerating late-timing failures within large-scale distributed systems.

M3 - Conference paper

Y2 - 28 June 2016 through 1 July 2016

ER -