
Electronic data

  • Reducing Late Timing Failure

    Accepted author manuscript, 193 KB, PDF document

    Available under license: CC BY: Creative Commons Attribution 4.0 International License


Reducing late-timing failure at scale: straggler root-cause analysis in cloud datacenters

Research output: Contribution to conference - Without ISBN/ISSN, Conference paper, peer-reviewed

Published
Publication date: 18/08/2016
Number of pages: 2
Original language: English
Event: 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks - École Nationale de l’Aviation Civile (ENAC), Toulouse, France
Duration: 28/06/2016 to 1/07/2016
https://dsn-2016.sciencesconf.org/

Conference

Conference: 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
Abbreviated title: DSN 2016
Country/Territory: France
City: Toulouse
Period: 28/06/16 to 1/07/16

Abstract

Task stragglers hinder effective parallel job execution in Cloud datacenters, resulting in late-timing failures when specified timing constraints are violated. Straggler-tolerant methods such as speculative execution offer limited effectiveness because (i) they lack precise knowledge of straggler root causes and (ii) they identify stragglers too late in the job lifecycle. This paper proposes a method to ascertain the underlying root causes of stragglers by analyzing key parameters within large-scale distributed systems, and to determine the correlation between straggler occurrence and factors including resource contention, task concurrency, and server failures. Our preliminary study of a production Cloud datacenter indicates that the dominant straggler root cause is high temporal resource contention. This result can help improve straggler prediction and mitigation for tolerating late-timing failures in large-scale distributed systems.
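
As a rough illustration of the kind of correlation analysis the abstract describes, the sketch below flags stragglers within a single job and correlates the flag with one candidate factor. It is not the paper's method: the straggler rule (duration above 1.5 times the job's median task duration), the function names `flag_stragglers` and `pearson`, and the sample durations and CPU utilization values are all assumptions made up for the example.

```python
from math import sqrt
from statistics import mean, median


def flag_stragglers(durations, threshold=1.5):
    """Flag tasks slower than threshold x the job's median duration.

    The 1.5x-median rule is an assumed definition for illustration only.
    """
    cutoff = threshold * median(durations)
    return [d > cutoff for d in durations]


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Hypothetical per-task measurements for one job (not data from the paper).
durations = [10.0, 11.0, 9.5, 10.5, 30.0, 10.2, 28.0, 9.8]  # seconds
cpu_util = [0.40, 0.45, 0.38, 0.42, 0.95, 0.41, 0.90, 0.39]  # host CPU load

flags = flag_stragglers(durations)
# Correlate the 0/1 straggler indicator with the candidate factor.
r = pearson([1.0 if f else 0.0 for f in flags], cpu_util)
print(f"stragglers: {sum(flags)} of {len(flags)}, correlation with CPU load: {r:.2f}")
```

In a real study the same comparison would be repeated for each candidate factor (for example task concurrency or server failure events) across many jobs, and a strong correlation would only suggest, not prove, a root cause.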