Electronic data

  • tsc2016b

    Rights statement: © 2019 IEEE. This is an author produced version of a paper published in IEEE Transactions on Services Computing. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Uploaded in accordance with the publisher's self-archiving policy.

    Accepted author manuscript, 2.11 MB, PDF document

    Available under license: CC BY: Creative Commons Attribution 4.0 International License

Straggler root-cause and impact analysis for massive-scale virtualized cloud datacenters

Research output: Contribution to Journal/Magazine > Journal article > peer-review

Published

Standard

Straggler root-cause and impact analysis for massive-scale virtualized cloud datacenters. / Garraghan, Peter; Ouyang, Xue; Yang, Renyu et al.

In: IEEE Transactions on Services Computing, Vol. 12, No. 1, 01.01.2019, p. 91-104.

Harvard

Garraghan, P, Ouyang, X, Yang, R, McKee, D & Xu, J 2019, 'Straggler root-cause and impact analysis for massive-scale virtualized cloud datacenters', IEEE Transactions on Services Computing, vol. 12, no. 1, pp. 91-104. https://doi.org/10.1109/TSC.2016.2611578

APA

Garraghan, P., Ouyang, X., Yang, R., McKee, D., & Xu, J. (2019). Straggler root-cause and impact analysis for massive-scale virtualized cloud datacenters. IEEE Transactions on Services Computing, 12(1), 91-104. https://doi.org/10.1109/TSC.2016.2611578

Vancouver

Garraghan P, Ouyang X, Yang R, McKee D, Xu J. Straggler root-cause and impact analysis for massive-scale virtualized cloud datacenters. IEEE Transactions on Services Computing. 2019 Jan 1;12(1):91-104. Epub 2016 Sep 20. doi: 10.1109/TSC.2016.2611578

Author

Garraghan, Peter ; Ouyang, Xue ; Yang, Renyu et al. / Straggler root-cause and impact analysis for massive-scale virtualized cloud datacenters. In: IEEE Transactions on Services Computing. 2019 ; Vol. 12, No. 1. pp. 91-104.

Bibtex

@article{6d6eda9a4aed443891e68b4ee023be3f,
title = "Straggler root-cause and impact analysis for massive-scale virtualized cloud datacenters",
abstract = "Increased complexity and scale of virtualized distributed systems have resulted in the manifestation of emergent phenomena substantially affecting overall system performance. This phenomenon is known as “Long Tail”, whereby a small proportion of task stragglers significantly impede job completion time. While work focuses on straggler detection and mitigation, there is limited work that empirically studies straggler root-cause and quantifies its impact upon system operation. Such analysis is critical to ascertain in-depth knowledge of straggler occurrence for focusing developmental and research efforts towards solving the Long Tail challenge. This paper provides an empirical analysis of straggler root-cause within virtualized Cloud datacenters; we analyze two large-scale production systems to quantify the frequency and impact stragglers impose, and propose a method for conducting root-cause analysis. Results demonstrate approximately 5% of task stragglers impact 50% of total jobs for batch processes, and 53% of stragglers occur due to high server resource utilization. We leverage these findings to propose a method for extreme straggler detection through a combination of offline execution patterns modeling and online analytic agents to monitor tasks at runtime. Experiments show the approach is capable of detecting stragglers less than 11% into their execution lifecycle with 95% accuracy for short-duration jobs.",
keywords = "Cloud computing, Straggler, Distributed Systems, Root-cause analysis, Datacenter",
author = "Peter Garraghan and Xue Ouyang and Renyu Yang and David McKee and Jie Xu",
note = "{\textcopyright} 2019 IEEE. This is an author produced version of a paper published in IEEE Transactions on Services Computing. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Uploaded in accordance with the publisher's self-archiving policy.",
year = "2019",
month = jan,
day = "1",
doi = "10.1109/TSC.2016.2611578",
language = "English",
volume = "12",
pages = "91--104",
journal = "IEEE Transactions on Services Computing",
issn = "1939-1374",
publisher = "Institute of Electrical and Electronics Engineers",
number = "1",

}

RIS

TY - JOUR

T1 - Straggler root-cause and impact analysis for massive-scale virtualized cloud datacenters

AU - Garraghan, Peter

AU - Ouyang, Xue

AU - Yang, Renyu

AU - McKee, David

AU - Xu, Jie

N1 - © 2019 IEEE. This is an author produced version of a paper published in IEEE Transactions on Services Computing. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Uploaded in accordance with the publisher's self-archiving policy.

PY - 2019/1/1

Y1 - 2019/1/1

AB - Increased complexity and scale of virtualized distributed systems have resulted in the manifestation of emergent phenomena substantially affecting overall system performance. This phenomenon is known as “Long Tail”, whereby a small proportion of task stragglers significantly impede job completion time. While work focuses on straggler detection and mitigation, there is limited work that empirically studies straggler root-cause and quantifies its impact upon system operation. Such analysis is critical to ascertain in-depth knowledge of straggler occurrence for focusing developmental and research efforts towards solving the Long Tail challenge. This paper provides an empirical analysis of straggler root-cause within virtualized Cloud datacenters; we analyze two large-scale production systems to quantify the frequency and impact stragglers impose, and propose a method for conducting root-cause analysis. Results demonstrate approximately 5% of task stragglers impact 50% of total jobs for batch processes, and 53% of stragglers occur due to high server resource utilization. We leverage these findings to propose a method for extreme straggler detection through a combination of offline execution patterns modeling and online analytic agents to monitor tasks at runtime. Experiments show the approach is capable of detecting stragglers less than 11% into their execution lifecycle with 95% accuracy for short-duration jobs.

KW - Cloud computing

KW - Straggler

KW - Distributed Systems

KW - Root-cause analysis

KW - Datacenter

U2 - 10.1109/TSC.2016.2611578

DO - 10.1109/TSC.2016.2611578

M3 - Journal article

VL - 12

SP - 91

EP - 104

JO - IEEE Transactions on Services Computing

JF - IEEE Transactions on Services Computing

SN - 1939-1374

IS - 1

ER -