Electronic data

  • PID4055511_final

    Rights statement: (c) 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.

    Accepted author manuscript, 331 KB, PDF document

    Available under license: CC BY: Creative Commons Attribution 4.0 International License

Straggler detection in parallel computing systems through dynamic threshold calculation

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN > Conference contribution/Paper > peer-review

Published

Standard

Straggler detection in parallel computing systems through dynamic threshold calculation. / Ouyang, Xue; Garraghan, Peter; McKee, David et al.
2016 IEEE 30th International Conference on Advanced Information Networking and Applications (AINA). IEEE, 2016. p. 414-421.

Harvard

Ouyang, X, Garraghan, P, McKee, D, Townend, P & Xu, J 2016, Straggler detection in parallel computing systems through dynamic threshold calculation. in 2016 IEEE 30th International Conference on Advanced Information Networking and Applications (AINA). IEEE, pp. 414-421. https://doi.org/10.1109/AINA.2016.84

APA

Ouyang, X., Garraghan, P., McKee, D., Townend, P., & Xu, J. (2016). Straggler detection in parallel computing systems through dynamic threshold calculation. In 2016 IEEE 30th International Conference on Advanced Information Networking and Applications (AINA) (pp. 414-421). IEEE. https://doi.org/10.1109/AINA.2016.84

Vancouver

Ouyang X, Garraghan P, McKee D, Townend P, Xu J. Straggler detection in parallel computing systems through dynamic threshold calculation. In 2016 IEEE 30th International Conference on Advanced Information Networking and Applications (AINA). IEEE. 2016. p. 414-421. doi: 10.1109/AINA.2016.84

Author

Ouyang, Xue ; Garraghan, Peter ; McKee, David et al. / Straggler detection in parallel computing systems through dynamic threshold calculation. 2016 IEEE 30th International Conference on Advanced Information Networking and Applications (AINA). IEEE, 2016. pp. 414-421

Bibtex

@inproceedings{fec2284538f54f5888063a444f630876,
title = "Straggler detection in parallel computing systems through dynamic threshold calculation",
abstract = "Cloud computing systems face the substantial challenge of the Long Tail problem: a small subset of straggling tasks significantly impedes parallel job completion. This behavior results in longer service response times and degraded system utilization. Speculative execution, which creates task replicas at runtime, is a typical method deployed in large-scale distributed systems to tolerate stragglers. This approach defines stragglers by specifying a static threshold value for the temporal difference between an individual task and the average task progression for a job. However, specifying a static threshold debilitates speculation effectiveness as it fails to consider the intrinsic diversity of job timing constraints within modern-day Cloud computing systems. Capturing such heterogeneity makes it possible to impose different levels of strictness for replica creation while achieving specified levels of QoS for different application types. Furthermore, a static threshold also fails to consider system environmental constraints in terms of replication overheads and optimal system resource usage. In this paper we present an algorithm for dynamically calculating a threshold value to identify task stragglers, considering key parameters including job QoS timing constraints, task execution characteristics, and optimal system resource utilization. We study and demonstrate the effectiveness of our algorithm by simulating a number of different operational scenarios based on real production cluster data against state-of-the-art solutions. Results demonstrate that our approach is capable of creating 58.62% fewer replicas under high resource utilization while reducing response time by up to 17.86% for idle periods compared to a static threshold.",
keywords = "Quality of service, Timing, Heuristic algorithms, Cloud computing, Time factors, Resource management, Production",
author = "Xue Ouyang and Peter Garraghan and David McKee and Paul Townend and Jie Xu",
note = "(c) 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.",
year = "2016",
month = may,
day = "23",
doi = "10.1109/AINA.2016.84",
language = "English",
pages = "414--421",
booktitle = "2016 IEEE 30th International Conference on Advanced Information Networking and Applications (AINA)",
publisher = "IEEE",

}

RIS

TY - GEN

T1 - Straggler detection in parallel computing systems through dynamic threshold calculation

AU - Ouyang, Xue

AU - Garraghan, Peter

AU - McKee, David

AU - Townend, Paul

AU - Xu, Jie

N1 - (c) 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.

PY - 2016/5/23

Y1 - 2016/5/23

AB - Cloud computing systems face the substantial challenge of the Long Tail problem: a small subset of straggling tasks significantly impedes parallel job completion. This behavior results in longer service response times and degraded system utilization. Speculative execution, which creates task replicas at runtime, is a typical method deployed in large-scale distributed systems to tolerate stragglers. This approach defines stragglers by specifying a static threshold value for the temporal difference between an individual task and the average task progression for a job. However, specifying a static threshold debilitates speculation effectiveness as it fails to consider the intrinsic diversity of job timing constraints within modern-day Cloud computing systems. Capturing such heterogeneity makes it possible to impose different levels of strictness for replica creation while achieving specified levels of QoS for different application types. Furthermore, a static threshold also fails to consider system environmental constraints in terms of replication overheads and optimal system resource usage. In this paper we present an algorithm for dynamically calculating a threshold value to identify task stragglers, considering key parameters including job QoS timing constraints, task execution characteristics, and optimal system resource utilization. We study and demonstrate the effectiveness of our algorithm by simulating a number of different operational scenarios based on real production cluster data against state-of-the-art solutions. Results demonstrate that our approach is capable of creating 58.62% fewer replicas under high resource utilization while reducing response time by up to 17.86% for idle periods compared to a static threshold.

KW - Quality of service

KW - Timing

KW - Heuristic algorithms

KW - Cloud computing

KW - Time factors

KW - Resource management

KW - Production

U2 - 10.1109/AINA.2016.84

DO - 10.1109/AINA.2016.84

M3 - Conference contribution/Paper

SP - 414

EP - 421

BT - 2016 IEEE 30th International Conference on Advanced Information Networking and Applications (AINA)

PB - IEEE

ER -