Electronic data

  • Tuli et al, START Straggler Prediction, IEEE TSC

    Rights statement: ©2021 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

    Accepted author manuscript, 1.81 MB, PDF document

    Available under license: CC BY: Creative Commons Attribution 4.0 International License

Links

Text available via DOI: https://doi.org/10.1109/TSC.2021.3129897

START: Straggler Prediction and Mitigation for Cloud Computing Environments using Encoder LSTM Networks

Research output: Contribution to Journal/Magazine > Journal article > peer-review

Published

Standard

START: Straggler Prediction and Mitigation for Cloud Computing Environments using Encoder LSTM Networks. / Tuli, Shreshth; Singh Gill, Sukhpal; Garraghan, Peter et al.
In: IEEE Transactions on Services Computing, Vol. 16, No. 1, 31.01.2023, p. 615-627.

Harvard

Tuli, S, Singh Gill, S, Garraghan, P, Buyya, R, Casale, G & Jennings, NR 2023, 'START: Straggler Prediction and Mitigation for Cloud Computing Environments using Encoder LSTM Networks', IEEE Transactions on Services Computing, vol. 16, no. 1, pp. 615-627. https://doi.org/10.1109/TSC.2021.3129897

APA

Tuli, S., Singh Gill, S., Garraghan, P., Buyya, R., Casale, G., & Jennings, N. R. (2023). START: Straggler Prediction and Mitigation for Cloud Computing Environments using Encoder LSTM Networks. IEEE Transactions on Services Computing, 16(1), 615-627. https://doi.org/10.1109/TSC.2021.3129897

Vancouver

Tuli S, Singh Gill S, Garraghan P, Buyya R, Casale G, Jennings NR. START: Straggler Prediction and Mitigation for Cloud Computing Environments using Encoder LSTM Networks. IEEE Transactions on Services Computing. 2023 Jan 31;16(1):615-627. Epub 2021 Nov 23. doi: 10.1109/TSC.2021.3129897

Author

Tuli, Shreshth; Singh Gill, Sukhpal; Garraghan, Peter et al. / START: Straggler Prediction and Mitigation for Cloud Computing Environments using Encoder LSTM Networks. In: IEEE Transactions on Services Computing. 2023; Vol. 16, No. 1. pp. 615-627.

Bibtex

@article{9ba5a85a7b764d0fb5e84fdfcc6237a5,
title = "{START}: Straggler Prediction and Mitigation for Cloud Computing Environments using Encoder {LSTM} Networks",
abstract = "Modern large-scale computing systems distribute jobs into multiple smaller tasks which execute in parallel to accelerate job completion rates and reduce energy consumption. However, a common performance problem in such systems is dealing with straggler tasks that are slow running instances that increase the overall response time. Such tasks can significantly impact the system{\textquoteright}s Quality of Service (QoS) and the Service Level Agreements (SLA). To combat this issue, there is a need for automatic straggler detection and mitigation mechanisms that execute jobs without violating the SLA. Prior work typically builds reactive models that focus first on detection and then mitigation of straggler tasks, which leads to delays. Other works use prediction based proactive mechanisms, but ignore heterogeneous host or volatile task characteristics. In this paper, we propose a Straggler Prediction and Mitigation Technique (START) that is able to predict which tasks might be stragglers and dynamically adapt scheduling to achieve lower response times. Our technique analyzes all tasks and hosts based on compute and network resource consumption using an Encoder Long-Short-Term-Memory (LSTM) network. The output of this network is then used to predict and mitigate expected straggler tasks. This reduces the SLA violation rate and execution time without compromising QoS. Specifically, we use the CloudSim toolkit to simulate START in a cloud environment and compare it with state-of-the-art techniques (IGRU-SD, SGC, Dolly, GRASS, NearestFit and Wrangler) in terms of QoS parameters such as energy consumption, execution time, resource contention, CPU utilization and SLA violation rate. Experiments show that START reduces execution time, resource contention, energy and SLA violations by 13\%, 11\%, 16\% and 19\%, respectively, compared to the state-of-the-art approaches.",
keywords = "Straggler, Deep Learning, Cloud computing, Prediction",
author = "Shreshth Tuli and {Singh Gill}, Sukhpal and Peter Garraghan and Rajkumar Buyya and Giuliano Casale and Jennings, {Nicholas R.}",
note = "{\textcopyright}2021 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.",
year = "2023",
month = jan,
day = "31",
doi = "10.1109/TSC.2021.3129897",
language = "English",
volume = "16",
pages = "615--627",
journal = "IEEE Transactions on Services Computing",
issn = "1939-1374",
publisher = "Institute of Electrical and Electronics Engineers",
number = "1"
}

RIS

TY - JOUR

T1 - START

T2 - Straggler Prediction and Mitigation for Cloud Computing Environments using Encoder LSTM Networks

AU - Tuli, Shreshth

AU - Singh Gill, Sukhpal

AU - Garraghan, Peter

AU - Buyya, Rajkumar

AU - Casale, Giuliano

AU - Jennings, Nicholas R.

N1 - ©2021 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

PY - 2023/1/31

Y1 - 2023/1/31

N2 - Modern large-scale computing systems distribute jobs into multiple smaller tasks which execute in parallel to accelerate job completion rates and reduce energy consumption. However, a common performance problem in such systems is dealing with straggler tasks that are slow running instances that increase the overall response time. Such tasks can significantly impact the system’s Quality of Service (QoS) and the Service Level Agreements (SLA). To combat this issue, there is a need for automatic straggler detection and mitigation mechanisms that execute jobs without violating the SLA. Prior work typically builds reactive models that focus first on detection and then mitigation of straggler tasks, which leads to delays. Other works use prediction based proactive mechanisms, but ignore heterogeneous host or volatile task characteristics. In this paper, we propose a Straggler Prediction and Mitigation Technique (START) that is able to predict which tasks might be stragglers and dynamically adapt scheduling to achieve lower response times. Our technique analyzes all tasks and hosts based on compute and network resource consumption using an Encoder Long-Short-Term-Memory (LSTM) network. The output of this network is then used to predict and mitigate expected straggler tasks. This reduces the SLA violation rate and execution time without compromising QoS. Specifically, we use the CloudSim toolkit to simulate START in a cloud environment and compare it with state-of-the-art techniques (IGRU-SD, SGC, Dolly, GRASS, NearestFit and Wrangler) in terms of QoS parameters such as energy consumption, execution time, resource contention, CPU utilization and SLA violation rate. Experiments show that START reduces execution time, resource contention, energy and SLA violations by 13%, 11%, 16% and 19%, respectively, compared to the state-of-the-art approaches.

AB - Modern large-scale computing systems distribute jobs into multiple smaller tasks which execute in parallel to accelerate job completion rates and reduce energy consumption. However, a common performance problem in such systems is dealing with straggler tasks that are slow running instances that increase the overall response time. Such tasks can significantly impact the system’s Quality of Service (QoS) and the Service Level Agreements (SLA). To combat this issue, there is a need for automatic straggler detection and mitigation mechanisms that execute jobs without violating the SLA. Prior work typically builds reactive models that focus first on detection and then mitigation of straggler tasks, which leads to delays. Other works use prediction based proactive mechanisms, but ignore heterogeneous host or volatile task characteristics. In this paper, we propose a Straggler Prediction and Mitigation Technique (START) that is able to predict which tasks might be stragglers and dynamically adapt scheduling to achieve lower response times. Our technique analyzes all tasks and hosts based on compute and network resource consumption using an Encoder Long-Short-Term-Memory (LSTM) network. The output of this network is then used to predict and mitigate expected straggler tasks. This reduces the SLA violation rate and execution time without compromising QoS. Specifically, we use the CloudSim toolkit to simulate START in a cloud environment and compare it with state-of-the-art techniques (IGRU-SD, SGC, Dolly, GRASS, NearestFit and Wrangler) in terms of QoS parameters such as energy consumption, execution time, resource contention, CPU utilization and SLA violation rate. Experiments show that START reduces execution time, resource contention, energy and SLA violations by 13%, 11%, 16% and 19%, respectively, compared to the state-of-the-art approaches.

KW - Straggler

KW - Deep Learning

KW - Cloud computing

KW - Prediction

U2 - 10.1109/TSC.2021.3129897

DO - 10.1109/TSC.2021.3129897

M3 - Journal article

VL - 16

SP - 615

EP - 627

JO - IEEE Transactions on Services Computing

JF - IEEE Transactions on Services Computing

SN - 1939-1374

IS - 1

ER -