An approach for modeling and ranking node-level stragglers in cloud datacenters

Electronic data

An Approach for Modeling and Ranking Node-level Stragglers in Cloud Datacenters
Rights statement: © 2016, IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.
Accepted author manuscript, 784 KB, PDF document
Available under license: CC BY: Creative Commons Attribution 4.0 International License

Text available via DOI:

https://doi.org/10.1109/SCC.2016.93
Final published version

Keywords

Servers, Production, Data models, Computational modeling, Analytical models, Time factors, Calculators

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Published

Standard

An approach for modeling and ranking node-level stragglers in cloud datacenters. / Ouyang, Xue; Garraghan, Peter; Wang, Changjian et al.
2016 IEEE International Conference on Services Computing (SCC). IEEE, 2016. p. 673-680.

Harvard

Ouyang, X, Garraghan, P, Wang, C, Townend, P & Xu, J 2016, An approach for modeling and ranking node-level stragglers in cloud datacenters. in 2016 IEEE International Conference on Services Computing (SCC). IEEE, pp. 673-680. https://doi.org/10.1109/SCC.2016.93

APA

Ouyang, X., Garraghan, P., Wang, C., Townend, P., & Xu, J. (2016). An approach for modeling and ranking node-level stragglers in cloud datacenters. In 2016 IEEE International Conference on Services Computing (SCC) (pp. 673-680). IEEE. https://doi.org/10.1109/SCC.2016.93

Vancouver

Ouyang X, Garraghan P, Wang C, Townend P, Xu J. An approach for modeling and ranking node-level stragglers in cloud datacenters. In 2016 IEEE International Conference on Services Computing (SCC). IEEE. 2016. p. 673-680. doi: 10.1109/SCC.2016.93

Author

Ouyang, Xue; Garraghan, Peter; Wang, Changjian et al. / An approach for modeling and ranking node-level stragglers in cloud datacenters. 2016 IEEE International Conference on Services Computing (SCC). IEEE, 2016. pp. 673-680

Bibtex

@inproceedings{98c33c2c0dca47f8959e3bc7eb2eb573,

title = "An approach for modeling and ranking node-level stragglers in cloud datacenters",

abstract = "The ability of servers to effectively execute tasks within Cloud datacenters varies due to heterogeneous CPU and memory capacities, resource contention situations, network configurations and operational age. Unexpectedly slow server nodes (node-level stragglers) result in assigned tasks becoming task-level stragglers, which dramatically impede parallel job execution. However, it is currently unknown how slow nodes directly correlate to task straggler manifestation. To address this knowledge gap, we propose a method for node performance modeling and ranking in Cloud datacenters based on analyzing parallel job execution tracelog data. By using a production Cloud system as a case study, we demonstrate how node execution performance is driven by temporal changes in node operation as opposed to node hardware capacity. Different sample sets have been filtered in order to evaluate the generality of our framework, and the analytic results demonstrate that node abilities of executing parallel tasks tend to follow a 3-parameter-loglogistic distribution. Further statistical attribute values such as confidence interval, quantile value, extreme case possibility, etc. can also be used for ranking and identifying potential straggler nodes within the cluster. We exploit a graph-based algorithm for partitioning server nodes into five levels, with 0.83% of node-level stragglers identified. Our work lays the foundation towards enhancing scheduling algorithms by avoiding slow nodes, reducing task straggler occurrence, and improving parallel job performance.",

keywords = "Servers, Production, Data models, Computational modeling, Analytical models, Time factors, Calculators",

author = "Xue Ouyang and Peter Garraghan and Changjian Wang and Paul Townend and Jie Xu",

note = "{\textcopyright} 2016, IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.",

year = "2016",

month = sep,

day = "1",

doi = "10.1109/SCC.2016.93",

language = "English",

pages = "673--680",

booktitle = "2016 IEEE International Conference on Services Computing (SCC)",

publisher = "IEEE",

}

RIS

TY - GEN

T1 - An approach for modeling and ranking node-level stragglers in cloud datacenters

AU - Ouyang, Xue

AU - Garraghan, Peter

AU - Wang, Changjian

AU - Townend, Paul

AU - Xu, Jie

N1 - © 2016, IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.

PY - 2016/9/1

Y1 - 2016/9/1

AB - The ability of servers to effectively execute tasks within Cloud datacenters varies due to heterogeneous CPU and memory capacities, resource contention situations, network configurations and operational age. Unexpectedly slow server nodes (node-level stragglers) result in assigned tasks becoming task-level stragglers, which dramatically impede parallel job execution. However, it is currently unknown how slow nodes directly correlate to task straggler manifestation. To address this knowledge gap, we propose a method for node performance modeling and ranking in Cloud datacenters based on analyzing parallel job execution tracelog data. By using a production Cloud system as a case study, we demonstrate how node execution performance is driven by temporal changes in node operation as opposed to node hardware capacity. Different sample sets have been filtered in order to evaluate the generality of our framework, and the analytic results demonstrate that node abilities of executing parallel tasks tend to follow a 3-parameter-loglogistic distribution. Further statistical attribute values such as confidence interval, quantile value, extreme case possibility, etc. can also be used for ranking and identifying potential straggler nodes within the cluster. We exploit a graph-based algorithm for partitioning server nodes into five levels, with 0.83% of node-level stragglers identified. Our work lays the foundation towards enhancing scheduling algorithms by avoiding slow nodes, reducing task straggler occurrence, and improving parallel job performance.

KW - Servers

KW - Production

KW - Data models

KW - Computational modeling

KW - Analytical models

KW - Time factors

KW - Calculators

U2 - 10.1109/SCC.2016.93

DO - 10.1109/SCC.2016.93

M3 - Conference contribution/Paper

SP - 673

EP - 680

BT - 2016 IEEE International Conference on Services Computing (SCC)

PB - IEEE

ER -
