Performance-aware Speculative Resource Oversubscription for Large-scale Clusters

Computing and Communications

Associated organisational units

Electronic data

tpds2020-rose
Rights statement: ©2020 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
Accepted author manuscript, 3.51 MB, PDF document
Available under license: CC BY-NC: Creative Commons Attribution-NonCommercial 4.0 International License

Text available via DOI:

https://doi.org/10.1109/TPDS.2020.2970013
Final published version

Keywords

Resource scheduling, Oversubscription, Cluster utilization, Resource throttling, QoS

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Standard

Performance-aware Speculative Resource Oversubscription for Large-scale Clusters. / Yang, Renyu; Sun, Xiaoyang; Hu, Chunming et al.
In: IEEE Transactions on Parallel and Distributed Systems, Vol. 31, No. 7, 28.01.2020, p. 1499-1517.

Harvard

Yang, R, Sun, X, Hu, C, Garraghan, P, Wo, T, Wen, Z, Peng, H, Xu, J & Li, C 2020, 'Performance-aware Speculative Resource Oversubscription for Large-scale Clusters', IEEE Transactions on Parallel and Distributed Systems, vol. 31, no. 7, pp. 1499-1517. https://doi.org/10.1109/TPDS.2020.2970013

APA

Yang, R., Sun, X., Hu, C., Garraghan, P., Wo, T., Wen, Z., Peng, H., Xu, J., & Li, C. (2020). Performance-aware Speculative Resource Oversubscription for Large-scale Clusters. IEEE Transactions on Parallel and Distributed Systems, 31(7), 1499-1517. https://doi.org/10.1109/TPDS.2020.2970013

Vancouver

Yang R, Sun X, Hu C, Garraghan P, Wo T, Wen Z et al. Performance-aware Speculative Resource Oversubscription for Large-scale Clusters. IEEE Transactions on Parallel and Distributed Systems. 2020 Jan 28;31(7):1499-1517. doi: 10.1109/TPDS.2020.2970013

Author

Yang, Renyu ; Sun, Xiaoyang ; Hu, Chunming et al. / Performance-aware Speculative Resource Oversubscription for Large-scale Clusters. In: IEEE Transactions on Parallel and Distributed Systems. 2020 ; Vol. 31, No. 7. pp. 1499-1517.

Bibtex

@article{877e6f882b954a11a763062e83be3dbe,

title = "Performance-aware Speculative Resource Oversubscription for Large-scale Clusters",

abstract = "It is a long-standing challenge to achieve a high degree of resource utilization in cluster scheduling. Resource oversubscription has become a common practice in improving resource utilization and cost reduction. However, current centralized approaches to oversubscription suffer from the issue with resource mismatch and fail to take into account other performance requirements, e.g., tail latency. In this paper we present ROSE, a new resource management platform capable of conducting performance-aware resource oversubscription. ROSE allows latency-sensitive long-running applications (LRAs) to co-exist with computation-intensive batch jobs. Instead of waiting for resource allocation to be confirmed by the centralized scheduler, job managers in ROSE can independently request to launch speculative tasks within specific machines according to their suitability for oversubscription. Node agents of those machines can however avoid any excessive resource oversubscription by means of a mechanism for admission control using multi-resource threshold control and performance-aware resource throttle. Experiments show that in case of mixed co-location of batch jobs and latency-sensitive LRAs, the CPU utilization and the disk utilization can reach 56.34% and 43.49%, respectively, but the 95th percentile of read latency in YCSB workloads only increases by 5.4% against the case of executing the LRAs alone.",

keywords = "Resource scheduling, Oversubscription, Cluster utilization, Resource throttling, QoS",

author = "Renyu Yang and Xiaoyang Sun and Chunming Hu and Peter Garraghan and Tianyu Wo and Zhenyu Wen and Hao Peng and Jie Xu and Chao Li",

note = "{\textcopyright}2020 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE. ",

year = "2020",

month = jan,

day = "28",

doi = "10.1109/TPDS.2020.2970013",

language = "English",

volume = "31",

pages = "1499--1517",

journal = "IEEE Transactions on Parallel and Distributed Systems",

issn = "1045-9219",

publisher = "IEEE Computer Society",

number = "7",

}

RIS

TY - JOUR

T1 - Performance-aware Speculative Resource Oversubscription for Large-scale Clusters

AU - Yang, Renyu

AU - Sun, Xiaoyang

AU - Hu, Chunming

AU - Garraghan, Peter

AU - Wo, Tianyu

AU - Wen, Zhenyu

AU - Peng, Hao

AU - Xu, Jie

AU - Li, Chao

N1 - ©2020 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

PY - 2020/1/28

Y1 - 2020/1/28

N2 - It is a long-standing challenge to achieve a high degree of resource utilization in cluster scheduling. Resource oversubscription has become a common practice in improving resource utilization and cost reduction. However, current centralized approaches to oversubscription suffer from the issue with resource mismatch and fail to take into account other performance requirements, e.g., tail latency. In this paper we present ROSE, a new resource management platform capable of conducting performance-aware resource oversubscription. ROSE allows latency-sensitive long-running applications (LRAs) to co-exist with computation-intensive batch jobs. Instead of waiting for resource allocation to be confirmed by the centralized scheduler, job managers in ROSE can independently request to launch speculative tasks within specific machines according to their suitability for oversubscription. Node agents of those machines can however avoid any excessive resource oversubscription by means of a mechanism for admission control using multi-resource threshold control and performance-aware resource throttle. Experiments show that in case of mixed co-location of batch jobs and latency-sensitive LRAs, the CPU utilization and the disk utilization can reach 56.34% and 43.49%, respectively, but the 95th percentile of read latency in YCSB workloads only increases by 5.4% against the case of executing the LRAs alone.

AB - It is a long-standing challenge to achieve a high degree of resource utilization in cluster scheduling. Resource oversubscription has become a common practice in improving resource utilization and cost reduction. However, current centralized approaches to oversubscription suffer from the issue with resource mismatch and fail to take into account other performance requirements, e.g., tail latency. In this paper we present ROSE, a new resource management platform capable of conducting performance-aware resource oversubscription. ROSE allows latency-sensitive long-running applications (LRAs) to co-exist with computation-intensive batch jobs. Instead of waiting for resource allocation to be confirmed by the centralized scheduler, job managers in ROSE can independently request to launch speculative tasks within specific machines according to their suitability for oversubscription. Node agents of those machines can however avoid any excessive resource oversubscription by means of a mechanism for admission control using multi-resource threshold control and performance-aware resource throttle. Experiments show that in case of mixed co-location of batch jobs and latency-sensitive LRAs, the CPU utilization and the disk utilization can reach 56.34% and 43.49%, respectively, but the 95th percentile of read latency in YCSB workloads only increases by 5.4% against the case of executing the LRAs alone.

KW - Resource scheduling

KW - Oversubscription

KW - Cluster utilization

KW - Resource throttling

KW - QoS

U2 - 10.1109/TPDS.2020.2970013

DO - 10.1109/TPDS.2020.2970013

M3 - Journal article

VL - 31

SP - 1499

EP - 1517

JO - IEEE Transactions on Parallel and Distributed Systems

JF - IEEE Transactions on Parallel and Distributed Systems

SN - 1045-9219

IS - 7

ER -
