Horus - Research Portal | Lancaster University

Electronic data

ICA3PP - Horus - Yeung (Accepted)
Accepted author manuscript, 882 KB, PDF document
Available under license: CC BY: Creative Commons Attribution 4.0 International License

Text available via DOI:

https://doi.org/10.1007/978-3-030-60239-0_33
Final published version

Keywords

Machine Learning Systems, Performance Interference, Deep Learning, GPU Scheduling, Cluster resource management

Horus: An Interference-aware Resource Manager for Deep Learning Systems

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Published

Standard

Horus: An Interference-aware Resource Manager for Deep Learning Systems. / Yeung, Gingfung ; Borowiec, Damian; Yang, Renyu et al.
Algorithms and Architectures for Parallel Processing. ICA3PP 2020. ed. / M. Qiu. Springer, 2020. p. 492-508 (Lecture Notes in Computer Science; Vol. 12453).

Harvard

Yeung, G, Borowiec, D, Yang, R, Friday, A, Harper, RHR & Garraghan, P 2020, Horus: An Interference-aware Resource Manager for Deep Learning Systems. in M Qiu (ed.), Algorithms and Architectures for Parallel Processing. ICA3PP 2020. Lecture Notes in Computer Science, vol. 12453, Springer, pp. 492-508. https://doi.org/10.1007/978-3-030-60239-0_33

APA

Yeung, G., Borowiec, D., Yang, R., Friday, A., Harper, R. H. R., & Garraghan, P. (2020). Horus: An Interference-aware Resource Manager for Deep Learning Systems. In M. Qiu (Ed.), Algorithms and Architectures for Parallel Processing. ICA3PP 2020 (pp. 492-508). (Lecture Notes in Computer Science; Vol. 12453). Springer. https://doi.org/10.1007/978-3-030-60239-0_33

Vancouver

Yeung G, Borowiec D, Yang R, Friday A, Harper RHR, Garraghan P. Horus: An Interference-aware Resource Manager for Deep Learning Systems. In Qiu M, editor, Algorithms and Architectures for Parallel Processing. ICA3PP 2020. Springer. 2020. p. 492-508. (Lecture Notes in Computer Science). doi: 10.1007/978-3-030-60239-0_33

Author

Yeung, Gingfung; Borowiec, Damian; Yang, Renyu et al. / Horus: An Interference-aware Resource Manager for Deep Learning Systems. Algorithms and Architectures for Parallel Processing. ICA3PP 2020. editor / M. Qiu. Springer, 2020. pp. 492-508 (Lecture Notes in Computer Science).

Bibtex

@inproceedings{424f183a402a49379d52cda3496cfa8b,

title = "Horus: An Interference-aware Resource Manager for Deep Learning Systems",

abstract = "Deep Learning (DL) models are deployed as jobs within machines containing GPUs. These DL systems - ranging from a single GPU device to machine clusters - require state-of-the-art resource management to increase resource utilization and job throughput. While it has been identified that co-location - multiple jobs co-located within the same GPU - is an effective means to achieve this, such co-location incurs performance interference that directly debilitates DL training and inference performance. Existing approaches to mitigate interference require resource-intensive and time-consuming kernel profiling ill-suited for runtime scheduling decisions. Current DL system resource managers are not designed to deal with these problems. This paper proposes Horus, an interference-aware resource manager for DL systems. Instead of leveraging expensive kernel profiling, our approach estimates job resource utilization and co-location patterns to determine effective DL job placement to minimize the likelihood of interference, as well as improve system resource utilization and makespan. Our analysis shows that interference causes up to 3.2x DL job slowdown. We integrated our approach within the Kubernetes resource manager, and conducted experiments in a DL cluster by training 2,500 DL jobs using 13 different model types. Results demonstrate that Horus is able to outperform other DL resource managers by up to 61.5% for resource utilization and 33.6% for makespan.",

keywords = "Machine Learning Systems, Performance Interference, Deep Learning, GPU Scheduling, Cluster resource management",

author = "Gingfung Yeung and Damian Borowiec and Renyu Yang and Adrian Friday and R.H.R. Harper and Peter Garraghan",

year = "2020",

month = sep,

day = "29",

doi = "10.1007/978-3-030-60239-0_33",

language = "English",

isbn = "9783030602383",

series = "Lecture Notes in Computer Science",

publisher = "Springer",

pages = "492--508",

editor = "M. Qiu",

booktitle = "Algorithms and Architectures for Parallel Processing. ICA3PP 2020",

}

RIS

TY - GEN

T1 - Horus

T2 - An Interference-aware Resource Manager for Deep Learning Systems

AU - Yeung, Gingfung

AU - Borowiec, Damian

AU - Yang, Renyu

AU - Friday, Adrian

AU - Harper, R.H.R.

AU - Garraghan, Peter

PY - 2020/9/29

Y1 - 2020/9/29

N2 - Deep Learning (DL) models are deployed as jobs within machines containing GPUs. These DL systems - ranging from a single GPU device to machine clusters - require state-of-the-art resource management to increase resource utilization and job throughput. While it has been identified that co-location - multiple jobs co-located within the same GPU - is an effective means to achieve this, such co-location incurs performance interference that directly debilitates DL training and inference performance. Existing approaches to mitigate interference require resource-intensive and time-consuming kernel profiling ill-suited for runtime scheduling decisions. Current DL system resource managers are not designed to deal with these problems. This paper proposes Horus, an interference-aware resource manager for DL systems. Instead of leveraging expensive kernel profiling, our approach estimates job resource utilization and co-location patterns to determine effective DL job placement to minimize the likelihood of interference, as well as improve system resource utilization and makespan. Our analysis shows that interference causes up to 3.2x DL job slowdown. We integrated our approach within the Kubernetes resource manager, and conducted experiments in a DL cluster by training 2,500 DL jobs using 13 different model types. Results demonstrate that Horus is able to outperform other DL resource managers by up to 61.5% for resource utilization and 33.6% for makespan.

AB - Deep Learning (DL) models are deployed as jobs within machines containing GPUs. These DL systems - ranging from a single GPU device to machine clusters - require state-of-the-art resource management to increase resource utilization and job throughput. While it has been identified that co-location - multiple jobs co-located within the same GPU - is an effective means to achieve this, such co-location incurs performance interference that directly debilitates DL training and inference performance. Existing approaches to mitigate interference require resource-intensive and time-consuming kernel profiling ill-suited for runtime scheduling decisions. Current DL system resource managers are not designed to deal with these problems. This paper proposes Horus, an interference-aware resource manager for DL systems. Instead of leveraging expensive kernel profiling, our approach estimates job resource utilization and co-location patterns to determine effective DL job placement to minimize the likelihood of interference, as well as improve system resource utilization and makespan. Our analysis shows that interference causes up to 3.2x DL job slowdown. We integrated our approach within the Kubernetes resource manager, and conducted experiments in a DL cluster by training 2,500 DL jobs using 13 different model types. Results demonstrate that Horus is able to outperform other DL resource managers by up to 61.5% for resource utilization and 33.6% for makespan.

KW - Machine Learning Systems

KW - Performance Interference

KW - Deep Learning

KW - GPU Scheduling

KW - Cluster resource management

U2 - 10.1007/978-3-030-60239-0_33

DO - 10.1007/978-3-030-60239-0_33

M3 - Conference contribution/Paper

SN - 9783030602383

T3 - Lecture Notes in Computer Science

SP - 492

EP - 508

BT - Algorithms and Architectures for Parallel Processing. ICA3PP 2020

A2 - Qiu, M.

PB - Springer

ER -
