Horus - Research Portal | Lancaster University

Associated organisational units

Computing and Communications

Electronic data

TPDS_Horus_ging_fung_yeung
Accepted author manuscript, 4.98 MB, PDF document
Available under license: CC BY-NC: Creative Commons Attribution-NonCommercial 4.0 International License

Text available via DOI:

https://doi.org/10.1109/TPDS.2021.3079202
Final published version
Available under license: CC BY: Creative Commons Attribution 4.0 International License

Keywords

distributed computing, Deep Learning, interference, cloud computing, GPU Scheduling

Horus: Interference-Aware and Prediction-Based Scheduling in Deep Learning Systems

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Article number	21015055
Journal publication date	31/01/2022
Journal	IEEE Transactions on Parallel and Distributed Systems
Issue number	1
Volume	33
Number of pages	13
Pages (from-to)	88-100
Publication Status	Published
Early online date	11/05/2021
Original language	English

Abstract

To accelerate the training of Deep Learning (DL) models, clusters of machines equipped with hardware accelerators such as GPUs are leveraged to reduce execution time. State-of-the-art resource managers are needed to increase GPU utilization and maximize throughput. While co-locating DL jobs on the same GPU has been shown to be effective, this can incur interference that slows jobs down. In this paper, we propose Horus: an interference-aware and prediction-based resource manager for DL systems. Horus proactively predicts GPU utilization of heterogeneous DL jobs extrapolated from the DL model’s computation graph features, removing the need for online profiling and isolated reserved GPUs. Through micro-benchmarks and job co-location combinations across heterogeneous GPU hardware, we identify GPU utilization as a general proxy metric to determine good placement decisions, in contrast to current approaches, which reserve isolated GPUs to perform online profiling and directly measure GPU utilization for each unique submitted job. Our approach promotes high resource utilization and makespan reduction; via real-world experimentation and large-scale trace-driven simulation, we demonstrate that Horus outperforms other DL resource managers by up to 61.5% for GPU resource utilization, 23.7–30.7% for makespan reduction and 68.3% for job wait time reduction.
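
The idea of using predicted GPU utilization as a proxy for co-location decisions can be illustrated with a minimal Python sketch. This is not the scheduling algorithm from the paper: the greedy rule, the 100% utilization cap, the function names place_job and schedule, and the job names and numbers are all assumptions made for illustration. It assumes each job arrives with a utilization estimate already computed, and it treats the sum of co-located estimates as a rough stand-in for interference risk.

```python
from typing import Dict, List, Optional, Tuple


def place_job(gpu_load: Dict[str, float], predicted_util: float,
              cap: float = 100.0) -> Optional[str]:
    """Greedy placement (hypothetical rule, not taken from the paper).

    gpu_load maps a GPU name to the summed predicted utilization (%) of
    the jobs already placed on it. The job goes to the least-loaded GPU
    on which the new sum stays within `cap`; None means it must wait.
    """
    # Keep only GPUs where the summed utilization stays within the cap.
    candidates = [g for g, load in gpu_load.items()
                  if load + predicted_util <= cap]
    if not candidates:
        return None  # no GPU fits; the caller should queue the job
    # Prefer the least-loaded GPU; break ties by name for determinism.
    best = min(candidates, key=lambda g: (gpu_load[g], g))
    gpu_load[best] += predicted_util
    return best


def schedule(jobs: List[Tuple[str, float]],
             gpu_load: Dict[str, float]) -> Dict[str, Optional[str]]:
    """Place (name, predicted_util) jobs in arrival order."""
    return {name: place_job(gpu_load, util) for name, util in jobs}


if __name__ == "__main__":
    # Illustrative numbers only; they do not come from the paper.
    load = {"gpu0": 60.0, "gpu1": 20.0}
    print(schedule([("resnet", 30.0), ("lstm", 45.0), ("bert", 50.0)], load))
```

In this toy run the first two jobs share gpu1 and the third is left unplaced, because adding it to either GPU would push the summed estimate past the cap. A real resource manager would also have to account for memory limits, heterogeneous GPU types and queueing policy, none of which are modelled here.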

Related research outputs

Analysing and Reducing Costs of Deep Learning Compiler Auto-tuning
