Proactive Interference-aware Resource Management in Deep Learning Training Cluster

Computing and Communications

Electronic data

2022yeungphd
Final published version, 3.06 MB, PDF document

Text available via DOI:

https://doi.org/10.17635/lancaster/thesis/1673
Final published version


Research output: Thesis › Doctoral Thesis

Published

Ging-Fung Yeung


Publication date	2022
Number of pages	219
Qualification	PhD
Awarding Institution	Lancaster University
Supervisors/Advisors	Garraghan, Peter (Supervisor); Friday, Adrian (Supervisor)
Publisher	Lancaster University
Original language	English

Abstract

Deep Learning (DL) applications are growing at an unprecedented rate across many domains, ranging from weather prediction and map navigation to medical imaging. However, training these deep learning models in large-scale compute clusters faces substantial challenges in terms of low cluster resource utilisation and high job waiting time. State-of-the-art DL cluster resource managers are needed to increase GPU utilisation and maximise throughput. While co-locating DL jobs within the same GPU has been shown to be an effective means of achieving this, co-location incurs performance interference, resulting in job slowdown.

We argue that effective workload placement can minimise DL cluster interference at scheduling runtime by understanding the DL workload characteristics and their respective hardware resource consumption. However, existing DL cluster resource managers reserve isolated GPUs to perform online profiling to directly measure GPU utilisation and kernel patterns for each unique submitted job. Such a feedback-based reactive approach results in additional waiting times as well as reduced cluster resource efficiency and availability.

In this thesis, we propose Horus: an interference-aware and prediction-based DL cluster resource manager. Through empirically studying a series of microbenchmarks and DL workload co-location combinations across heterogeneous GPU hardware, we demonstrate the negative effects of performance interference when co-locating DL workloads, and identify GPU utilisation as a general proxy metric to determine good placement decisions. From these findings, we design Horus, which, in contrast to existing approaches, proactively predicts the GPU utilisation of heterogeneous DL workloads, extrapolated from DL model computation graph features, when performing placement decisions, removing the need for online profiling and isolated reserved GPUs. By conducting empirical experimentation within a medium-scale DL cluster as well as a large-scale trace-driven simulation of a production system, we demonstrate that Horus improves cluster GPU utilisation, reduces cluster makespan and waiting time, and can scale to operate within hundreds of machines.
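
As a rough illustration of how a predicted GPU utilisation figure could drive a co-location decision, the sketch below places each job on the GPU with the lowest summed predicted utilisation that still stays under a cap. It is not the Horus algorithm: the `Gpu` class, the `place` function, the additive load model and the `cap` threshold are all hypothetical, and the utilisation values are assumed to come from some external predictor.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Gpu:
    """Hypothetical GPU record; not a structure taken from the thesis."""
    name: str
    # Predicted utilisation (0.0 to 1.0) of each job already placed here.
    job_utils: List[float] = field(default_factory=list)

    @property
    def load(self) -> float:
        # Assumption: co-located utilisations simply add up.
        return sum(self.job_utils)


def place(gpus: List[Gpu], predicted_util: float, cap: float = 1.0) -> Optional[Gpu]:
    """Place a job on the least-loaded GPU that stays within `cap`.

    Returns None when no GPU can take the job, meaning it should wait.
    """
    candidates = [g for g in gpus if g.load + predicted_util <= cap]
    if not candidates:
        return None
    best = min(candidates, key=lambda g: g.load)
    best.job_utils.append(predicted_util)
    return best


if __name__ == "__main__":
    cluster = [Gpu("gpu0", [0.6]), Gpu("gpu1", [0.3])]
    chosen = place(cluster, predicted_util=0.5)
    print(chosen.name if chosen else "queued")
```

Because the decision only needs a predicted number per job, no GPU has to be reserved for profiling before placement, which is the property the abstract emphasises.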
