Analysing and Reducing Costs of Deep Learning Compiler Auto-tuning

Computing and Communications

Associated organisational unit

MSF PhDs (2018/19)

Electronic data

2023BorowiecPhD
Final published version, 5.61 MB, PDF document
Available under license: CC BY-NC-ND: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License

Text available via DOI:

https://doi.org/10.17635/lancaster/thesis/2049
Final published version

View graph of relations

Research output: Thesis › Doctoral Thesis

Published

Damian Borowiec

More...

Publication date	03/2023
Number of pages	384
Qualification	PhD
Awarding Institution	Lancaster University
Supervisors/Advisors	Garraghan, Peter, Supervisor Harper, Richard, Supervisor
Award date	30/06/2023
Publisher	Lancaster University
<mark>Original language</mark>	English

Abstract

Deep Learning (DL) is significantly impacting many industries, including automotive, retail and medicine, enabling autonomous driving, recommender systems and genomics modelling, amongst other applications. At the same time, demand for complex and fast DL models is continually growing. The most capable models tend to exhibit highest operational costs, primarily due to their large computational resource footprint and inefficient utilisation of computational resources employed by DL systems. In an attempt to tackle these problems, DL compilers and auto-tuners emerged, automating the traditionally manual task of DL model performance optimisation. While auto-tuning improves model inference speed, it is a costly process, which limits its wider adoption within DL deployment pipelines.
The high operational costs associated with DL auto-tuning have multiple causes. During operation, DL auto-tuners explore large search spaces consisting of billions of tensor programs, to propose potential candidates that improve DL model inference latency. Subsequently, DL auto-tuners measure candidate performance in isolation on the target-device, which constitutes the majority of auto-tuning compute-time. Suboptimal candidate proposals, combined with their serial measurement in an isolated target-device lead to prolonged optimisation time and reduced resource availability, ultimately reducing cost-efficiency of the process.
In this thesis, we investigate the reasons behind prolonged DL auto-tuning and
quantify their impact on the optimisation costs, revealing directions for improved DL auto-tuner design. Based on these insights, we propose two complementary systems: Trimmer and DOPpler. Trimmer improves tensor program search efficacy by filtering out poorly performing candidates, and controls end-to-end auto-tuning using cost objectives, monitoring optimisation cost. Simultaneously, DOPpler breaks long-held assumptions about the serial candidate measurements by successfully parallelising them intra-device, with minimal penalty to optimisation quality. Through extensive experimental evaluation of both systems, we demonstrate that they significantly improve cost-efficiency of autotuning (up to 50.5%) across a plethora of tensor operators, DL models, auto-tuners and target-devices.

Research

Associated organisational unit

Electronic data

Text available via DOI: