
Smart multi-task scheduling for OpenCL programs on CPU/GPU heterogeneous platforms

Research output: Contribution in Book/Report/Proceedings with ISBN/ISSN > Conference contribution/Paper (peer-reviewed)

Published
Publication date: 2014
Host publication: 21st Annual IEEE International Conference on High Performance Computing (HiPC 2014)
Publisher: IEEE
Number of pages: 10
ISBN (print): 9781479959754
Original language: English
Event: 21st Annual IEEE International Conference on High Performance Computing (HiPC 2014), India
Duration: 17/12/2014 – 20/12/2014

Conference

Conference: 21st Annual IEEE International Conference on High Performance Computing (HiPC 2014)
Country/Territory: India
Period: 17/12/14 – 20/12/14

Abstract

Heterogeneous systems consisting of multiple CPUs and GPUs are increasingly attractive as platforms for high performance computing. Such platforms are usually programmed using OpenCL, which provides program portability by allowing the same program to execute on different types of device. As these systems become more mainstream, they will move from application-dedicated devices to platforms that must support multiple concurrent user applications. This creates a need to determine when and where to map different applications so as to best utilize the available heterogeneous hardware resources. In this paper, we present an efficient OpenCL task scheduling scheme that schedules multiple kernels from multiple programs on CPU/GPU heterogeneous platforms. It does this by determining at runtime which kernels are likely to best utilize a device. We show that speedup is a good scheduling priority function and develop a novel model that predicts a kernel's speedup based on its static code structure. Our scheduler uses this prediction, together with the runtime input data size, to prioritize and schedule tasks. This technique is applied to a large set of concurrent OpenCL kernels. We evaluated our approach for system throughput and average turnaround time against competitive techniques on two different platforms: a Core i7/NVIDIA GTX 590 and a Core i7/AMD Tahiti 7970. For system throughput, we achieve, on average, a 1.21x and 1.25x improvement over the best competitors on the NVIDIA and AMD platforms respectively. Our approach reduces turnaround time, on average, by at least 1.5x and 1.2x on the NVIDIA and AMD platforms respectively, compared to alternative approaches.
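To illustrate the idea of speedup-prioritized scheduling described above, the following is a minimal sketch, not the authors' implementation: it assumes a hypothetical linear predictor (the feature names and `WEIGHTS` values are invented for illustration) that maps a kernel's static code features and runtime input size to a predicted GPU-over-CPU speedup, then ranks kernels by that prediction and routes those expected to benefit from the GPU to a GPU queue and the rest to a CPU queue.

```python
# Sketch of speedup-based multi-task scheduling for a CPU/GPU platform.
# The predictor below is an illustrative assumption, not the paper's model.
from dataclasses import dataclass

@dataclass
class Kernel:
    name: str
    features: dict   # static code features, e.g. compute/memory-op counts
    input_size: int  # runtime input data size (elements)

# Hypothetical per-feature weights: compute-heavy kernels are assumed to
# favor the GPU; memory- and branch-heavy kernels favor the CPU.
WEIGHTS = {"compute_ops": 0.002, "mem_ops": -0.001, "branches": -0.004}

def predict_speedup(k: Kernel) -> float:
    """Predict GPU-over-CPU speedup from static features and input size."""
    score = sum(WEIGHTS.get(f, 0.0) * v for f, v in k.features.items())
    # Larger inputs amplify the predicted effect; clamp to a positive floor.
    return max(0.1, 1.0 + score * k.input_size ** 0.5)

def schedule(kernels):
    """Rank kernels by predicted speedup, then split across device queues."""
    ranked = sorted(kernels, key=predict_speedup, reverse=True)
    gpu_queue = [k for k in ranked if predict_speedup(k) >= 1.0]
    cpu_queue = [k for k in ranked if predict_speedup(k) < 1.0]
    return gpu_queue, cpu_queue
```

A compute-bound kernel with a large input would be ranked first and sent to the GPU queue, while a small, branch-heavy kernel would fall below the speedup threshold and run on the CPU, which matches the scheduling intuition the abstract describes.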