
Electronic data

  • 1570538287

    Rights statement: ©2019 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

    Accepted author manuscript, 650 KB, PDF document

    Available under license: CC BY-NC: Creative Commons Attribution-NonCommercial 4.0 International License

Links

Text available via DOI:


Auto-tuning MPI Collective Operations on Large-Scale Parallel Systems

Research output: Contribution in Book/Report/Proceedings with ISBN/ISSN › Conference contribution/paper › peer-review

Published
  • Wenxu Zheng
  • Jianbin Fang
  • Juan Chen
  • Xiaodong Pan
  • Hao Wang
  • Chun Huang
  • Xiaole Sun
  • Tao Tang
  • Zheng Wang
Publication date: 3/10/2019
Host publication: The 21st IEEE International Conference on High Performance Computing and Communications
Publisher: IEEE
Pages: 670-677
Number of pages: 8
ISBN (electronic): 9781728120584
ISBN (print): 9781728120591
Original language: English

Abstract

MPI libraries are widely used in high performance computing applications. Yet, effective tuning of MPI collectives on large parallel systems remains an outstanding challenge. This process often follows a trial-and-error approach and requires expert insight into the subtle interactions between software and the underlying hardware. This paper presents an empirical approach to choosing and switching MPI communication algorithms at runtime to optimize application performance. We achieve this by first building an offline model, through microbenchmarks, of how the runtime parameters and message sizes affect the choice of MPI communication algorithm. We then apply this knowledge to automatically optimize new, unseen MPI programs. We evaluate our approach by applying it to the NPB and HPCC benchmarks on a 384-node cluster of the Tianhe-2 supercomputer. Experimental results show that our approach achieves, on average, a 22.7% (up to 40.7%) improvement over the default setting.
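To illustrate the kind of offline microbenchmarking the abstract describes, the sketch below times an MPI collective across a sweep of message sizes, the sort of measurement one could feed into a model of how message size affects collective performance. This is a minimal illustration, not the authors' framework: the choice of MPI_Allreduce, the message sizes, and the iteration count are assumptions made for the example.

```c
/* Minimal microbenchmark sketch (not the paper's framework): times
 * MPI_Allreduce across a range of message sizes. Message sizes and
 * iteration counts are illustrative. Build with: mpicc -O2 bench.c -o bench */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 100;                 /* timed repetitions per size */
    const size_t max_bytes = 1 << 20;      /* up to 1 MiB per message    */

    double *sendbuf = malloc(max_bytes);
    double *recvbuf = malloc(max_bytes);
    for (size_t i = 0; i < max_bytes / sizeof(double); ++i)
        sendbuf[i] = 1.0;

    /* Sweep message sizes in powers of two and report mean latency. */
    for (size_t bytes = sizeof(double); bytes <= max_bytes; bytes <<= 1) {
        int count = (int)(bytes / sizeof(double));

        MPI_Barrier(MPI_COMM_WORLD);       /* align ranks before timing */
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; ++i)
            MPI_Allreduce(sendbuf, recvbuf, count, MPI_DOUBLE,
                          MPI_SUM, MPI_COMM_WORLD);
        double elapsed = MPI_Wtime() - t0;

        /* Use the slowest rank's time as the collective's cost, then
         * have rank 0 print one line per message size. */
        double max_elapsed;
        MPI_Reduce(&elapsed, &max_elapsed, 1, MPI_DOUBLE,
                   MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("%10zu bytes  %12.3f us/call\n",
                   bytes, 1e6 * max_elapsed / iters);
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

In practice, such per-size timings would be collected for each candidate communication algorithm (e.g. via an MPI implementation's algorithm-selection controls), and the resulting table would drive the runtime choice of algorithm for a given message size.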
