Technical Report TR731:
Udayanga Wickramasinghe, Luke D'Alessandro, Andrew Lumsdaine, Ezra Kissel, Martin Swany, Ryan Newton
Evaluating Collectives in Networks of Multicore/Two-level Reduction
(May 2017), 15
[This is an extended version of a paper under submission (not accepted yet)]
As clusters of multicore nodes become the standard platform for HPC, programmers are adopting approaches that combine multicore programming (e.g., OpenMP) for on-node parallelism with MPI for inter-node parallelism, the so-called “MPI+X” model. In important use cases, such as reductions, this hybrid approach can necessitate a scalability-limiting sequence of independent parallel operations, one for each paradigm. For example, MPI+OpenMP typically performs a global parallel reduction by first performing a local OpenMP reduction, followed by an MPI reduction across the nodes. If the local reductions are not well balanced, which can happen in irregular or dynamic adaptive applications, the scalability of the overall reduction operation becomes limited. In this paper, we study the empirical and theoretical impact of imbalanced reductions on two different execution models: MPI+X and AMT (Asynchronous Many Tasking), with MPI+OpenMP and HPX-5 as concrete instances of these respective models. We explore several approaches to maximizing asynchrony with MPI+OpenMP, including OpenMP tasking, as well as the MPI-only case, which detaches X altogether. We study the effects of imbalanced reductions for microbenchmarks and for the Lulesh mini-app. Despite maximizing MPI+OpenMP asynchrony, we find that as scale and noise increase, the scalability of the MPI+X model is significantly reduced compared to the AMT model.
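The two-stage pattern described in the abstract can be illustrated with a minimal sketch. It is not taken from the report: it emulates the local (OpenMP-style) and inter-node (MPI-style) stages in plain Python, and every function name here is hypothetical. The timing function is a deliberately simple toy model that assumes the inter-node stage cannot complete before the slowest node has finished its local stage.

```python
from functools import reduce
import operator


def local_reduce(thread_values, op=operator.add):
    """Stage 1: combine the values held by the threads of one node."""
    return reduce(op, thread_values)


def global_reduce(node_partials, op=operator.add):
    """Stage 2: combine one partial result per node."""
    return reduce(op, node_partials)


def two_level_reduce(values_by_node, op=operator.add):
    """Run the local stage on every node, then the inter-node stage."""
    partials = [local_reduce(values, op) for values in values_by_node]
    return global_reduce(partials, op)


def two_level_finish_time(thread_times_by_node, inter_node_cost):
    """Toy timing model (an assumption, not the report's analysis).

    A node's local stage ends when its slowest thread is done, and the
    inter-node stage cannot complete before the slowest node contributes.
    """
    local_done = [max(times) for times in thread_times_by_node]
    return max(local_done) + inter_node_cost


# Two nodes with four threads each (illustrative values only).
values = [[1, 2, 3, 4], [5, 6, 7, 8]]
total = two_level_reduce(values)

balanced = two_level_finish_time([[1.0] * 4, [1.0] * 4], inter_node_cost=0.5)
imbalanced = two_level_finish_time(
    [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 3.0]], inter_node_cost=0.5
)
```

In this toy model a single slow thread on one node delays the completion of the whole reduction, which is the kind of imbalance the abstract refers to.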
- Available as: