Technical Report Results

Technical Report TR686:
A Composable Runtime Recovery Policy Framework Supporting Resilient HPC Applications

Joshua Hursey & Andrew Lumsdaine
(Aug 2010), 12 pages
Abstract:
An HPC application must be resilient to sustain itself in the event of process loss due to the high probability of hardware failure on modern HPC systems. These applications rely on resilient runtime environments to provide various resiliency strategies to sustain the application across failures. This paper presents the ErrMgr recovery policy framework, which gives applications the ability to compose a set of policies at runtime. This framework is implemented in the Open MPI runtime environment and currently includes three policy options: run-through stabilization, automatic process recovery, and preemptive process migration. The first option supports continuing research into fault-tolerant MPI semantics, while the latter two provide transparent, checkpoint/restart-style recovery. We discuss the impact on recovery policy performance for various stable storage strategies including staging, caching, and compression. Results indicate that some stable storage configurations designed to provide low overhead during failure-free execution often negatively impact recovery performance.

Available as: