Indiana University Bloomington

School of Informatics and Computing


Computer Science Program








Technical Report TR635:
A Checkpoint and Restart Service Specification for Open MPI

Joshua Hursey, Jeffrey M. Squyres, and Andrew Lumsdaine
(Jul 2006), 8
Abstract:
HPC systems are growing in both complexity and size, increasing the opportunity for system failures. Checkpoint and restart is one of many fault tolerance techniques developed for such adverse runtime conditions. Because a variety of checkpoint and restart approaches are available, HPC system libraries, such as MPI, that seek to incorporate these techniques would benefit greatly from a portable, extensible checkpoint and restart framework. This paper presents a specification for such a framework in Open MPI that allows for the integration of a variety of checkpoint/restart systems and protocols. The modular design of the framework allows researchers to contribute to specialized areas without requiring knowledge of the entire code base.
