Technical Report TR563:
Reliability in LAM/MPI Requirements Specification

Andrew Lumsdaine, Jeffrey M. Squyres, and Brian Barrett
(Jun 2002), 29 pages
Abstract:
This document describes the software requirements necessary to allow a parallel software application running on top of LAM/MPI to detect and recover from a catastrophic fault such as a compute node crash. The requirements include:

- Definition and categorization of failures to be handled by a reliable LAM/MPI application,

- The behavioral (implementation) and interface requirements for LAM/MPI to provide reliable execution capabilities, and

- The development of a preliminary design interface between LAM/MPI and an application wishing to recover from such an error.
