Felipe Bertrand
febertra@cs.indiana.edu
Randall Bramley
bramley@cs.indiana.edu
Connecting remotely distributed components is an increasingly important problem in scientific computing. This problem occurs when components require special site licenses [], require access to unique data resources such as databases, sensors, or special purpose compute platforms, or when multidisciplinary applications will be composed from single discipline codes under active development by distributed teams [1,2]. An model of interoperating distributed components is likely to be used in major new efforts for modeling magnetically confined fusion energy in the U.S., Europe, and Japan [].
Remote method invocation is routinely used for connecting commercial components [,], but scientific computing brings in a new type of complexity: the individual components are often parallel and designed for distributed memory use with a SPMD [3] programming model. The additional complexity occurs because the numbers of processes may vary across the interacting components: an FFT component may be optimal when using a power-of-two number of processes, or a component representing a sensor array may have its number of processes determined by the number of sensors currently alive. Even when the components are isocardinal (have the same numbers of processes) sometimes only a subset of the processes need to communicate with other components. For example, a large partial differential equation solver may only need processes containing discretized boundary nodes to read data from a sensor array providing boundary conditions. Requiring all processes to participate in communications when only a subset actually exchange data is well-known to prevent scalability of an application.
This paper presents a Parallel-Remote Method Invocation (PRMI) model which generalizes the collective invocations of SPMD programming. The challenge of this generalization is double: first to introduce distribution (calling and callee methods are in separate executables), and second to cope with the discrepancy in the number of processes of calling and callee components (the ``MxN'' problem).
The novelty of this approach to PRMI is that we derive our model from the collective SPMD non-distributed model, rather than the serial RMI distributed model. As a result, the parallel-remote call is not made between components en masse. Instead, each process in the calling component communicates with a peer process in the callee component directly. The bottom line is that in the default case, where the two components communicating have the same amount of processes, the PRMI call defaults to the semantics of a regular SPMD collective call.
There are two issues that depart from the semantics of regular collective calls. The first one is the consideration of ``parallel arguments'', which are component-wise data structures that are spread over all the processes. One of the advantages of considering parallel arguments explicitly is that the redistribution can be done at the same time as the data transfer. The second issue, the ``MxN'' problem, is the possible mismatch in the number of processes of the communicating components.
Because the most common implementation of the SPMD programming model uses the Message Passing Interface (MPI) standard, we use the terminology of MPI in this paper. In particular, some familiarity is assumed with the concepts of MPI communicators, collective call semantics, and the rank of a process as a unique integer ranging from 0 to one fewer than the number of processes within the MPI communicator.
In the isocardinal case with
, the semantics of the PRMI call defaults to
that of a regular collective call. This means that each process of the
calling component communicates directly with a peer component in the
callee component. The framework must also guarantee that in subsequent
calls, the processes will be matched with the same peers. To
enforce this policy, processed are paired so that peers have the same
componentwise rank.
Arguments are passed directly from the calling processes to their peers. This means that each formal argument and return value will have a potentially different actual value in each pair of calling/callee processes.
Sometimes as in regular parallel libraries, callee components will require that some formal arguments have the same actual value across all the processes. However this condition is never enforced by the framework, and it is left up to the component developer to assure this at run time if desired.
In MPI, a communicator argument is used to define the set of processes that participate in any one collective call. In PRMI, we also use communicators for that same purpose. To accommodate the fact that in PRMI the set of calling processes is different from the set of processes receiving the call, the communicator argument that is passed to the receiving component is changed by the framework to reflect the set of processes at that side of the communication. However, note that the number of processes on either side of the call is the same. In particular, if only a subset of the processes in the calling component are participating in the call, only that same number of processes will be woken in the callee component to service the call (and they will be paired one-to-one with the calling processes).
The case
is also straightforward, because there are enough
processes in the receiving side to be paired with each calling
process. The only detail particular to this configuration is that some
of the processes in the receiving side (those with ranks
)
will never be woken by any call because they can never be paired with
any calling process.
Finally, we have the case in which
. This case is problematic
because there might be some processes in the calling component that
cannot be paired with any process in the callee component. In
particular, this will be the case for all processes in the calling
component with ranks
.
There are two problematic cases in
. The first one happens when
some of the calling processes (but not all) cannot be paired because
their ranks don't fall within the range of the ranks in the receiving
component. In this case, the behavior is to silently discard the call
originating in these unpaired processes. In particular, the regular
arguments will be lost (but not the parallel arguments, as we
will see later).
The second pathological case happens when none of the processes in the
originating component can be paired because their ranks are all
. In this case, if we silently discard all unpaired calls,
the callee component will not even realize that a call has been
made. We consider this case to be unacceptable.
We have two options to solve this problem. One is to perform the
matching between processes that do not have the same rank. For
example, if process with rank
is making a call, and there is no
equivalent rank on the other side, but the process with rank
is
available, then match the calling process
with the callee process
. The problem with this solution is that by the pigeonhole
principle, the framework cannot guarantee that each calling process
will be paired with the same callee process in subsequent calls. We
consider that this property is important because it is inherited from
the SPMD/collective call paradigm to which we want to adhere as much
as possible. For that reason, we have chosen to go for the second
option, which is to raise an exception at the caller whenever this
situation happens.
To avoid unexpected exceptions, the programmer can either check the
number of processes in the other side in advance (see section
4), or can follow the simple practice of always involving
process
in PRMI calls (process
is guaranteed to be paired). In
order to allow for maximum flexibility and to adapt to any style of
programming, the framework does not force any policy on this matter.
An immediate issue that follows from the discussion above is how to deal with the limitation that only as many callee processes are awaken as calling processes made the call. Indeed, if we have two steps of a computation that require a different amount of resources (e.g. two processors the first step and one hundred the second), we want all the processes allocated for the second step to be used.
In order to make this possible, the programmer can awaken all the processes in the callee component via a special API (see section 4). Processes that are awaken in this way are not paired, and all of their arguments are set to a default value of 0. The API also allows the application to check if a given process is paired or not. Return values and ``out'' arguments of unpaired processes at the callee side are discarded (except for parallel arguments).
Parallel data structures are spread over some or all the processes of a component. When a component makes a parallel-remote call and passes a parallel data structure as an argument, the data must be redistributed to adapt to the number of processes and processor layout of the other end.
In contrast with the regular collective call paradigm, this proposal offers explicit support for parallel arguments. The reason is that in the distributed case, the framework can perform the data transfer from caller to callee and the redistribution at the same time, thus avoiding unnecessary data movement.
Parallel arguments are not passed directly. In order to allow the receiving side to specify the data distribution layout before the data transfer is performed, the framework handles the callee processes a handler to the data, rather than the data itself. Using a special API, the application can set the receiving distribution layout and then trigger the transfer of the data.
int prmi_get_remote_nprocs(Port *port, int *nprocs); int prmi_awaken_all(Port *port); int prmi_paired(Port *port, bool *paired); int prmi_component_comm(MPI_Comm *comm); int prmi_paired_comm(MPI_Comm *comm); int pami_pa_set_layout(pa_handler *handler, pa_layout *layout); int pami_pa_recv(pa_handler *handler); int pami_pa_irecv(pa_handler *handler); int pami_pa_test(pa_handler *handler); int pami_pa_wait(pa_handler *handler);
Returns the number of processes in the remote component.
int prmi_get_remote_nprocs(Port *port, int *nprocs);
Awakens all the processes in the receiving side.
int prmi_awaken_all(Port *port);
Tells if the current process is paired with a remote process. Unpaired processes will have their arguments and return values discarded.
int prmi_paired(Port *port, bool *paired);
Returns a communicator grouping all the processes in the component.
int prmi_component_comm(MPI_Comm *comm);
Returns a communicator grouping all the processes that are paired.
int prmi_paired_comm(MPI_Comm *comm);
Sets the layout of a parallel argument before it is received.
int prmi_pa_set_layout(pa_handler *handler, pa_layout *layout);
Triggers the reception of a parallel argument.
int prmi_pa_recv(pa_handler *handler);
Triggers the non-blocking reception of a parallel argument.
int prmi_pa_irecv(pa_handler *handler);
Tests if the non-blocking reception of a parallel argument has been completed.
int prmi_pa_test(pa_handler *handler);
Waits for the completion of a non-blocking reception of a parallel argument.
int prmi_pa_wait(pa_handler *handler);
This document was generated using the LaTeX2HTML translator Version 2002-2 (1.70)
Copyright © 1993, 1994, 1995, 1996,
Nikos Drakos,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999,
Ross Moore,
Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -split 0 -dir /home/febertra/src/libPRMI/doc/writeup prmi.ltx
The translation was initiated by on 2004-06-23