Next: 7. Device Management Up: An Architecture for Parallel Previous: 5. Memory Management

Subsections

6. Process Management

Like memory management, DSI provides complete process management. We begin this chapter with a description of the key events in a suspension's life-cycle: process creation, activation and termination. This provides a basis for a discussion of higher level process management issues: synchronization, scheduling, controlling parallelism, and speculative computation.

6.1 The Process Life Cycle

Traditionally, process creation is associated with both the allocation of process state and the manipulation of scheduling structures within the kernel. In DSI, process creation and activation are treated as two fundamentally separate events. Process creation is just another form of allocation that is nearly as fine-grained as cell-allocation (suspensions are, in fact, a sequence of four cells).

All suspensions are created with the suspend instruction, which initializes suspension records in the heap. The created suspension inherits some of its fields from the registers of the context window of the currently running process; the new suspension record can be swapped directly in and out of a context window to instantiate the process. Most suspensions require a convergence context to be useful: a cell in the heap containing the suspension reference that is the location for storing the manifest result of computing the suspended computation. A process executing a typical suspending construction code sequence will execute one or more suspend instructions interspersed with new instructions in building a data structure.

A suspension is activated in one of two ways:

It can be coerced by a probe; in which case the system will schedule it implicitly, or
it can be explicitly scheduled by a suspension-aware concurrency primitive in the language.

The first situation occurs when a running process traverses a structure containing suspensions. By default, structure accesses are ultimately handled at the DSI virtual machine level by conditional load instructions (e.g. hdc and tlc). These instructions are sensitive to suspension references (see chapter 3), generating a special signal if the result of the load would be a suspension reference. This signal causes a trap to the kernel, which suspends the probing suspension and schedules the target suspension for execution. The second situation occurs only for a special class of primitives which explicitly manipulate suspensions. This typically includes concurrency constructs in the surface language.

A scheduled suspension is considered an active process. The kernel resumes an active suspension by loading it into an open context window and transferring control to that window. The suspension resumes execution at the control point in the K register. A suspension in this state is a running process. Note that there may be any number of active processes on a given node, some of which are swapped into context windows. There is only one running process per node, namely, the one in a context window that has control of the processor.

The end result of a suspension's execution is almost invariably to compute a value which replaces its own reference in the heap, an operation known as converging to a value. If the suspension is not referenced from another location this causes the suspension to un-reference itself; its state will be recovered at the next garbage collection. Converging is normally accomplished by a sting (store) operation followed by a converge system call, which signals a process' intention to terminate. The kernel responds by swapping the suspension out of its context window and selecting another active suspension to run.

Sometimes, a suspension may store a value that extends the structure of the convergence context, and embeds itself farther down in the structure. For example, the code for an iterating stream process might append items to the tail of a list and then resuspend itself at the new end of the stream. This resuspend behavior is accomplished with the detach operation, a system call which is similar to converge, but has the effect of resetting the control register K such that if and when the suspension is reactivated it resumes execution at the specified control point. Thus, detach is essentially a call to relinquish the processor.

6.2 Interprocess Synchronization

**Figure 14:** Process Dependences
(a) Actual dependences	(b) dependence graph	(c) Dependence stack

We illustrate interprocess synchronization in DSI with an example. Consider the situation depicted in figure 14a. Suspension Ais activated as a running process; during the course of its computation it probes suspension B. A dynamic dependence is established between Aand B; process Aneeds suspension B's result. This causes suspension Bto be activated, in accordance with demand-driven scheduling. When process Bconverges or detaches with a result, A's demand is satisfied and it can be restarted. Clearly, we need some method of synchronizing execution between Aand B, since Acannot make progress until Bconverges with a result. Our suspending construction model requires that this synchronization be performed transparently by the system on behalf of processes.

How we implement this dependence synchronization can have a dramatic effect on performance. Our choices fall into two broad categories: polling and blocking synchronization. The difference between these two methods is primarily in where the responsibility lies for rescheduling A. In a polling implementation, the system either periodically wakes up Ato reattempt its probe, or checks B's convergence cell on behalf of Aat some interval, rescheduling Awhen a manifest value has appeared. With blocking synchronization, Ais blocked indefinitely (removed from the schedule) and the act of rescheduling Ais associated with B's convergence.

Both synchronization methods have their strengths and weaknesses:

1.: Polling does not require any external information about the dependences between processes to be maintained, since that information is encoded in the heap sharing relationships and in the code of the various processes themselves. Blocking synchronization requires that dependences be dynamically recorded and associated with each process so that blocked suspensions can be explicitly rescheduled when their dependences are satisfied.
2.: Polling does not require any special action or communication when a process converges; eventually the polling mechanism will succeed in restarting A. Blocking synchronization requires that the suspensions blocked on a process be explicitly rescheduled when that process converges.
3.: For polling, the waiting interval is critical, since the act of polling wastes processor time that could be devoted to other useful processes (we do not want to busy-wait). The interval between polls should be spent executing any other ready processes on the node. In some cases it may be necessary for the processor to artificially extend its polling interval with idle periods, since flagrant busy waiting could have deleterious effects on the processor network [LD88,PCYL87]. This might be the case, for example, early on in the computation before the program has established many parallel tasks.
4.: Synchronization latency refers to the lag between the time that a suspension has converged and the time that a process depending on it is actually resumed. This can occur under both synchronization methods, but is only a concern for polling. If the polling processor is idle during the latency period (see item 3, above) then that time has been wasted. The concern here is for the aggregate time wasted over the course of a program's execution due to synchronization latency. Blocking synchronization does not suffer this problem because blocked tasks are immediately rescheduled, so an idle processor would take it up immediately.

Given 1 and 2, we would prefer to poll, provided we could select an ideal polling interval (see item 3, above) for each blocked process, such that negligible time and resources are spent on the actual polling. Unfortunately, that is a difficult problem, since it all depends on the length and termination of individual process' computations. The wide range of possible values makes even a heuristic choice difficult (even using a dynamically changing interval). Our experience has shown that explicit synchronization is very effective and worth the extra overhead involved in maintaining dependences. However, polling's advantage in (1) is important and we shall return to the synchronization issue later in section 6.7.

6.3 Tracking Dependences

Let us return for a moment to our example. Suppose that in the course of execution, process Bprobes suspension C; Cprobes D, and so on. The result is that a dynamic chain of dependences is established between suspensions. This dependence chain can be visualized as a dependence graph as shown in figure 14b. The dependence graph shows the temporal data dependences between a subset of suspensions in the heap at a given point in time.

Because of the lazy nature of suspending construction, dependence chains can grow hundreds or thousands of processes deep. Assuming that we are going to do explicit synchronization, we need to track these dependences with an actual structure that mirrors, but reverses, the dynamically unfolding chain of dependences. We depict our reverse dependence graph like a normal dependence graph, only with the arrows reversed, as in figure 14c. Using our dynamically constructed reverse dependence graph we can easily determine which processes should be resumed when a given process converges. The main problem with this approach is maintaining an accurate mapping between our reverse dependence graph (an actual structure that the system manipulates) with the dynamically changing state of actual dependences in the system. As we shall see, this assumption can be problematic, especially in the presence of speculative computation (section 6.7).

6.3.1 Distributed Demand-Driven Scheduling

Our first logical choice for implementing a reverse dependence graph is a stack, with elements pointing to suspensions in the dependence chain. The top of the stack points to the currently executing process. Every probe that results in a suspension forces a stack push, so that the top of the stack always points at the fringe of the expanding computation, and backward links record the dependences between suspensions resulting from a series of probes. When the current process converges this dependence stack is popped, and the suspension underneath is rescheduled.

Our dependence stack then, is a layer of scheduling infrastructure laid out over the computation as it unfolds. Probes result in pushes, and convergence results in pops; we assume that the process at the top of the stack is the next ready process in that dependence stack.

Dependence stacks are the basic scheduling handles used by the system. When a probe results in a suspension the kernel pushes the dependence stack (as described above) and then examines the suspension reference. If the suspension is local it can proceed to swap it in and execute it. If it is a remote suspension, the kernel sends a scheduling message to the remote processor containing a pointer to the dependence stack, and moves on to examine messages in its own scheduling queues (see section 4.3). When the remote processor gets around to processing this message, it loads the suspension reference from the top of the stack and proceeds to swap it in and execute it. If that suspension probes yet another suspension the same scenario recurs. When a process converges, the kernel pops the dependence stack and checks the suspension reference at the top. If it is a local reference, the suspension is swapped in and executed. If it is nonlocal, the dependence stack is sent to the remote processor and the current processor looks for other stacks in its message queues.

What we have just described is the basic mechanism for distributed demand-driven scheduling in DSI. The fringe of the computation corresponding to a particular dependence chain shifts from processor to processor to take advantage of locality of execution for whichever suspension is currently pushing the fringe. The cost of this scheduling communication is discussed in chapter 4; it is considerably less than the cost of context swapping over the network, which would involve the loading and storing of suspension contexts, perhaps multiple times, plus any additional overhead due to non-local stack manipulation, etc. For details refer to section 5.2.3.

6.4 Creating Parallelism

As we have seen, demand-driven scheduling leads to extending and collapsing existing dependence chains; at any given time only the suspension at the top of a dependence stack is active, the others are blocked on dependences. If there is only one dependence chain then only one processor would be busy at a time, executing the suspension at the top of the sole dependence stack. Thus, any parallelism in the system is due to multiple dependence chains (and correspondingly, dependence stacks). Each dependence stack represents a point on the fringe of the computation tree that is not currently dependent on any other process.

It might seem logical to create and assign a dependence stack to each processor; if each processor is busy handling a stack the entire machine is utilized. Actually, we want multiple pending stacks for each processor. The reason is due to our requirement that suspensions execute locally; if a processor passes its current dependence stack to another node we do not want it to be idle. Rather, it should take up any dependence stacks that have been passed to it from other processors. Given a balanced distribution of suspensions and a small backlog of waiting dependence stacks, any processor should always have work to do. Kranz [DAKM89] briefly discusses the merits this kind of task backlog.

6.5 Static vs. Dynamic Parallelism

We can classify parallelism in our system under two broad categories. Static parallelism is due to pre-existing multiple chains of demand. An example of static parallelism would be multiple initial output devices, arising from a system of multiple output windows or a multi-user/multi-terminal system. One advantage of static parallelism is that there is no possibility of overloading the machine with too much parallelism; the amount of parallelism never increases, and can be initially matched to the size and capabilities of the machine.

Dynamic parallelism is created in response to explicit scheduling requests by concurrent primitives. These requests are the source of branching in the dependence graph. Each branch forms a new dependence chain; the amount and type of branching is determined by the primitive, and possibly by the number of suspensions in its argument. For example, consider a hypothetical primitive, pcoerce, that implements a form of parallel coercion, scheduling each suspension in its argument list and returning when they have all converged.

Concurrent primitives are the primary source of parallelism in Daisy. This is consistent with the underlying philosophy of the language; namely, that list construction produces suspended processes, and primitives coerce those processes is various ways (including in parallel). See section 2.3 for details.

6.6 Controlling Parallelism

Dynamic parallelism introduces the problem of controlling excessive parallelism. Once the machine is saturated with processes (i.e. is fully utilized, with a sufficient backlog) additional parallelism is not beneficial, and may even be harmful. Too many concurrent tasks result in greater peak memory usage then if the processes were coerced sequentially. Furthermore, additional processes result in more context switching, reducing overall throughput. Thus some method of constraining parallelism is required so as not to overwhelm the machine with active processes. This mechanism must be dynamic, since the amount and type of parallelism all depend on the program and data.

One technique to constraining parallelism is to tie process creation (i.e. granularity) to the load of the machine [GM84,Moh90]. For explicitly eager parallel languages (e.g. futures), this may be possible. For Daisy, the laziness of suspensions is a semantic requirement; there is no way to randomly ``inline'' a suspension without changing the semantics of the language^6.1. Note that for lazy tasks, it is not the actual creation of tasks that leads to excess parallelism but rather the excess scheduling of tasks (i.e. by primitives). However, that there are other valid reasons to reduce granularity. This is discussed further in section 6.9.

With this in mind, we might consider tying process scheduling to current load conditions. For example, if the load is heavy, pcoerce might only schedule some of its suspensions in parallel, and coerce the others serially. However, since suspensions can only run locally, this would require pcoerce to access the load conditions of all the processors on which it might need to schedule suspensions. This would mean an unacceptable amount of overhead, in addition to placing the burden of these decisions on the surface language. Instead, DSI follows the philosophy of total process management. Language primitives can make any number of scheduling requests; the kernel is responsible for throttling parallelism to avoid the danger of excessive parallelism.

DSI's approach to controlling parallelism is based on our load-sensitive distributed allocation policy outlined in section 5.2.2. Since suspensions only execute locally, the load-sensitive allocation automatically limits the number of new processes created on a given processor by other processors. This in turn limits the possible number of scheduling requests made to a processor by other processors for those suspensions. Thus the only unlimited parallelism that a processor needs to worry about comes from local scheduling requests; i.e. made by local processes. This suggests a two-tiered approach for differentiating between local and remote scheduling requests. Each processor maintains one queue for incoming remote scheduling requests and a separate area for local scheduling requests. The two structures have distinct scheduling priorities. If a processor always favors its incoming scheduling queue (requests from other processors to execute local suspensions) then it automatically favors the status quo of current parallelism load on the machine, and thus discourages parallel expansion occurring from its local scheduling area. The allocation block size and allocation server responsiveness (see sections 5.2.1 and 5.2.2) indirectly determine the current load for the reasons outlined above. These factors affect the number of possible outstanding remote requests for a processor's local suspensions, thus controlling the backlog of multiple outstanding dependence stacks.

If there are no outstanding scheduling requests in a processor's incoming queue it obtains a scheduling request (dependence stack) from the local area. If this process should probe a remote suspension, it will be blocked on the dependence stack and a scheduling request will be sent to the remote processor. Note that this effectively raises the blocked task into a higher priority level, since it will be rescheduled in the sending processor's incoming queue after the probed suspension converges on the remote processor. Thus the absence of work in a processor's incoming queue results in that processor augmenting the level of parallelism in the overall system by expanding tasks in its local area.

The type of parallel expansion occurring in this way depends on whether we use a stack or a queue for the local scheduling area. A queue results in a breadth-first expansion; a stack results in a depth-first expansion. Superficially, a queue might seem to be the better choice. Breadth-first expansion expresses parallelism at higher levels in the process decomposition, which is generally preferable. Depth-first expansion limits the exponential growth of parallel decomposition. Note, however, that parallel expansion due to our distributed allocation/load-balancing scheme also affords breadth-first parallelism^6.2, and in fact, this is exactly the mechanism that is used to distribute parallelism evenly. Our remote queue represents the desired parallelism load; our local scheduling area is excess parallelism. With this in mind, we choose a LIFO approach for the local scheduling queue, which further throttles parallel expansion.

The priorities of the incoming queue and stack are easily handled by DSI's signal mechanism; these structures are implemented by the prioritized interprocessor message queues described in section 4.3. This mechanism insures a timely interruption of the processor should a remote scheduling request become available.

6.7 Conservative vs. Speculative Parallelism

Dynamic parallelism can be further classified into two types. Our hypothetical pcoerce primitive is a form of conservative parallelism [Jon87], also called mandatory computation [Osb90]. In conservative parallelism the results of the parallel computations spawned are known to be needed (in the case of pcoerce because that's what the primitive requires).

In contrast to conservative parallelism is speculative parallelism [Jon87] (also called speculative computation [Osb90]). The results of speculative processes are not known to be needed. Speculative processes are usually scheduled based on some probability of need. For example, consider a speculative conditional

that schedules its then-part and else-part in parallel with the evaluation of the predicate, on the idea that it will make some headway on the result regardless of the outcome of the test. There are a number of useful applications of speculative computation; section 2.3 contains several examples. Osborne [Osb90] provides a partial taxonomy for classifying speculative computation types.

6.7.1 Managing Speculative Computation

Speculative computation offers additional sources of parallelism over what may be available from conservative parallelism alone, but at the same time it introduces significant resource management problems:

1.: Speculative processes should ideally only use idle resources; if there is relevant work to be done (mandatory computation), then those conservative processes should get priority.
2.: Speculative computation can exacerbate the problem of excessive parallelism, outlined in section 6.5; thus constraining parallelism takes on added importance.
3.: Speculative processes can become useless during the course of execution; it then becomes necessary to find a way to remove these tasks and their descendants from the system so that they don't continue to use up resources that could be devoted to other processes.

We will describe how these issues are addressed by iterative refinements to our scheduling model for conservative parallelism.

The first problem, that conservative processes take priority over speculative processes, is easily satisfied by having each processor maintain a separate, lower priority queue for speculative processes. If a speculative process probes or schedules any other suspensions they are also scheduled to the appropriate processor's speculative queue. Thus, once a task has been scheduled speculatively, neither it nor any of its descendants should be scheduled to a conservative queue. At first, this approach does not seem to provide for contagion [Osb90]; namely, if sharing relationships are such that a conservative process comes to depend on a speculative task (e.g. by probing it after it has been scheduled speculatively elsewhere, like our speculative conditional) the speculative process should be upgraded to conservative, and any tasks it depends on should be upgraded as well. We will address this issue in our discussion on sharing in section 6.8.

Our second concern, that speculative parallelism be controlled, is satisfied by the same techniques we used in controlling conservative parallelism: using a stack to schedule local suspensions. Since we are segregating speculative and conservative processes, we add a local, speculative stack in addition to our speculative queue described above. The stack works in exactly the same way as the conservative stack described in section 6.6; local speculative suspensions are pushed on the stack, and remote speculative suspensions are appended to the speculative queue on the remote processor. The speculative stack has lower priority than the speculative queue, which favors existing parallelism over new parallelism. Thus, we have two queues and two stacks per processor, which are serviced in the following priority (highest to lowest):

1.: Conservative queue.
2.: Conservative stack.
3.: Speculative queue.
4.: Speculative stack.

The implementation of these structures are handled by DSI's multi-queue cascading-priority communication mechanism presented in chapter 4.

**Figure 15:** A Speculative Conditional

Our third concern, removing useless tasks, is more complicated. To illustrate the issue, consider a speculative conditional which produces a process decomposition as shown in figure 15. The then-part and else-part are speculative computations that have been scheduled in parallel with the evaluation of the predicate. Both speculative processes have started demand chains that are pushing the fringe of the computation forward; both have also initiated additional parallelism that schedules additional processes into the system. Once predicate has been evaluated, one of the dependence graphs emanating from the two speculative processes has suddenly become relevant (conservative) and the other irrelevant (useless). For example, if our predicate evaluates to false, we would like to remove processes G, J, L, Mand Nfrom the system as soon as possible. Presumably only L, Mand Nare actually scheduled, so their removal will suffice to remove any useless processes (G, J) blocked on them, under our dependence stack model.

Ideally, the system should spend as little effort as possible on useless task removal; any effort spent in doing so is chalked up to pure overhead that offsets any gains made in profitable speculation. The approaches used by current parallel symbolic systems generally fall into two categories: explicit killing of tasks [GP81,Hem85] and garbage collection of tasks [BH77,Mil87].

The first approach, explicit killing of tasks, relies on the kernel being able to determine when a task has become useless. At that point the kernel can spawn a termination task, which begins at the root of the speculation process subtree and recursively kills tasks. Alternatively, the kernel can cut the useless processes out of the schedule directly. This can be difficult to do if a distributed scheduling structure is used, as in DSI. Both methods require the kernel to be able to identify all descendants of the useless task and determine whether they are useless or not; because of sharing, some of them may not be useless and should be spared.

The garbage collection approach involves modifying the garbage collector to remove useless tasks from schedules, blocking queues, and other scheduling infrastructure. This approach assumes that useless tasks have become unreferenced from the computation graph. In order to keep them from being retained by scheduling references, weak pointers or weak cons cells [Mil87] must be used to build scheduling infrastructure containing pointers to tasks so that they aren't retained unnecessarily. A potential problem with this approach is that useless tasks continue to remain in the system until garbage collection [Osb90]. One way to prevent this is through the use of priorities, especially if priorities are already being used to give conservative tasks preference. When a task is discovered to be useless, priorities can be propagated down through the speculative process subgraph to downgrade the status of all descendant tasks. As with the killing approach, the kernel must be able to distinguish the useful from the useless as it descends the subtree; this may require a sophisticated priority combining scheme [Osb90].

6.7.2 Demand Coefficients

The methods described above are both rather active; that is, the kernel must pro-actively detect useless tasks and attempt to restructure the schedule to mirror the changing state of actual dependences occurring in the program. As we have discussed, this is difficult thing to do, and the sophistication of the approaches presented above attest to that fact. DSI uses a third, more passive, approach to speculative task removal; it controls the extent of speculative computation through the use of bounded computation. The idea is fairly simple; instead of scheduling speculative tasks like conservative tasks and then trying to remove them from a distributed schedule, only schedule them for a bounded amount of computation. This bound is given by a demand coefficient embedded in the suspension itself, which directly controls the amount of processing resources (time, memory allocation, etc.) it receives. If a process exhausts its demand coefficient before converging the owning processor removes it from the schedule. In order to keep it going, the scheduling process must coax it again; a process may need to be coaxed a number of times before it finally converges.

This approach supports a number of speculative models [Osb90]. Simple precomputing speculation is accomplished by a single coax. This type of speculation simply channels some effort into executing a suspension on the chance that it's result will be needed later. Multiple-approach speculation is more common, and refers to scheduling a number of tasks where not all of them are necessary. Constructs falling into this category are our speculative if (see above), AND-parallelism, OR-parallelism, and many others (see section 2.3). These primitives operate on the principle of strobing: coaxing a number of suspensions repeatedly until one converges. This may be enough for the scheduling task to produce a result; if not, the task may resume strobing until another converges, and so on. The kernel provides routines to handle this strobing on behalf of primitives^6.3 this simplifies the construction of speculative primitives, and allows the kernel to optimize the strobing behavior.

There are a number of advantages in this approach to managing speculative tasks:

No effort is required on behalf of the kernel to identify and remove useless tasks. Speculative tasks are scheduled as such from the outset; from then on the process and any that it probes are scheduled in speculative scheduling queues. Speculative tasks will eventually drop out of the distributed schedule once the strobing process stops coaxing them.
Demand coefficients provide a natural way to prioritize speculative tasks. A process's demand coefficient directly correlates to the amount of processing resources it receives; a higher coefficient will result in more attention.

In order for bounded computation to be effective, our system must be able to extend a process's bound to other processes that it probes or schedules indirectly. To do this we modify our demand-driven scheduling scheme to propagate demand coefficients and observe the limits of a process' computation bound. This essentially quantifies the demand propagating through the computation graph.

In implementation terms, this entails the transfer of coefficients along the fringe of the computation; i.e. at the top of the process dependence chains. When a dependence stack is extended, the kernel transfers some or all of the scheduling process' coefficient to the target suspension. If the target process converges, the kernel transfers the remaining portion of the coefficient back to the process being resumed. If a process exhausts its coefficient before converging, the kernel unwinds the dependence stack until it discovers a suspension with a positive demand coefficient, and schedules it on the owning processor.

The size and transfer characteristics of demand coefficients must be carefully determined, in much the same way as allocation blocksize is for distributed memory allocation. If set too small, then coaxed computation does not get very far, per coax, and coaxing overhead begins to overtake speculative processing; If set too large, useless tasks will remain in the system for longer periods, although they only compete with other speculative tasks. A strict requirement for the transfer function is that coefficients should monotonically decrease over the course of several transfers; otherwise a chain of speculative demand might live on indefinitely.

6.8 Sharing and Cycles

**Figure 16:** Heap Sharing

What happens if a suspension is probed by more than one process (e.g. due to list sharing)? Since we only execute suspensions locally, there is no contention due to multiple processors trying to execute the same suspension. Each processor wanting to schedule the suspension will send a scheduling request to the processor which ``owns'' the suspension. This can result in several references to the same suspension in a processor's scheduling area. Note that the references may be spread across all four of the scheduling structures (conservative queue, conservative stack, speculative queue, speculative stack), according to all the places where the suspension was scheduled and how it was scheduled. This handles the case where a suspension is scheduled speculatively by one process and later is probed by a conservative process. In this case the speculative process should be upgraded to conservative status, a situation Osborne calls contagion [Osb90]. The first time the process will be scheduled to a speculative queue (or stack), but the second probe will result in the process also being queued on a conservative queue (or stack). Since the conservative queue (stack) has priority over the speculative queue (stack) the process is automatically upgraded. Note that a suspension's demand coefficient reflects sharing relationships; if several processes coax the same suspension its demand coefficient will be larger than if a single process coaxed it. This also effectively upgrades its priority relative to other tasks.

6.8.1 Embedding Dependences

It might seem awkward to maintain dependence information outside of the process context. In other words, to handling sharing, why not simply add a link to each suspension on which you could enqueue other suspensions that are blocked on it. Then, when a process converges, the system could simply reschedule all suspensions in the wait queue. This is the approach used by Multilisp [RHH85] and most other implementations. There are several reasons why DSI does not do this. First, speculative computation renders all dependences to be potentially speculative. A given demand chain may be the result of a speculative computation. If we record all dependences in the process state we may retain pointers from needed processes to speculative processes that have since become useless. The problems with this approach and the solutions taken by other systems are outlined above.

The problem with speculative processes points out how dependences are a temporal phenomenon, which is the reason it is problematic for the system to accurately dynamically track the dependence relationships between processes. There are other reasons besides speculation that might result in a dependence changing without the system's awareness. For example, a suspension might be moved to a new location by another process, or a location containing a suspension might be overwritten by another process^6.4. Therefore it is somewhat of a philosophical decision not to embed temporal dependence information in suspensions themselves, since it may later prove inaccurate. The actual state of dependences is recorded in the suspension state already, in the form of their actual pointer values and code that they are executing. Thus the truest form of demand-driven scheduling would be to poll repeatedly from the root of the entire computation tree, allowing the dependences to ``shake out'' naturally. Unfortunately, this approach is not feasible on stock hardware, and would result in many suspensions being swapped in repeatedly only to block on their dependences and be swapped out again. DSI's current approach using dependence stacks is a compromise; it prevents the most common form of polling inefficiency, while still providing a method (demand coefficients) for utilizing the programs actual dependences to handle speculation.

6.8.2 Cycles

Cycles can arise in our schedule due to circular dependences. A process that ultimately depends only on itself represents a logical programming error. DSI makes no attempt to recover these, but a solution might be possible using monotonically decreasing demand coefficients.

6.9 Increasing Granularity

The lazy nature of suspending construction coupled with a fine process granularity often combine to create potentially large dependence chains. These chains eat up valuable space and introduce stack overhead that cuts into useful processing time. One of the interesting open questions of this research is how to reduce the size of these chains.

One way to increase process granularity is by replacing invocations of suspend (especially for trivial suspensions) to inlined code or recursive function calls where appropriate. This helps, but not overly so because of the high level of optimization already performed for suspension processing. Informal tests indicate that a function call is only slightly faster than locally coercing a suspension. How much faster depends on how many registers need to be saved on the stack to perform a strict function call; because suspensions are so small, a context switch can be faster than a function call that has to perform several allocations to save and restore registers. Also, many suspensions cannot be eliminated or inlined and still retain proper laziness. Compilation, including strictness and dependence analysis can help to improve granularity. Demand coefficients may also provide a way to implement bounded eagerness; this idea is explored further in chapter 8.

Next: 7. Device Management Up: An Architecture for Parallel Previous: 5. Memory Management

Eric Jeschke
1999-07-08