Next: 4. The DSI Kernel Up: An Architecture for Parallel Previous: 2. A Primer

Subsections

3. Architecture of the DSI Machine

This chapter describes the design of DSI's virtual machine, the lowest level component of the DSI hierarchy (see figure 4).

**Figure 4:** DSI Virtual Machine

DSI's virtual machine provides a low-level abstraction of the physical machine architecture (memory interface, processor state, instruction set, interrupt mechanism and processor interconnection topology) that forms a portable foundation upon which the rest of DSI and Daisy is implemented. The virtual machine approach provides several significant benefits:

It provides a modular way to port to new target architectures. The rest of DSI is implemented on top of the virtual machine, and Daisy on top of DSI, so the virtual machine isolates most of the machine-specific implementation dependences. For reasonably similar architectures, only the virtual machine layer needs to be ported. Substantially different architectures may require parts of the microkernel to be rewritten for performance optimization, however.
It allows critical or missing functionality to be implemented as virtual instructions which can be mapped as efficiently as possible to the host architecture. This helps identify key areas in which hardware support could be helpful. For example, cell allocation is such a frequent occurrence during applicative language execution that it is supported by instructions in the virtual machine. This indicates that hardware support for allocation would be extremely helpful for efficient execution of applicative languages. We summarize these areas in section 3.6.
It documents the minimum hardware capabilities and functionality required to implement the system, and delineates (potential) hardware components from software. This provides a sort of high-level blueprint for constructing a machine for a native, self-hosted DSI environment or for implementing DSI directly on a target machine, bypassing the host operating system.

3.1 Memory Interface

's virtual machine specifies an architecture oriented toward list processing, and its memory interface reflects this orientation. The elemental addressable unit of memory is the cell, a 64-bit word that conforms to one of several fixed formats. Cells are classified at the storage management level by a particular configuration of data and pointer fields.

**Figure 5:** Cell Storage Types

Figure 5 shows three storage classifications for cells.

Unary cells contain thirty-two bits of raw data and a reference. The data field can be accessed and interpreted in one of several formats, including signed integer, single-precision floating point or packed bytes.
Binary cells have fields head and tail, each containing a reference.
Stack cells contain a byte of data, a pointer and a reference.

Cells have additional small bit fields used for various purposes such as garbage collection and processor synchronization. These fields are described where applicable. The memory interface is facilitated by processor registers that conform to one or more of these cell views and machine instructions for accessing or allocating cells in those views.

3.1.1 Tags and Types

All scalar data types and compound data structures are represented by cells or linked graphs of cells. Tagged pointers are used to accelerate suspension detection and run-time type checking [Joh90,SH91] . We will hereafter refer to a tagged pointer as a reference or citation; a raw or untagged pointer is simply a pointer. A reference contains three bits of tag information and twenty-four bits of pointer information. A reference's tag implies something about the storage class of the cell to which it is pointing, and vice versa, but tags and storage classes are different entities; the distinction is clarified in chapter 5. Briefly, a tag is a logical typing mechanism; it says that the object to which the reference points is a list, a number, etc. These objects have underlying representations using one or more cells conforming to a particular storage class; e.g. a list reference points to a binary cell. We may refer to a cell's type as either the cell's storage class or its logical (reference tag) type. In the former case the types are unary, binary and stack; in the latter case the types are list, number, and so forth.

3.2 Processor State

Figure 6 depicts the virtual machine state for a single processor node.

**Figure 6:** A Processor Node's Register State

The register file consists of a set of context windows. Each context window holds a single process (suspension) context. The context window pointer () always points to the currently running process. Control is transferred from one resident process to another by directly or indirectly changing the . This method of context switching allows several processes to be multiplexed rapidly on the same node, and provides a mechanism for interrupt handling. The number of context windows (and consequently the maximum number of simultaneously register-resident processes) is implementation dependent^3.1 Many current RISC-type architectures implement some form of register windows (e.g. the Sparc architecture) or have a large register set with renaming capability, both of which are capable of implementing our context window model. Context windows can also be implemented in memory, using register-based pointer indirection for the .

Each context contains four fixed storage class cell registers: , of type unary, , of type stack, , of type binary and , type unary. These registers are further subdivided into smaller addressable fields reflecting the cell views as described above. Registers , , , and are references, is a byte, is a pointer and and are 32-bit raw data words. A context window's state is identical to that of a suspension's, and the kernel swaps suspensions in and out of the context windows as necessary using the ldctx and stctx instructions.

Suspending construction provides the basis for the design of 's process structure. Suspension creation and coercion is the implicit behavior of any code that builds, traverses, or otherwise manipulates data structures. Process granularity is much finer than in conventional languages; accordingly, suspension manipulation is a critical performance issue. This is reflected in 's suspension design. Its small, fixed-size state (32 bytes) can be quickly allocated, initialized, loaded and stored. Conditional probes (e.g. hdc, tlc) provide implicit detection and dispatch. Because these and other suspension manipulation operations (e.g. suspend, ldctx, stctx) are implemented as instructions, they can be optimized at the lowest level of implementation for best performance on the host architecture.

In addition to the context windows, there are several global transient registers that are accessible from any context on the node. These are used as fixed operands by certain instructions and are also used as general-purpose registers for passing data between processes in different context windows. Registers and are references. and are untyped cell registers; they can be accessed in any storage format via the appropriate field name (e.g. WRK.ch1 or WRK.hd). A process can only access the registers in its own context window and the transient registers; the kernel is the only process that has global access to all windows.

The memory subsystem has some private registers that are described in chapter 5. There are also some registers devoted to signal handling that are described below in section 3.4.

3.3 Instruction Set

employs a load/store architecture that is similar in many respects to general-purpose RISC microprocessors. Its instruction set differs primarily in that it is very list (cell) oriented with dedicated instructions for allocation. Instructions operate on registers or immediate constants. There are explicit instructions for reading from and writing to memory in the cell format. Instructions fall into the following general categories:

Load instructions read cells and cell fields from memory into registers. We use the terms fetch and load interchangeably. There are two kinds of loads: conditional loads generate a trap if a suspension reference would be returned to the destination operand, unconditional loads do not. Conditional loads are the basis of suspension transparency and efficient detection, supported right down to the machine level^3.2.
Store instructions write register contents to memory. Like loads, there are conditional and unconditional versions of stores. The conditional versions will fail on any attempt to overwrite a non-suspension reference; this feature can be used to arbitrate interprocessor suspension scheduling (see [FW78d]), although DSI has more generalized primitives (e.g. atomic add/and/or) that can be used for more sophisticated coordination schemes.
Allocation instructions allocate cells from memory and initialize them using register contents in a single operation. This could be considered a special type of store operation. Data creation (i.e. cell allocation) is such a frequent occurrence in applicative systems that it warrants special consideration at the machine instruction level. This is discussed in more detail in section 3.5.1 and chapter 5.
ALU instructions provide simple arithmetic and logical operations on registers. These instructions are similar to those found on any traditional microprocessor.
Signal handling instructions are used to control and manipulate the hardware exception state.

3.4 Signals

exceptions and interrupts are handled under a common signal mechanism. Signals may be generated by hardware interrupts (e.g. timers, devices), instruction traps (e.g. a conditional load) or explicitly by the signal instruction. Some common uses of signals are:

Conditional loads whose result is a suspension reference.
Allocation instructions which invoke garbage collection.
I/O activity associated with a device.
Timer interrupts for context switching.
Bad references or ALU errors.
Tracing and machine-level debugging.
Interprocessor synchronization.

There are 32 possible signals, enumerated and prioritized from lowest (0) to highest (31). Depending on its use, a signal can be sent to a single processor or multicast to all processors. An example of the former case is a localized conditional-load trap occurring on a processor/memory board; an example of the latter is a barrier synchronization signal such as used in garbage collection. The signals are assigned by the kernel and are listed in Table 1 in chapter 4.

DSI uses a process-oriented approach to signal handling, in contrast to the stack-oriented approach used in most imperative systems. The latter usually handle exceptions by pushing state on the currently executing process stack (or a system stack) and invoking an exception-handling procedure. Under DSI, when a signal is delivered to a processor node it causes a context switch via a change of the context-window pointer (CWP) to an alternative resident handler process. The handler might be the kernel or another process (subsystem) designated to handle that signal. The currently executing process is suspended by the context switch, but its state is not side-effected. Neither is the state of the handler process; that is, no continuation is forced upon the handler, it simply resumes at the point where it had left off previously. Thus, the only action of a signal is to context switch to a handler process, which must be designed as an iterative signal handling loop. DSI instructions are designed to be restarted in the event of a signal preemption^3.3, so the preempted process can be restarted (if necessary) by changing the CWP back to the appropriate context window. However, since process state is not side-effected by signals (and assuming none in the surface language), it is not necessary to maintain a stack or other structure to recover from the effects of some permutation of delivered signals. All that is required is a scheduling algorithm that results in the the proper processes being rescheduled, and this is provided by DSI's demand-driven scheduling^3.4. This approach to exception handling fits well with the applicative framework that underpins the rest of DSI.

3.4.1 Signal Routing

Figure 7 shows DSI's signal handling architecture, which works like a state machine.

**Figure 7:** Signal Handling

The signal lookup table maps incoming signals (on- or off-node) with the current value of the CWP to determine the new value of the CWP. This allows the same signal to be mapped to different handlers depending on which context window is active. Arbitration between multiple pending signals is handled by the fixed priorities of the individual signals; the highest numbered unblocked signal has priority. The kernel takes this into account when assigning signals to services.

Delivery of signals is controlled by a pair of 32-bit registers: SIGLATCH and SIGMASK. Each bit in SIGLATCH corresponds to one of the 32 signals; when a signal is delivered to the node it sets the corresponding bit. SIGMASK is used to block action on signals registered in SIGLATCH. If a bit is clear in SIGMASK it inhibits the processing of the corresponding signal in SIGLATCH; i.e. the signal is effectively blocked. This capability can be used to mask interrupts for critical sections, to implement polled device drivers, etc. The signal, tstsig, setsigmask and clrsigmask instructions are used for manipulating the bits in these registers.

3.4.2 Per-process Signals

The preceeding discussion is concerned with signal handling at the system level, on a per-node basis. Signals are a system phenomenon and transitions are specified in terms of context windows, not individual suspensions. It is the kernel's job to micro-manage the context windows to map the signals between the individual processes. In order to handle the signal mappings of thousands of suspensions the kernel has to handle some signals itself and swap suspensions between the finite set of context windows and the heap as necessary. Currently, this is done only for a few signals with fixed, implicit mappings, such as the the conditional load and timer signals. However, it should be possible to generalize this mechanism, and by overloading one or more unused system signals, to carry a variable number of user-defined ``signals'' on a per-suspension basis. This capability would provide the foundation for user-extensible signaling and exception handling. Assuming that a usable construct could be formulated for Daisy, such a mechanism could be used for error recovery, event-driven scheduling, port-based message passing and other systems-level applications.

It is unclear at present whether and how demand-driven system scheduling would conflict with event-driven user scheduling. The incorporation of this kind of primitive would add another source of non-determinism (besides multisets) into the system, but given the underlying implementation of signals it is likely to be ``safe'' non-determinism. This is an interesting area that deserves further consideration.

3.5 Interconnection Topology

DSI assumes a shared memory, MIMD architecture. Aside from that, the virtual machine does not specify a particular processor interconnection topology, primarily because that aspect of the architecture does not figure as prominently in the programming of the machine given those assumptions. However, the host platform must provide a certain core functionality to successfully implement DSI. This includes:

1.: A globally-shared and addressable heap memory. DSI's reduction and communication models rely heavily on shared data structures. Those models are unlikely to be viably implemented on anything other than a physical shared memory machine (i.e. non-message passing). The shared memory can be organized as a NUMA architecture (see section 2.2.1), however. NUMA architectures require special considerations for locality that are not isolated within the virtual machine. These locality issues affect the process (chapter 6) and memory management (chapter 5) subsystems, and are discussed in those chapters.
2.: Symmetric MIMD multiprocessing. DSI assumes that all processors have the same basic configuration and functionality, and are not organized hierarchically. The distributed algorithms used assume a logical enumeration of processor nodes, and a given node must be able to determine its index.
3.: Efficient implementation of atomic 32-bit add, or, load and store operations. Atomic loads and stores are necessary to insure the integrity of references in the heap. The logical, bitwise atomic operations are used for interprocessor synchronization and updates on shared data structures.
4.: The ability to interrupt any processor in a ``timely'' fashion. DSI's signals can be piggybacked on a single user-programmable host exception, provided the overhead of native exception handling is not significantly greater than DSI's own signal handling model implies. A polling approach can also be used, if necessary.

Any parallel architecture that shares these characteristics is a suitable candidate for hosting DSI. This type of architecture (MIMD shared-memory symmetric multiprocessor) is common in emerging multiprocessors.

3.5.1 LiMP

Although the virtual machine does not specify a particular network design, special architectures have been proposed. In a series of papers, Johnson ([Joh81,Joh85,Joh89c]) refines a buffered routing network as the basis for a DSI multiprocessor. Wise describes a layout scheme for implementing such a network in [Wis81]. The network described is a Banyan design (see figure 8) connecting a set of List Processing Elements (LPEs) to a corresponding set of List Storage Elements (LSE's).

**Figure 8:** A *banyan* network

LPE's have roughly the same architecture described for DSI processors in section 3.2 above, but lack multiple context windows and signals. LSE's are intelligent memory banks, participating actively with the network in supporting the LPE's. Johnson [Joh81] classifies memory access instructions into fetch (RSVP), store (Sting) and allocation (NEW) requests, each having a distinct network transaction profile between LPE's and LSE's. RSVPs require a round-trip message from LPE to LSE and back to furnish the result. Stings are one-way requests requiring no acknowledgment. NEWs are handled in unit time by a cell routing arrangement called a NEW-sink, in which LSE's combine with the network switches to route the addresses of free list cells back towards the LPE's. Switches are endowed with buffers for the addresses; whenever a switch transfers an address out of its holding area to its outputs it requests a new one from its inputs. The switches are biased to transfer addresses directly across the network, but will transfer from the alternate input if demand is high. The idea is to preserve locality under normal allocation rates, minimizing contention at the switches for loads and stores, but to disperse high allocation rates across the machine so as to minimize hot spots. Johnson presents simulation results of LiMP in [Joh81].

The fundamental motivation behind the NEW-sink is to move resource management problems from software into hardware. If the goal of the NEW-sink is to handle data distribution then the goal of the Process-sink [Joh89c] is to manage load balancing. The Process-sink is a complementary operation in which LPE's route free suspension blocks through the network to LSE's. If LPE's only multi-task suspensions in their ``local'' space, then idle LPE's will be able to provide more free suspension blocks to the system than busy LPE's, effectively migrating new processes to idle nodes and balancing the load over the system. The behaviors of the NEW-sink and Process-sink are modeled in DSI by a load-based allocation strategy described in chapter 5.

3.5.2 The BBN Butterfly Architecture

DSI's has been implemented on a BBN Butterfly multiprocessor, selected because its shared-memory (NUMA) network architecture is similar to the LiMP architecture described above. The Butterfly differs from LiMP in that it uses a circuit-switched (non-buffered) enhanced banyan network. Each processing node contains a processor, a memory bank and network interface called the PNC (Processor Node Controller). Processors are tightly-coupled to their local memory bank, but all banks are accessible through the switch, as though the graph in figure 8 was folded laterally into a cylinder. Each processor actually has two paths for memory access: to its local memory over a local bus, or through the switch to any other node's memory bank. Each PNC arbitrates access to its memory bank between the processor and the network. Experiments indicate that in the absence of contention switch access to memory is about four times as slow as local access; with contention the switch access memory latency degrades even further. This leads to special considerations for memory management, which we generalize to all NUMA architectures.

3.6 Summary

Although 's instruction set is oriented towards list processing it otherwise provides much of the same functionality as a stock microprocessor, most of which are adequate for emulating a virtual machine. It is encouraging that stock hardware performance is improving so rapidly and carrying with it the performance of symbolic languages. Nevertheless, we could do much better, because there are a number of areas in which relatively minor hardware enhancements would reap great rewards in the execution efficiency of these languages, and narrow the performance gap with traditional languages. The key areas in which DSI would benefit from hardware support are context windows, allocation instructions and DSI-style signals, especially signals for trapping on reference tags, which would benefit dynamic type checking and suspension detection. Wise has identified a number of other features that would be useful for hardware-assisted heap management, such as support for reference counts [Wis85a,DW94].

Next: 4. The DSI Kernel Up: An Architecture for Parallel Previous: 2. A Primer

Eric Jeschke
1999-07-08