Basic Architecture of Parallel Systems


In the following, antique and quaint computers are mentioned, often made by companies that died back when Reagan was President of the U.S., i.e., long before you were born. One reason for this is that the 1980's were a time of exploration and plain old fumbling around to define and design high-performance parallel machines, so the types of systems were more varied than now. However, the most important reason is that the ideas and motivations are still around, and some of those system approaches are being revived in the 2010's. No vendor will ever admit to that, so the buzz-names will differ. The systems themselves are more capable than their predecessors, thanks both to Moore's Law and to work that has made previously impractical approaches viable.

Parallel hardware systems now can be roughly divided into two groups: shared memory, where there is a single address space and physical memory system, and distributed memory, where each processor has its share of the system's memory attached to it. Here address space refers to what you as a programmer see. On a four-processor shared memory machine you might declare an array A(1000000) of length 1000000 and access any part of it by simply referencing it via A(k) from any processor. A distributed memory machine requires you to declare A in four parts, each 250000 entries long - and when you reference an entry you must keep track of both the entry and which processor "owns" that particular datum.
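
The bookkeeping described above can be sketched in a few lines. This is only an illustration: it assumes a block distribution (processor 0 owns the first 250000 entries, and so on), and the function names owner_and_offset and global_index are hypothetical, not part of any library.

```python
# Sketch of the index bookkeeping for a block-distributed array.
# Assumptions: block distribution, N divisible by P, 1-based indices as in A(k).
N = 1_000_000       # global array length, as in the text
P = 4               # number of processors
BLOCK = N // P      # 250000 entries owned by each processor

def owner_and_offset(k):
    """Map a 1-based global index k to (owning processor, 1-based local index)."""
    if not 1 <= k <= N:
        raise IndexError(k)
    owner, local = divmod(k - 1, BLOCK)
    return owner, local + 1

def global_index(owner, local):
    """Inverse mapping: (processor, 1-based local index) back to the global index."""
    return owner * BLOCK + local

print(owner_and_offset(1))        # (0, 1)
print(owner_and_offset(250001))   # (1, 1)
print(owner_and_offset(1000000))  # (3, 250000)
```

On a shared memory machine none of this is needed; A(k) is simply A(k) on every processor.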

The division is not strict; the Illinois Cedar machine (circa late 1980's) had local memory attached to each group of eight processors forming a shared memory component, with each processor able to access a "global" shared memory as well. Both HP/Convex and SGI have had "distributed shared memory" machines, which present a single address space to the user, but have that memory physically distributed. The OS and hardware then handle accessing the right entry and the user does not have to keep track of which processor "owns" it. They are also called NUMA machines, for Nonuniform Memory Access, since it usually takes longer to access an operand physically residing in a remote memory than in a local one. Another term that SGI played up was CC-NUMA: cache-coherent NUMA.

First, some understanding of basic uniprocessor memory systems is needed. The emphasis on memory systems here is because of the fundamental performance principle of scientific computing: most numerical computations are limited not by processor speed, but by the time of getting data to and from the processor.


Caches

A cache is a small, fast memory located near or on-chip with a processor. Access to a memory word causes an entire line or block of words to be loaded into the cache (line sizes are typically 4 to 32 words of 8 bytes each).

Cache Access Times

Time to access a word from cache is usually 10 or more times faster than getting it from memory. This ratio has held surprisingly constant over 30 years now, so it is a reasonable number to keep in mind. Each decade has its oddball weird allegedly-HPC computers that have much smaller or larger ratios, but the systems that survive commercially fall into the 10x ratio range.

Two Flavors of Locality

The reason for using a cache and cache lines is based on data locality. The first flavor is spatial locality: if you used the word at location m on one step, the next word you access is likely to have an address near or adjacent to the one just accessed. When the cache is full and a new line is brought in, some line must be removed. The most commonly used replacement policy is LRU: least recently used. The line that was accessed the most distantly in time is replaced. Here the idea is based on the second flavor, temporal locality: recently accessed words are more likely to be accessed again soon (think about a loop index variable, for example).
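
The effect of locality on a cache with LRU replacement can be seen in a toy simulation. This is a minimal sketch under stated assumptions: the cache is fully associative, the line size (8 words) and capacity (4 lines) are made-up illustrative values, and count_misses is a hypothetical helper name.

```python
from collections import OrderedDict

def count_misses(addresses, line_words=8, num_lines=4):
    """Count misses for a fully associative cache with LRU replacement.

    Assumed toy parameters: 8-word lines, 4 lines of capacity.
    """
    cache = OrderedDict()    # keys are line numbers, least recently used first
    misses = 0
    for addr in addresses:
        line = addr // line_words
        if line in cache:
            cache.move_to_end(line)         # hit: mark as most recently used
        else:
            misses += 1                     # miss: the whole line is loaded
            if len(cache) == num_lines:
                cache.popitem(last=False)   # evict the least recently used line
            cache[line] = True
    return misses

sequential = list(range(64))            # adjacent words: good spatial locality
strided = [i * 8 for i in range(64)]    # one word per line: no spatial locality
reuse = [0, 8, 16] * 10                 # same few words again: temporal locality

print(count_misses(sequential))  # 8
print(count_misses(strided))     # 64
print(count_misses(reuse))       # 3
```

All three access streams are short, but the strided one misses on every access because no loaded line is ever reused.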

A modified line can be written back to memory in one of two ways:

  1. Write-back: when a word is stored, its value in the cache is changed and a dirty bit is set for its cache line. When a line is replaced, if its dirty bit is set the line is written back to memory; otherwise it is discarded since the version in memory is the same as the version in cache.
  2. Write-through: when a word is stored, send the modified cache line to memory immediately.

Write-back is generally more efficient, since it involves many fewer stores to memory. Write-through makes some parallel processing easier, as will be seen later.
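
The difference in memory traffic between the two policies can be counted with a small simulation. This is a sketch, not a model of any real cache: it assumes a fully associative LRU cache, that a store to an uncached line first brings that line into the cache, and that dirty lines are flushed when the run ends. The name memory_writes is hypothetical.

```python
from collections import OrderedDict

def memory_writes(stores, policy, line_words=8, num_lines=4):
    """Count writes to main memory caused by a sequence of store addresses."""
    cache = OrderedDict()    # line number -> dirty bit, least recently used first
    writes = 0
    for addr in stores:
        line = addr // line_words
        if line not in cache:
            if len(cache) == num_lines:
                _, dirty = cache.popitem(last=False)   # evict the LRU line
                if dirty:
                    writes += 1        # write-back of a dirty line on replacement
            cache[line] = False
        cache.move_to_end(line)
        if policy == "write-through":
            writes += 1                # every store goes to memory immediately
        else:
            cache[line] = True         # write-back: only set the dirty bit
    if policy == "write-back":
        writes += sum(cache.values())  # assumed final flush of dirty lines
    return writes

stores = [0] * 100 + [8] * 100         # 200 stores touching only two lines
print(memory_writes(stores, "write-through"))  # 200
print(memory_writes(stores, "write-back"))     # 2
```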

Multilevel Caches

Modern processor design uses multiple levels of cache; three is common and four increasingly so. This trend is helpful for serial computing - otherwise vendors would not build them. However, this deep memory hierarchy has some serious consequences for parallel computing. Essentially, when processors need to communicate by sending a message or datum from one to the other, that datum must burrow its way upwards through the first processor's memory hierarchy, then downwards through the second processor's memory hierarchy, before it can be used by the second processor. So in addition to the cost of sending the data across whatever communication substratum exists, there is the cost of traversing two memory systems - and perturbing data in the caches along the way.


Shared Memory Processor-Memory Organization

Getting on the bus

A straightforward way to connect several processors together to build a multiprocessor is to have each processor and memory module hang off of a bus, a memory channel which on PC's often takes the form of a broad ribbon connector. The physical connections are quite simple. Most bus structures allow an arbitrary (but not too large) number of devices to communicate over the bus. Bus protocols were initially designed to allow a single processor and one or more disk or tape controllers to communicate with memory. If the I/O controllers are replaced by processors, you have a small single-bus multiprocessor.

The problem with this design is that processors contend for access to the bus. If processor P is fetching an instruction, all other processors must wait until the bus is free. If there are only two processors they can perform close to their maximum rate since the bus can alternate between them: as one processor is decoding and executing an instruction, the other can be using the bus to fetch its next instruction. However, when a third processor is added performance begins to degrade. Usually, connecting 10 processors to the bus flattens out the performance curve, so that adding more processors does not increase performance. The memory and bus have a fixed bandwidth, determined by a combination of the cycle time of the memory and the bus protocol, and in a single-bus multiprocessor this bandwidth is divided among several processors. If the processor cycle time is slow compared to the memory cycle, a fairly large number of processors can be accommodated by this plan, but since processor cycles are usually much faster than memory cycles this scheme is not scalable.

Stay off the bus, and in your own locality

A modification to this design will improve performance, but it cannot indefinitely postpone the flattening of the performance curve. If each processor has its own local cache and data locality of the program is good, then it is likely that the data it needs is in the local cache. A good cache hit rate will greatly reduce the number of bus accesses a processor makes and thus improve overall efficiency. The dogleg of the performance curve, which identifies a point where it is still cost-effective to add processors, extends to around 20 processors, and the curve does not flatten out until around 30 processors.
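
Why the curve flattens, and why caches only push the flattening further out, can be illustrated with a deliberately crude model. Everything here is an assumption made for illustration: each processor is taken to need the bus for a fixed fraction of its time (bus_fraction), a cache hit is taken to need no bus access at all, and the function name effective_processors is hypothetical. It is not meant to reproduce the specific processor counts quoted above.

```python
def effective_processors(p, bus_fraction, hit_rate=0.0):
    """Toy model: how many processors' worth of work p processors deliver.

    bus_fraction: assumed share of the time one processor needs the bus
                  when it has no cache.
    hit_rate:     assumed fraction of accesses served from the local cache.
    """
    demand = bus_fraction * (1.0 - hit_rate)   # bus share needed per processor
    if demand <= 0.0:
        return float(p)                        # the bus is never the bottleneck
    return min(float(p), 1.0 / demand)         # the bus saturates at 1/demand

for p in (2, 10, 20, 40):
    no_cache = effective_processors(p, bus_fraction=0.1)
    cached = effective_processors(p, bus_fraction=0.1, hit_rate=0.5)
    print(p, round(no_cache, 2), round(cached, 2))
```

In this model a better hit rate moves the saturation point to a larger processor count, but for any hit rate below 100% the bus still saturates eventually.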

Incoherency

Giving each processor its own cache introduces the cache coherency problem. Suppose two processors use data item A, so A ends up in the cache of both processors. Next suppose processor 1 performs a calculation that changes A. When it is done, the new value of A is written out to main memory. At a later time, processor 2 needs to fetch A. However, since A was already in its cache, it will use the cached value and not the newly updated value calculated by processor 1. Maintaining a consistent version of shared data requires providing new versions of the cached data to each processor whenever one of the processors updates its copy. The typical approach is called a "snooping protocol", where each processor "listens" on the bus for address requests and update postings.
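
The scenario above can be replayed in a few lines. This is a minimal sketch: it assumes write-through caches and a write-invalidate style of snooping (one of several possible snooping protocols), and the Bus and Cache classes are hypothetical illustrations rather than a model of real hardware.

```python
class Bus:
    """Shared bus: holds main memory and the list of attached caches."""
    def __init__(self):
        self.memory = {}
        self.caches = []

class Cache:
    def __init__(self, bus, snoop):
        self.bus = bus
        self.snoop = snoop      # True: react to other processors' writes
        self.lines = {}         # address -> cached value
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:               # miss: fetch from memory
            self.lines[addr] = self.bus.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.bus.memory[addr] = value            # assumed write-through
        for other in self.bus.caches:            # the write appears on the bus
            if other is not self and other.snoop:
                other.lines.pop(addr, None)      # invalidate the stale copy

def run(snoop):
    bus = Bus()
    bus.memory["A"] = 1
    p1, p2 = Cache(bus, snoop), Cache(bus, snoop)
    p1.read("A")
    p2.read("A")                # A ends up in the cache of both processors
    p1.write("A", 2)            # processor 1 changes A
    return p2.read("A")         # what does processor 2 see?

print(run(snoop=False))  # 1, the stale cached value
print(run(snoop=True))   # 2, refetched after the invalidation
```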

Switching to a different way

Another way of building a shared memory multiprocessor is to replace the bus with a switch that routes requests from a processor to one of several different memory modules. Even though there are several physical memories, there is one large virtual address space. The advantage of this organization is based on having switches that can handle multiple requests in parallel. Each processor can be paired up with a memory, and each can then run at full speed as it accesses the memory it is currently connected to. Contention still occurs, since if two processors make requests of the same memory module only one will be given access and the other will be blocked.
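
Contention at the memory modules can be sketched as follows. The assumptions are labeled in the code: addresses are interleaved across modules by their low-order part (address modulo the number of modules), and ties are resolved in list order. The function name served_and_blocked is hypothetical.

```python
def served_and_blocked(requests, num_modules):
    """One switch cycle: decide which processors reach their memory module.

    requests: list of (processor, address) pairs issued in the same cycle.
    Assumptions: module = address mod num_modules, earlier requests win.
    """
    busy = set()
    served, blocked = [], []
    for proc, addr in requests:
        module = addr % num_modules
        if module in busy:
            blocked.append(proc)    # module already claimed this cycle
        else:
            busy.add(module)
            served.append(proc)
    return served, blocked

# Four processors hitting four different modules: all proceed in parallel.
print(served_and_blocked([(0, 0), (1, 1), (2, 2), (3, 3)], 4))
# ([0, 1, 2, 3], [])

# Addresses 0 and 4 map to module 0, and 1 and 5 map to module 1.
print(served_and_blocked([(0, 0), (1, 4), (2, 1), (3, 5)], 4))
# ([0, 2], [1, 3])
```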

Various switch designs include


Distributed Memory Machines

The great divide in parallel architecture is shared versus distributed memory systems, although the distinctions are becoming more blurred as time goes on. In practice, the distinction to a programmer is whether the memory is logically shared or logically distributed. That in turn depends on whether the memory presents a single address space or multiple address spaces. Logically shared memory machines are ones with a single address space. As we will see, this makes the porting and programming problem much easier.

As the material on interconnection networks between processors and memory shows, the problems of bandwidth limits and network congestion can be alleviated by having a large cache with each processor - at the price of worrying about cache coherency. Carrying this idea to its extreme means moving all of the memory to be local to the processors. This gives a distributed memory system, one where each processor has its own memory - and its own address space. But now the programmer is required to

Design scalability

The advantage of distributed memory systems is that they are more "scalable". The word scalable is bandied about a great deal in parallel computing, but it is like the word "soup" - it means drastically different things to different people at different times. Here scalability is primarily of an architectural variety: distributed memory machines consist of fungible components, and you can buy more and plug them in as needed. With a suitable problem and code, their performance can also be scalable. But using distributed memory introduces two sources of overhead: it takes time to construct and send a message from one processor to another, and a receiving processor must be interrupted or flagged to deal with messages from other processors.

So in a distributed memory system the memory is associated with individual processors and a processor is only able to address its own memory. Sometimes this is called a multicomputer system, since the building blocks in the system are themselves small computer systems complete with processor and memory. The IBM SP/2, for example, was originally just a collection of RS/6000 workstations tied together with a fast interconnect network, with each RS/6000 running its own copy of the OS. The IU CS Department's Odin cluster is essentially standard Amdahl nodes with a fast Infiniband network.

In a distributed memory system, each processor can utilize the full bandwidth to its own local memory without interference from other processors. There is no inherent limit to the number of processors as with bus-based systems. The size of the system is now constrained only by the network used to connect processors to each other. There are no cache coherency problems (more accurately, the user becomes responsible for maintaining coherency). Each processor is in charge of its own data, and other processors cannot access it without going through explicit actions commanded by the program.

Programming distributed memory systems

Programming on a distributed memory machine means organizing your program as a set of independent tasks that communicate with each other via messages. The programmer must be aware of where data is stored (that is, on which processor it resides), which introduces a new form of locality in algorithm design. An algorithm that allows data to be partitioned into discrete units and then runs with minimal communication between units will be more efficient than an algorithm that requires random access to global structures.
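
The style of program described above can be imitated on a single machine. The following is only a sketch: it uses Python threads and a queue to stand in for independent tasks and a message channel (a real distributed memory code would use a message passing library such as MPI), and the names worker and parallel_sum are hypothetical. Each task works only on its own partition and sends exactly one message.

```python
import queue
import threading

def worker(rank, local_data, outbox):
    """One task: compute on its private partition, then send a single message."""
    outbox.put((rank, sum(local_data)))

def parallel_sum(data, num_tasks):
    """Partition data into blocks, one per task, and combine the partial sums."""
    chunk = (len(data) + num_tasks - 1) // num_tasks
    inbox = queue.Queue()               # stands in for the interconnect
    tasks = []
    for rank in range(num_tasks):
        part = data[rank * chunk:(rank + 1) * chunk]   # slice copy: private data
        t = threading.Thread(target=worker, args=(rank, part, inbox))
        t.start()
        tasks.append(t)
    total = 0
    for _ in range(num_tasks):
        _rank, partial = inbox.get()    # receive one message per task
        total += partial
    for t in tasks:
        t.join()
    return total

print(parallel_sum(list(range(1, 1001)), 4))  # 500500
```

Only num_tasks small messages are exchanged no matter how long the data is, which is the kind of partitioning with minimal communication that the paragraph above recommends.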

Semaphores, monitors, and other concurrent programming techniques are not directly applicable on distributed memory machines, but they can be implemented by a layered software approach. User code can invoke a semaphore, for example, which is itself implemented by passing a message to the node that "owns" the semaphore. This approach is not efficient.

Which programming style is easier - shared memory with semaphores, etc., or distributed memory with message passing - is often a matter of background; however, most application end users find the shared memory model easier to deal with, at least initially. The message passing style can fit well with an object oriented programming methodology, and if a program is already organized in terms of objects it may be relatively easy to adapt it for a distributed memory system. Beware: it is not just the computational core of a program that must be rewritten to shift to distributed memory. All the parts that are usually seen as minor (I/O, initializing data structures, passing around parameters from the command line, ...) must also reflect the shift. Choosing to implement a program in shared memory versus distributed memory is usually based on the amount of information that must be shared by parallel tasks. Whatever information is shared among tasks must be copied from one node to another via messages in a distributed memory system, and this overhead may reduce efficiency to the point where a shared memory system is preferred. The lower cost of distributed memory clusters has in practice meant that in HPC, distributed memory programming dominates. So usually, you have no choice.


What's a processor

Single nodes in a distributed memory system are often called processing elements, or PEs. Modern systems have nodes with multiple processors and each processor has multiple cores, which in turn may be hyper-threaded, introducing another level of parallelism. Any of those levels can be considered as a single PE. Using the PE terminology helps avoid some misunderstandings at the risk of introducing others. Always specify what you are identifying as a single PE.

Notation

To avoid issues about what constitutes a PE, for MPI distributed memory programming in this course the term process will be used to refer to one of the independent cooperating communicating elements that are more traditionally called "processors". A single MPI process may be mapped to a node, a CPU processor, a core on a quad-core processor, or (in the future) an individual vector stream in a GPU. Beware that in many books, codes, and other material the term "processor" may be used, but interpret that as "process". In distributed memory coding, a single processor may have many MPI processes mapped to it, and in some cases it is useful to map a single MPI process to a single node even if it has 2 quad-core processors.

View of the system by a single PE

To any PE, the other PEs are like I/O devices. To send a message to another PE, a processor copies information into a buffer in its local memory and then tells its local controller to transfer the information to an external device, much the same way a disk controller in a desktop machine would write a block on a disk drive. In this case, however, the block of data is transferred over the interconnection network to an I/O controller in the receiving node. That controller finds room for the incoming message in its local memory and then notifies the processor that a message has arrived. To avoid tying up the computations while the communication is going on, the ancient Intel Paragon had each PE contain two i860 processors. The intent was that one handles communications alone and the other handles computation, allowing the two to be overlapped. In practice most people just used both as computational PE's.


Hybrid Memory Organization

As always happens in computer science when there are two paradigms, each with complementary strengths, hybridization efforts try to build systems that have the strengths of both. A blurring of the distinction between shared and distributed memory systems has always been around, and takes at least three forms:
  1. Hybrid systems mix the two flavors of memory. One form consists of an array of shared memory multiprocessors, tied together with an ultrafast network. Another flavor is to connect shared memory multiprocessors together with a global memory system (as with the Cedar machine mentioned above), separate from the different shared memories.
  2. Parallel languages such as High Performance Fortran (HPF), Titanium, UPC, or HPC++ rely on compilers to map user data to different address spaces, and then a runtime system manages the necessary message passing for a user.
  3. "Distributed shared memory" systems have physically distributed memory, but rely on a combination of operating system and hardware to move address references where they are needed. Here the user has a single logically shared address space, but accessing data belonging to another processor can take significantly longer than accessing it from the local memory - leading to the term NUMA, or non-uniform memory access.

A historical example of a hybrid system was the SGI Origin 2000. It used a 4D hypercube to connect up to 16 "nodes". Each node was a board with two processors sharing a single memory. Systems with more than 16 nodes were then tied together with a "Craylink Interconnect", high speed links and routers connecting the hypercubes. The user saw a single address space and did not need to explicitly partition program data or write message passing code. However, the practical experience with the machine was that to get good performance, the user needed to be aware of, and take an active role in, locating data on the machine.