Basic Architecture of Parallel Systems
In the following, antique and quaint computers are mentioned, often made by
companies that died back when Reagan was President of the U.S. I.e., long
before you were born. One reason for this is that the 1980's were a time of
exploration and plain old fumbling around to define and design high-performance
parallel machines, so the types of systems were more varied than now. However,
the most important reason is that the ideas and motivations are still around,
and some of those systems approaches are being revived in the 2010's. No vendor
will ever admit to that, so the buzz-names will differ. The systems themselves
are more capable than their predecessors, both from Moore's Law and work
that has made previously impractical approaches viable.
Parallel hardware systems now can be roughly divided into two groups:
shared memory, where there is a single address space
and physical memory system, and distributed memory, where
each processor has its share of the system's memory attached
to it. Here address space refers to what you as a
programmer see. On a four-processor
shared memory machine you might declare an array A(1000000) of length
1000000 and access any part of it by simply referencing it via
A(k) from any processor. A distributed memory machine requires you
to declare A in four parts, each 250000 entries long - and when you
reference an entry you must keep track of both the entry and
which processor "owns" that particular datum.
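The bookkeeping described above can be sketched in a few lines. This is a minimal illustration, assuming a simple block distribution and 0-based indices (the A(k) notation above is 1-based); the names owner_of and global_index are made up for this example.

```python
# Block distribution of a 1,000,000-entry array over 4 processors.
N = 1_000_000
P = 4
BLOCK = N // P  # 250,000 entries per processor (N divides evenly here)

def owner_of(k):
    """Map a global 0-based index k to (owning processor, local index)."""
    return k // BLOCK, k % BLOCK

def global_index(proc, local):
    """Inverse mapping: (processor, local index) back to the global index."""
    return proc * BLOCK + local

print(owner_of(0))        # (0, 0)
print(owner_of(250_000))  # (1, 0)
print(owner_of(999_999))  # (3, 249999)
```

On a shared memory machine none of this is needed: the index k alone identifies the entry.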
The division is not strict; the
Illinois Cedar machine
(circa late 1980's) had local memory attached to
each group of eight processors forming a shared memory
component, with each processor able to access a "global" shared
memory as well. Both HP/Convex and SGI have had "distributed
shared memory" machines, which present a single address space
to the user, but have that memory physically distributed.
The OS and hardware then handle accessing the right entry, and
the user does not have to keep track of which processor "owns" it.
They are also called NUMA machines, for Nonuniform Memory
Access, since it usually takes longer to access an operand physically
residing on a remote memory than on a local one. Another term that
SGI played up was CC-NUMA: cache-coherent NUMA.
First, some understanding of basic uniprocessor memory systems is
needed. The emphasis on memory systems here is because of
the fundamental performance principle of scientific computing:
most
numerical computations are limited not by processor speed,
but by the time of getting data to and from the processor.
A cache is a small, fast memory located near or on-chip with a processor.
Access to a memory word causes an entire line or block
of words
to be loaded into the cache (line sizes are typically 4-32 8-byte words).
Cache Access Times
Time to access a word from cache is usually 10 or more times faster than
getting it from memory.
This ratio has held surprisingly constant over 30 years now, so it is a
reasonable number to keep in mind. Each decade has its oddball weird
allegedly-HPC computers that have much smaller or larger ratios, but
the systems that survive commercially fall into the 10x ratio range.
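To see what the 10x ratio means for a running program, a standard back-of-the-envelope model weights the two access times by the cache hit rate. The sketch below assumes a single cache level and time units in which a cache access costs 1; the function name and the sample hit rates are illustrative only.

```python
def average_access_time(hit_rate, t_cache=1.0, t_memory=10.0):
    """Average time per access for a one-level cache.

    hit_rate is the fraction of accesses satisfied by the cache.
    The default times reflect the rough 10x ratio quoted above.
    """
    return hit_rate * t_cache + (1.0 - hit_rate) * t_memory

for h in (0.5, 0.9, 0.99):
    print(h, average_access_time(h))
```

Even at a 90% hit rate the average access costs nearly twice a pure cache access, which is why the hit rate matters so much.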
Two Flavors of Locality
The reason for using a cache and cache lines
is based on data locality: if you used the word at location
m on one step, the next word you access is likely to have an
address near or adjacent to the one just accessed.
When the cache is full and a new line is brought in, some line
must be removed. The most commonly used replacement policy
is LRU: least recently used. The line whose last access lies furthest
back in time is replaced. Here the idea is based on
temporal locality: recently accessed words are likely
to be accessed again soon (think about a loop index variable, for example).
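Both flavors of locality can be seen in a toy simulation. The sketch below assumes a fully associative cache with LRU replacement, 4 lines of 8 words each; real caches are usually set associative and far larger, and the function name count_misses is made up for this example.

```python
from collections import OrderedDict

def count_misses(addresses, num_lines=4, words_per_line=8):
    """Count misses for a fully associative LRU cache of num_lines lines."""
    cache = OrderedDict()  # keys are line numbers, ordered oldest to newest use
    misses = 0
    for addr in addresses:
        line = addr // words_per_line
        if line in cache:
            cache.move_to_end(line)        # hit: mark as most recently used
        else:
            misses += 1
            if len(cache) == num_lines:
                cache.popitem(last=False)  # evict the least recently used line
            cache[line] = True
    return misses

sequential = list(range(64))            # stride 1: adjacent words
strided = [i * 8 for i in range(64)]    # stride 8: one word used per line
repeated = list(range(16)) * 10         # the same 16 words, over and over

print(count_misses(sequential))  # 8
print(count_misses(strided))     # 64
print(count_misses(repeated))    # 2
```

The same 64 accesses cost 8 misses with stride 1 and 64 misses with stride 8, and the repeated loop misses only on its first pass.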
A modified line can be written back to memory in two ways:
- Write-back: when a word is stored, its value in the cache is
changed and a dirty bit is set for its cache line.
When a line is replaced, if its dirty bit is set the line
is written back to memory; otherwise it is discarded since the
version in memory is the same as the version in cache.
- Write-through: when a word is stored, send the modified cache
line to memory immediately.
Write-back is more efficient generally, since it involves many fewer
stores to memory. Write-through makes some parallel processing easier as
will be seen later.
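The difference in memory traffic between the two policies can be counted directly. This is a simplified sketch: it assumes the cache is large enough that no line is evicted until the very end, when every dirty line is flushed once. The function name and the sample store sequence are illustrative.

```python
def memory_writes(stores, policy, words_per_line=8):
    """Count writes to memory for a sequence of stored word addresses.

    Simplification: nothing is evicted until the end, when every dirty
    line is flushed exactly once.
    """
    if policy == "write-through":
        return len(stores)              # every store goes to memory at once
    if policy == "write-back":
        dirty = {addr // words_per_line for addr in stores}
        return len(dirty)               # each dirty line is written once
    raise ValueError(policy)

stores = [0, 1, 2, 3, 0, 1, 2, 3, 8, 9]
print(memory_writes(stores, "write-through"))  # 10
print(memory_writes(stores, "write-back"))     # 2
```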
Multilevel Caches
Modern processor design uses multiple
levels of cache; three is common and four increasingly so. This
trend is helpful for serial computing - otherwise vendors would
not build them. However, this deep memory hierarchy has some
serious consequences for parallel computing. Essentially, when
processors need to communicate by sending a message or datum from
one to the other, that datum must burrow its way upwards through the
first processor's memory hierarchy, then downwards through the
second processor's memory hierarchy, before it can be used by the
second processor. So in addition to the cost
of sending the data across whatever communication substratum exists,
there is the cost of traversing two memory systems - and perturbing
data in the caches along the way.
Getting on the bus
A straightforward way to connect several processors together to
build a multiprocessor is to have each processor and memory
module hang off of a bus, a memory channel which on PC's often takes
the form of a broad ribbon connector.
The physical
connections are quite simple. Most bus structures allow an
arbitrary (but not too large) number of devices to communicate
over the bus. Bus protocols were initially designed to allow a
single processor and one or more disk or tape controllers to
communicate with memory. If the I/O controllers are replaced by
processors, you have a small single-bus multiprocessor.
The
problem with this design is that processors contend for
access to the bus.
If processor P is fetching an instruction,
all other processors must wait until
the bus is free. If
there are only two processors they can perform close to their
maximum rate since the bus can alternate between them: as one
processor is decoding and executing an instruction, the other can
be using the bus to fetch its next instruction. However, when a
third processor is added, performance begins to degrade. Usually
10 processors connected to the bus flatten out the
performance curve, so that adding more
processors does not increase performance.
The memory and bus have a fixed bandwidth,
determined by a combination of the cycle time of the memory and
the bus protocol, and in a single-bus multiprocessor this
bandwidth is divided among several processors. If the processor
cycle time is slow compared to the memory cycle, a fairly
large number of processors can be accommodated by this plan, but
since processor cycles are usually much faster than memory
cycles this scheme is not scalable.
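A crude way to picture the fixed-bandwidth limit: the bus can deliver only so much data per second, no matter how many processors ask for it. The sketch below ignores queueing and arbitration delays, and the bandwidth figures in it are hypothetical numbers chosen only to make the arithmetic easy.

```python
def max_useful_processors(bus_bandwidth, demand_per_processor):
    """Number of processors the bus can feed before it saturates."""
    return bus_bandwidth / demand_per_processor

def aggregate_throughput(p, bus_bandwidth, demand_per_processor):
    """Total data rate delivered to p processors, capped by the bus."""
    return min(p * demand_per_processor, bus_bandwidth)

BUS = 800      # hypothetical bus bandwidth, MB/s
DEMAND = 100   # hypothetical demand of one processor, MB/s

print(max_useful_processors(BUS, DEMAND))      # 8.0
print(aggregate_throughput(4, BUS, DEMAND))    # 400
print(aggregate_throughput(16, BUS, DEMAND))   # 800
```

Past the saturation point, doubling the processor count delivers no more data, which is the flat part of the performance curve.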
Stay off the bus, and in your own locality
A modification
to this design will improve performance, but it cannot
indefinitely postpone the flattening of the performance curve. If
each processor has its own local cache and data locality
of the program is good, then it is likely
that the data it needs is in the
local cache. A good cache hit rate will greatly reduce the
number of bus accesses a processor makes and thus improve overall
efficiency. The dogleg of the performance curve, which identifies
a point where it is still cost-effective to add processors, extends
to around 20 processors, and the curve does not flatten out
until around 30 processors.
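The effect of a local cache can be folded into a simple estimate: only cache misses go out on the bus. The following sketch assumes every miss costs one bus transaction and ignores coherency traffic; the capacity and access-rate numbers are hypothetical and are not meant to reproduce the 20 and 30 processor figures above.

```python
def saturation_point(bus_capacity, hit_rate, accesses_per_second):
    """Processors the bus can serve when only cache misses use the bus.

    bus_capacity and accesses_per_second are in transactions per second;
    each miss is assumed to cost exactly one bus transaction.
    """
    misses_per_second = (1.0 - hit_rate) * accesses_per_second
    return bus_capacity / misses_per_second

print(saturation_point(1000, 0.0, 100))   # 10.0 (no cache: every access uses the bus)
print(saturation_point(1000, 0.5, 100))   # 20.0 (half the accesses hit in cache)
```

A better hit rate pushes the saturation point out, but for any hit rate below 100% the bus still saturates eventually.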
Incoherency
Giving each processor its own cache
introduces the cache coherency problem.
Suppose two processors use data item A,
so A
ends up in the cache of both processors. Next suppose processor 1
performs a calculation that changes A.
When it is done, the new
value of A is written out to main memory.
At a later time, processor 2
needs to fetch A. However, since A was already
in its cache,
it will use the cached value and not the newly updated value
calculated by processor 1. Maintaining a consistent version of
shared data requires providing new versions of the cached data to
each processor whenever one of the processors updates its copy.
The typical approach is called a "snooping protocol", where each
processor "listens" on the bus for address requests and update
postings.
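The stale-value scenario and its cure can be played out in a toy model. The sketch below assumes write-through caches and a write-invalidate snooping scheme, which is only one of several possible protocols; the Bus and Cache classes are invented for this illustration.

```python
class Bus:
    """Shared bus: holds main memory and broadcasts writes to every cache."""
    def __init__(self):
        self.memory = {}
        self.caches = []

    def write(self, writer, addr, value):
        self.memory[addr] = value
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_write(addr)     # every other cache hears the write

class Cache:
    def __init__(self, bus, snooping=True):
        self.bus = bus
        self.snooping = snooping
        self.lines = {}
        bus.caches.append(self)

    def load(self, addr):
        if addr not in self.lines:          # miss: fetch from memory
            self.lines[addr] = self.bus.memory[addr]
        return self.lines[addr]

    def store(self, addr, value):
        self.lines[addr] = value
        self.bus.write(self, addr, value)   # write-through to memory

    def snoop_write(self, addr):
        if self.snooping:
            self.lines.pop(addr, None)      # invalidate our stale copy

def run(snooping):
    bus = Bus()
    bus.memory["A"] = 1
    p1, p2 = Cache(bus, snooping), Cache(bus, snooping)
    p1.load("A")
    p2.load("A")            # A is now in both caches
    p1.store("A", 2)        # processor 1 updates A
    return p2.load("A")     # what does processor 2 see?

print(run(snooping=False))  # 1: processor 2 reads its stale cached copy
print(run(snooping=True))   # 2: the copy was invalidated, so A is refetched
```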
Switching to a different way
Another way of building a shared memory multiprocessor is to
replace the
bus with a switch that routes requests from a
processor to one of several different memory modules. Even though
there are several physical memories, there is one large virtual
address space. The advantage of this organization is based on having
switches that can handle multiple requests in parallel. Each
processor can be paired up with a memory, and each can then run
at full speed as it accesses the memory it is currently connected
to. Contention still occurs, since if two processors make
requests of the same memory module only one will be given access
and the other will be blocked.
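One cycle of such a switch can be sketched as follows. The arbitration rule here (lowest processor id wins) is an assumption made for the example; real switches use a variety of priority and fairness schemes.

```python
def arbitrate(requests):
    """Grant at most one request per memory module in one switch cycle.

    requests maps processor id -> requested memory module.
    Returns (granted, blocked) as lists of processor ids.
    """
    busy = set()
    granted, blocked = [], []
    for proc in sorted(requests):       # fixed priority: lowest id wins
        module = requests[proc]
        if module in busy:
            blocked.append(proc)        # module already taken this cycle
        else:
            busy.add(module)
            granted.append(proc)
    return granted, blocked

# Every processor wants a different module: all proceed in parallel.
print(arbitrate({0: 0, 1: 1, 2: 2, 3: 3}))   # ([0, 1, 2, 3], [])
# Processors 0 and 1 both want module 2: one of them is blocked.
print(arbitrate({0: 2, 1: 2, 2: 0, 3: 1}))   # ([0, 2, 3], [1])
```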
Various switch designs include
- cross-bar
- tree
- butterfly
- shuffle-exchange
- fat tree
The great divide in parallel architecture is shared versus distributed
memory systems, although the distinctions are becoming more blurred
as time goes on.
In practice, the distinction to a programmer is whether
the memory is logically shared or distributed. That in
turn depends on whether the memory presents a single or a multiple
address space.
Logically shared memory machines are ones with a single
address space. As we will see, this makes the porting and
programming problem much easier.
As the material on interconnection networks
between processors and memory shows,
the problems of bandwidth limits and network congestion can be
alleviated by having a large cache with each processor - at the price
of worrying about cache coherency. Carrying this idea to the extreme
moves all of the memory to be local to the processors. This gives a
distributed memory system, one where each processor has its own memory - and
its own address space. But now the programmer is required to
- explicitly distribute the program data amongst the processors
- synchronize between them
- communicate results between the processors by sending messages
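The three obligations in the list above show up even in a trivial computation such as summing an array. The sketch below imitates the pattern on a single machine: Python threads stand in for processors and a queue stands in for the network, so it shows only the structure of the program, not a real distributed memory run. All names are made up for this example.

```python
import queue
import threading

def worker(rank, chunk, to_root):
    """Each task sums only the data it owns, then sends one message."""
    to_root.put((rank, sum(chunk)))     # communicate the partial result

def distributed_sum(data, num_tasks):
    size = len(data) // num_tasks       # assumes num_tasks divides len(data)
    to_root = queue.Queue()             # stands in for the network
    tasks = []
    for rank in range(num_tasks):
        chunk = data[rank * size:(rank + 1) * size]   # distribute the data
        t = threading.Thread(target=worker, args=(rank, chunk, to_root))
        t.start()
        tasks.append(t)
    for t in tasks:
        t.join()                        # synchronize: wait for every task
    partials = dict(to_root.get() for _ in range(num_tasks))
    return sum(partials.values())       # combine the received messages

print(distributed_sum(list(range(1000)), 4))  # 499500
```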
Design scalability
The advantage of distributed memory systems is that they are more
"scalable". The word scalable is bandied about a great deal in
parallel computing, but it is like the word "soup" - it means drastically
different things to different people at different times.
Here scalability is primarily of an architectural variety:
distributed memory machines consist of fungible components, and you
can buy more and plug them in as needed.
With a suitable problem and code, their performance can
also be scalable. But using distributed memory
introduces two sources of overhead: it takes time to construct
and send a message from one processor to another, and a receiving
processor must be interrupted or flagged to deal with messages from
other processors.
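A common first-order way to reason about message overhead is to charge each message a fixed startup cost plus a per-byte cost. In the sketch below the fixed cost lumps together message construction and the handling on the receiving side; the latency and bandwidth values are hypothetical round numbers, not measurements of any machine.

```python
def message_time(n_bytes, latency=1e-6, bandwidth=1e9):
    """Estimated seconds to send one message of n_bytes.

    latency   : fixed cost per message, in seconds (assumed value)
    bandwidth : sustained transfer rate, in bytes per second (assumed value)
    """
    return latency + n_bytes / bandwidth

many_small = 1000 * message_time(8)   # 1000 messages of one 8-byte word
one_large = message_time(8000)        # the same data in a single message

print(many_small, one_large)
```

Under these assumptions the 1000 small messages cost over a hundred times more than the single large one, which is why distributed memory codes try to aggregate data into fewer, larger messages.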
So in a distributed memory system the memory is associated
with individual processors and a processor is only able to
address its own memory. Sometimes this is called
a multicomputer system, since the building blocks
in the system are themselves small computer systems complete with
processor and memory. The IBM SP/2, for example, was originally just a
collection of RS/6000 workstations tied together with a fast
interconnect network, with each RS/6000 running its own copy of the OS.
The IU CS Department's Odin cluster is essentially standard Amdahl nodes
with a fast Infiniband network.
In a distributed memory system,
each processor can utilize the full bandwidth to its own local memory
without interference from other processors. There is no inherent
limit to the number of processors as with bus-based systems.
The size of the system is now constrained only by the
network used to connect processors to each other. There
are no cache coherency problems (more accurately, the user becomes
responsible for maintaining coherency). Each processor is in
charge of its own data, and other processors cannot access it without
going through explicit actions commanded by the program.
Programming distributed memory systems
Programming on a distributed memory machine means
organizing your program as a set of independent tasks
that communicate with each other via messages. The
programmer must be aware of where data is stored (that is,
on which processor it resides), which
introduces a new form of locality in algorithm design.
An algorithm that allows data to be partitioned into discrete
units and then runs with minimal communication between units will
be more efficient than an algorithm that requires random access
to global structures.
Semaphores, monitors, and other concurrent
programming techniques are not directly applicable on distributed
memory machines, but they can be implemented by a layered
software approach. User code can invoke a semaphore, for example,
which is itself implemented by passing a message to the node that
"owns" the semaphore. This approach is not efficient.
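The layered approach can be sketched as follows. Threads and queues again stand in for nodes and the network, and the message format ("acquire", "release", "stop") is invented for this example. Note that every acquire costs a round trip of two messages, which is where the inefficiency comes from.

```python
import queue
import threading

def semaphore_owner(inbox, count):
    """Runs on the node that owns the semaphore; serves request messages."""
    waiting = []                          # reply queues of blocked requesters
    while True:
        op, reply = inbox.get()
        if op == "stop":
            return
        if op == "acquire":
            if count > 0:
                count -= 1
                reply.put("granted")
            else:
                waiting.append(reply)     # requester stays blocked
        elif op == "release":
            if waiting:
                waiting.pop(0).put("granted")   # hand over to the next waiter
            else:
                count += 1

def acquire(inbox):
    reply = queue.Queue()
    inbox.put(("acquire", reply))         # message to the owner ...
    reply.get()                           # ... then wait for its answer

def release(inbox):
    inbox.put(("release", None))

inbox = queue.Queue()
owner = threading.Thread(target=semaphore_owner, args=(inbox, 1))
owner.start()

total = [0]

def task():
    for _ in range(100):
        acquire(inbox)
        total[0] += 1                     # critical section
        release(inbox)

workers = [threading.Thread(target=task) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
inbox.put(("stop", None))
owner.join()
print(total[0])  # 400
```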
Which programming style is easier - shared memory with
semaphores, etc. or distributed memory with message passing - is
often a matter of background; however, most application end users
find the shared
memory model easier to deal with, at least initially. The message passing style
can fit well with an object oriented programming methodology,
and if a program is already organized in terms of objects it may
be relatively easy to adapt it for a distributed memory system.
Beware: it is not just the computational core of a program that must
be rewritten to shift to distributed memory. All the parts that are
usually seen as minor (I/O, initializing data structures, passing
around parameters from the command line, ...) must also reflect the shift.
Choosing to implement a program in shared
memory versus distributed memory is usually based on the
amount of information that must be shared by parallel tasks.
Whatever information is shared among tasks must be copied from
one node to another via messages in a distributed memory system,
and this overhead may reduce efficiency to the point where a
shared memory system is preferred.
The lower cost of distributed memory clusters has in practice meant
that in HPC, distributed memory programming dominates. So usually,
you have no choice.
What's a processor
Single nodes in a distributed memory system are often called
processing elements, or PEs.
Modern systems have nodes with multiple processors and each processor
has multiple cores, which in turn may be hyper-threaded, introducing
another level of parallelism. Any of those levels can be considered
as a single PE. Using the PE terminology helps avoid some misunderstandings
at the risk of introducing others. Always specify what you are identifying
as a single PE.
Notation
To avoid issues about what constitutes a PE,
for MPI distributed memory programming in this course the
term process will be used to refer to one of the independent cooperating communicating
elements that are more traditionally called "processors". A single MPI process
may be mapped to a node, a CPU processor, a core on a quad-core processor, or (in the future)
an individual vector stream in a GPU.
Beware that in many books, codes, and other material the term "processor"
may be used, but interpret that as "process". In distributed memory coding,
a single processor may have many MPI processes mapped to it, and in some cases
it is useful to map a single MPI process to a single node even if it has 2 quad-core
processors.
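As an illustration of one possible mapping, the sketch below places one process per core on nodes with 2 quad-core processors, filling each node before moving to the next. This block placement is only an assumption for the example; in practice the placement is decided by the MPI launcher and its options, and the function name place is made up.

```python
def place(rank, sockets_per_node=2, cores_per_socket=4):
    """Map a process rank to (node, socket, core), one process per core."""
    cores_per_node = sockets_per_node * cores_per_socket
    node, within = divmod(rank, cores_per_node)
    socket, core = divmod(within, cores_per_socket)
    return node, socket, core

print(place(0))   # (0, 0, 0)
print(place(5))   # (0, 1, 1)
print(place(8))   # (1, 0, 0)
```

With this mapping, ranks 0 through 7 share a node and its memory, while ranks 0 and 8 can communicate only across the network.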
View of the system by a single PE
To any PE, the other PEs are like I/O
devices. To send a message to another PE, a processor copies
information into a buffer in its local memory and then tells
its local controller to transfer the information to an external
device, much the same way a disk controller in a desktop machine
would write a block on a disk drive. In this case, however, the
block of data is transferred over the interconnection network to
an I/O controller in the receiving node. That controller finds
room for the incoming message in its local memory and then
notifies the processor that a message has arrived.
To avoid tying up the computations while the communication is going on,
the ancient Intel Paragon had each PE contain two i860 processors.
The intent was that one handles communications alone and the other
handles computation, allowing the two to be overlapped.
In practice most people just used both as computational PE's.
As always happens in computer science when there are two
paradigms, each with complementary strengths, hybridization
efforts try to build systems that have the strengths of both.
A blurring of the distinction between shared and distributed memory
systems has always been around, and takes at least
three forms:
- Hybrid systems mix the two flavors of memory.
One form consists of an array of shared memory multiprocessors,
tied together with an ultrafast network. Another flavor is
to connect shared memory multiprocessors together with a
global memory system (as with the Cedar machine mentioned above),
separate from the different shared memories.
- Parallel languages such as High Performance Fortran (HPF),
Titanium, UPC, or HPC++ rely on compilers to map user data to different address
spaces, and then a runtime system manages the necessary message
passing for a user.
- "Distributed shared memory" systems have physically distributed
memory, but rely on a combination of operating system and hardware
to move address references where they are needed. Here the user
has a single logically shared address space, but accessing data
belonging to another processor can take significantly longer than
accessing it from the local memory - leading to the term NUMA,
or non-uniform memory access.
A historical example of a hybrid system was the SGI Origin 2000. It
used a 4D hypercube to connect up to 16 "nodes". Each node was a board with
two processors sharing a single memory. Systems with more than 16 nodes
were then tied together with a "Craylink Interconnect",
high speed links and routers connecting the hypercubes.
The user saw a single address space and did not need to explicitly partition
program data or write message passing code. However, the practical experience
with the machine was that to get good performance, the user needed to be aware
of and take an active role in locating data on the machine.
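The structure of a 4D hypercube is easy to express with bit operations. The sketch below assumes the usual textbook labeling, in which the 16 nodes are numbered 0 to 15 and two nodes are directly linked when their labels differ in exactly one bit; it is not a description of how the Origin 2000 actually numbered or routed between its nodes.

```python
def neighbors(node, dim=4):
    """Nodes directly linked to node in a dim-dimensional hypercube."""
    return [node ^ (1 << bit) for bit in range(dim)]

def hops(src, dst):
    """Minimum number of links between two nodes (the Hamming distance)."""
    return bin(src ^ dst).count("1")

print(neighbors(0))   # [1, 2, 4, 8]
print(hops(0, 15))    # 4
print(hops(5, 6))     # 2
```

Each node has only 4 links, yet no two of the 16 nodes are more than 4 hops apart. The varying hop count is one source of the nonuniform access times.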
- Started: Mon 24 Jan 2011, 07:38 AM
- Modified: Tue 21 Feb 2012, 07:59 PM to omit memory banks
- Last Modified: Tue 21 Feb 2012, 07:59 PM