Introduction to Parallel Computing
Modern research applications continue to challenge computers with the
size and amount of computation they require. Such applications include
weather modeling, climate modeling, fusion energy simulations, and
crashworthiness testing of proposed automobiles.
All other aspects
of computing must be matched to the computation rate: I/O, memory sizes,
archival storage and retrieval, visualization, and networks. The principles
and techniques covered in this class will in many cases also apply to those
parts of large-scale applications. The most central principle has held
from the beginning of digital computing: calculations on data are not the
expensive part or even the limiting factor; the movement of data is.
For serial (single processor) applications that refers to moving data
between hard drives, main memory, caches, and the processors. Parallel
computing adds another layer to this data hierarchy, since with few
exceptions it requires moving data from one machine to another during a
single application.
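A rough worked estimate can make the data-movement point concrete. The sketch below compares the arithmetic time and the memory-traffic time for a simple vector update, y[i] = a*x[i] + y[i]. The two machine rates are hypothetical round numbers chosen only for illustration, and the model ignores caches and any overlap of computation with memory access.

```python
# Back-of-the-envelope comparison of compute time and data-movement time
# for the vector update y[i] = a * x[i] + y[i] on n double-precision values.
# The two machine rates below are hypothetical round numbers, not
# measurements of any real system.

PEAK_FLOPS = 100e9       # assumed peak arithmetic rate, flops per second
MEM_BANDWIDTH = 50e9     # assumed memory bandwidth, bytes per second
BYTES_PER_DOUBLE = 8

def update_costs(n):
    """Return (compute_seconds, data_movement_seconds) for one update."""
    flops = 2 * n                           # one multiply and one add per element
    bytes_moved = 3 * n * BYTES_PER_DOUBLE  # read x[i], read y[i], write y[i]
    return flops / PEAK_FLOPS, bytes_moved / MEM_BANDWIDTH

compute_time, movement_time = update_costs(10**8)
print(f"compute:       {compute_time:.4f} s")
print(f"data movement: {movement_time:.4f} s")
```

Under these assumed rates the memory traffic takes roughly 24 times longer than the arithmetic, so the processor would spend most of its time waiting for data.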
The traditional computer model is a "von Neumann machine": instructions
are executed sequentially in a repeated cycle:
- Instruction fetch and decode
- Addresses of operands calculated
- Operands fetched from memory
- Operation performed using operands
- Results written back to memory
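The cycle above can be sketched as a toy interpreter. This is only an illustration: the instruction format (op, src1, src2, dest), the opcode names, and the memory layout are invented here and do not correspond to any real instruction set. Program and data share one memory, and the five steps are marked in the comments.

```python
# Toy von Neumann machine: instructions and data live in the same memory
# (a Python list). The instruction format and opcodes are hypothetical.

def run(memory, pc=0):
    """Execute instructions one at a time, starting at address pc, until HALT."""
    while True:
        # 1. Instruction fetch and decode
        op, src1, src2, dest = memory[pc]
        if op == "HALT":
            return memory
        # 2. Addresses of operands calculated (direct addressing: nothing to add)
        addr1, addr2 = src1, src2
        # 3. Operands fetched from memory
        x, y = memory[addr1], memory[addr2]
        # 4. Operation performed using operands
        if op == "ADD":
            result = x + y
        elif op == "MUL":
            result = x * y
        else:
            raise ValueError(f"unknown opcode {op!r}")
        # 5. Results written back to memory
        memory[dest] = result
        pc += 1  # strictly sequential: move on to the next instruction

memory = [
    ("ADD", 4, 5, 6),    # address 0: memory[6] = memory[4] + memory[5]
    ("MUL", 6, 6, 7),    # address 1: memory[7] = memory[6] * memory[6]
    ("HALT", 0, 0, 0),   # address 2: stop
    None,                # address 3: unused
    2, 3,                # addresses 4 and 5: input data
    0, 0,                # addresses 6 and 7: results
]
run(memory)
print(memory[6], memory[7])  # prints "5 25"
```

Nothing in this loop overlaps: each instruction finishes all five steps before the next one is fetched, which is the behavior that the forms of parallelism discussed next try to relax.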
Hardware improvements continue to increase speed exponentially,
but cutting-edge, large-scale applications still require harnessing multiple
systems operating simultaneously. Further improvements can come from
parallelism, which takes several forms:
- multiple functional units, e.g., separate add and multiply units that
  can be simultaneously active, or integer arithmetic units that can be
  active at the same time as the floating point units. Modern processors
  can now routinely carry out a memory operation, an integer arithmetic
  operation, at least one floating point operation, and a branch operation
  all in the same clock cycle (superscalar architecture). This is a form of
  instruction-level parallelism; it relies upon the compiler to discover
  instructions in a program that can be run simultaneously, and then to
  schedule those instructions to fully utilize processor resources.
- pipelining and chaining (vector machines)
- overlapping instructions (VLIW)
- multiple interleaved memory systems to keep up with CPUs; recall that the
  fundamental idea from P573 was that the principal bottleneck in scientific
  (and most other) computing is in getting data to and from the processor.
- multiple CPUs working together.
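The last form in the list, several CPUs cooperating on one problem, usually follows a pattern of splitting the data, computing partial results independently, and combining them. The sketch below shows only that structure. The function names are made up for illustration, and Python threads stand in for CPUs: in standard CPython they will not speed up arithmetic like this, so a real run would substitute processes or separate machines for the thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_bounds(n, workers):
    """Split range(n) into `workers` contiguous, nearly equal [start, stop) pieces."""
    base, extra = divmod(n, workers)
    bounds, start = [], 0
    for w in range(workers):
        stop = start + base + (1 if w < extra else 0)  # first `extra` chunks get one more
        bounds.append((start, stop))
        start = stop
    return bounds

def partial_sum_of_squares(data, start, stop):
    """Work done by one worker: it touches only its own slice of the data."""
    return sum(x * x for x in data[start:stop])

def parallel_sum_of_squares(data, workers=4):
    """Partition the data, compute partial sums concurrently, then combine them."""
    bounds = chunk_bounds(len(data), workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(partial_sum_of_squares, data, s, e) for s, e in bounds]
        return sum(f.result() for f in futures)  # the combine step is serial

data = list(range(1000))
print(parallel_sum_of_squares(data) == sum(x * x for x in data))  # prints "True"
```

Even in this small sketch the workers must be handed their data and must hand back results, which is the extra layer of data movement that parallel computing adds.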
Parallelism was introduced early on in machines, particularly for allowing
memory accesses to be overlapped with instruction execution.
Effective use of parallelism requires an integrated CS approach, involving:
- sophisticated compilers for resource allocation (for pipelining and chaining)
  and instruction scheduling (for superscalar).
- languages to express parallelism
- new algorithms that exploit or introduce parallelism into a problem.
- debugging and performance analysis tools
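As a small illustration of an algorithm that introduces parallelism into a problem, compare two ways of summing a list. The running-total loop forms a chain in which every addition waits for the previous one, while a pairwise (tree) reduction groups the same additions into rounds whose operations are mutually independent. The code below is a serial sketch that only counts those rounds; the function names are invented for this example.

```python
def sequential_sum(values):
    """Running total: each addition depends on the result of the one before it."""
    total = 0
    for v in values:
        total += v
    return total

def tree_sum(values):
    """Pairwise reduction. Returns (sum, rounds).

    All additions within one round are independent of each other, so with
    enough processors each round could be carried out in a single step.
    """
    level = list(values)
    rounds = 0
    while len(level) > 1:
        # Add neighbouring pairs; no pair needs another pair's result.
        next_level = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            next_level.append(level[-1])  # odd element carried to the next round
        level = next_level
        rounds += 1
    return (level[0] if level else 0), rounds

values = list(range(1, 9))       # 1, 2, ..., 8
print(sequential_sum(values))    # prints "36"
print(tree_sum(values))          # prints "(36, 3)"
```

For 8 values the running total needs a chain of 8 dependent additions, while the tree needs only 3 rounds. For floating point data the two orderings can give slightly different rounded results, which is one reason such restructuring is treated as a change of algorithm rather than something a compiler does automatically.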
First some partitioning of the design space of parallel architectures
is needed.