Basic Architecture Ideas for High Performance
Getting good performance from codes, or just understanding bad
performance when it occurs, requires some understanding of basic
computer architecture. Computer architecture can be a deep subject
which goes into the details of machine organization and its interaction
with hardware. Fortunately, scientific computing requires only three
general ideas, all of which have held throughout the
history of electronic computers:
- Data locality
- Pipelining
- Parallelism
Those concepts are the basis of all high performance architecture.
Here is a rough outline of what we need:
- Generalities
- Memory systems
  - CPU speed vs. memory access speed
  - Memory banks
  - Memory hierarchies
  - Cache model and examples
- Pipelining
  - Instruction pipelining
  - Loop unrolling
  - Flop and memory pipes
- Enhancing data locality and pipelining in codes
Memory banks were not covered in class, but are mentioned here because
of their importance for vector machines ... which may be making a
comeback.
General Concepts
A computer's architecture is a high-level framework for the components
making up a computer system and their interconnection. The features
important for scientific computing include the memory system, the bus
structure, the internal CPU design, and the I/O systems. For parallel
machines this extends to the interconnection network (topology) among
the processors.
You need to understand enough architecture to
- Know how to map particular algorithms to a machine.
- Develop implementations of algorithms suitable for the machine.
Scientific computations involve large amounts of data.
Consider climate modeling, a modern application:
given latitude, longitude,
elevation, and time, we want to model the temperature, pressure, humidity,
and velocity of the air as time goes on. We might also propagate chemical
species, since those have an effect on weather.
Discretize this by laying down a 1 km x 1 km mesh, with 11 mesh
points in the vertical direction. This gives ~ 5 x 10^9 mesh points.
Output for a single time value
can require ~ 4.5 x 10^10 doubles. Note that the
data can be arranged in a large multidimensional array.
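The sizes above can be checked with a few lines of arithmetic. This is only a rough sketch: the surface area of the Earth (about 5.1 x 10^8 square km) and the 8-byte size of a double are assumptions supplied here, not values from the notes, and the variable names are purely illustrative.

```python
# Rough size estimate for the climate mesh described above.
EARTH_SURFACE_KM2 = 5.1e8       # assumption: approximate surface area of the Earth
VERTICAL_LEVELS = 11            # from the text: 11 mesh points in the vertical
DOUBLES_PER_SNAPSHOT = 4.5e10   # from the text: output for a single time value
BYTES_PER_DOUBLE = 8            # assumption: 64-bit IEEE doubles

# One mesh column per square km of surface, 11 points per column.
mesh_points = EARTH_SURFACE_KM2 * VERTICAL_LEVELS

# Storage needed to hold one time value of output.
snapshot_bytes = DOUBLES_PER_SNAPSHOT * BYTES_PER_DOUBLE

# Number of stored values per mesh point implied by the two figures.
values_per_point = DOUBLES_PER_SNAPSHOT / mesh_points

print(f"mesh points:           {mesh_points:.1e}")
print(f"snapshot size:         {snapshot_bytes / 1e9:.0f} GB")
print(f"values per mesh point: {values_per_point:.1f}")
```

Under these assumptions a single time value occupies a few hundred gigabytes, which is why moving the data, not operating on it, dominates the cost.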
The primary bottleneck in scientific computing is in moving data,
not operating on data.
We will analyze this claim
using some small kernel operations on vectors and arrays,
e.g.,
- dot product: alpha = x^T y
- daxpy or vector update: y = y + alpha x
- matrix-matrix multiplication: C = C + A*B.
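The three kernels can be written out as plain loops. The following is a minimal sketch in Python, not the pseudo-code used in these notes; the function names `dot`, `daxpy`, and `matmul_update` and the list-of-rows matrix storage are illustrative choices.

```python
def dot(x, y):
    """Dot product: returns alpha = x^T y."""
    alpha = 0.0
    for i in range(len(x)):
        alpha += x[i] * y[i]
    return alpha


def daxpy(alpha, x, y):
    """Vector update: overwrites y with y + alpha*x."""
    for i in range(len(x)):
        y[i] += alpha * x[i]


def matmul_update(C, A, B):
    """Matrix update: overwrites C with C + A*B (matrices as lists of rows)."""
    n, m, p = len(A), len(B), len(B[0])
    for i in range(n):
        for j in range(p):
            s = C[i][j]
            for k in range(m):
                s += A[i][k] * B[k][j]
            C[i][j] = s


# Small usage example.
x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
print(dot(x, y))

daxpy(2.0, x, y)
print(y)

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[0.0, 0.0], [0.0, 0.0]]
matmul_update(C, A, B)
print(C)
```

For vectors of length n, the dot product and the vector update each perform 2n floating point operations while touching 2n to 3n memory locations. The n x n matrix update performs 2n^3 operations on only 3n^2 data items, so it offers far more opportunity to reuse data once it has been moved.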
These operations are more important than they might first appear: they
account for a large fraction of machine cycles in computational science
and engineering. The pseudo-code used for examples assumes:
The first architectural aspect to consider is effective use of a
memory hierarchy.
- Started: 03 Jan 1996
- Updated: Wed 05 Sep 2007 to fix some malapropisms
- Modified: Wed 16 Sep 2009, 07:32 PM re-organization
- Last Modified: Wed 16 Sep 2009, 07:33 PM