Basic Parallel Program Modeling
Suppose that a given program is to be run on a multiprocessor
system. The following introduces some basic terminology
and ideas related to parallel computing, but at the cost of
ignoring many real-world issues. Let
- t1 be the time required to run the
program on one processor
- tp be the time required to run the
program on p processors
- Sp = t1/tp is
called the parallel speedup.
- Ep = Sp/p is
called the parallel efficiency.
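As a minimal sketch of these two definitions, the following computes speedup and efficiency from a pair of timings. The function names and the sample timings (100 s on one processor, 16 s on 8 processors) are made up for illustration.

```python
def speedup(t1: float, tp: float) -> float:
    """Parallel speedup Sp = t1 / tp."""
    if t1 <= 0 or tp <= 0:
        raise ValueError("timings must be positive")
    return t1 / tp


def efficiency(t1: float, tp: float, p: int) -> float:
    """Parallel efficiency Ep = Sp / p."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return speedup(t1, tp) / p


# Hypothetical timings: 100 s on one processor, 16 s on 8 processors.
print(speedup(100.0, 16.0))        # 6.25
print(efficiency(100.0, 16.0, 8))  # 0.78125
```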
Now some warnings.
- Measured "speedups" vary from machine to
machine, depend strongly on the compiler and system technology
applied, and can vary widely from one run to another. Unless
speedup is measured carefully, it is a useless number.
- In earlier (more innocent and honest) days,
speedup for a problem was defined as the best
uniprocessor algorithm time divided by the multiprocessor algorithm
time. The distinction is simple: sometimes we have to change
algorithms to get a parallel code, often at the price of using
a less sophisticated or slower-converging algorithm. The hope
is that the gain from using multiple processors outweighs the
loss from using less efficient algorithms. The reported speedup
is usually inflated if
t1 is measured by simply running the multiprocessor
code on one processor, because that t1 is longer than the best
uniprocessor time. Even when using the same algorithm,
typically the code will contain synchronization and other
time-consuming
features which are unnecessary in the uniprocessor case.
So when someone claims "parallel efficiency of 80%", be sure
to check carefully to see how the uniprocessor time was obtained;
if that information is not available you can typically disregard
any speedup numbers.
- On the surface, it seems that 1 <= Sp <= p.
The bound 1 <= Sp comes from knowing that the code can be run
on one processor, so the speedup should be at least 1, right?
The bound Sp <= p says that if you use p processors,
the best you can get is a speedup of p.
However, both bounds can be and routinely are violated:
- Sp < 1 occurs when the costs
(synchronization, communication) of
parallelism exceed the benefits. This is common for codes which have
been run through a "parallelizing compiler".
- Sp > p can occur if the algorithm being implemented is
nondeterministic, and the parallel version actually does less work.
An example would be a branch-and-bound algorithm in which the parallel
code succeeds in pruning more branches than the sequential one.
- Sp > p also can occur because typically
bringing in more processors
also means bringing in more caches, so you have a faster aggregate
memory system working on the problem.
When Sp > p, the
problem is said to have achieved superlinear speedup.
[For the mathematicians: this is different from the superlinear
convergence rate of variable metric methods in optimization.]
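The first warning above says speedup is useless unless measured carefully. One common precaution, sketched here under the assumption that wall-clock time is the quantity of interest, is to repeat each run and report a robust statistic such as the median. The helper name median_runtime and the placeholder workload are hypothetical; the real serial or parallel program would be substituted for work().

```python
import statistics
import time


def median_runtime(fn, repeats: int = 5) -> float:
    """Run fn() `repeats` times and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def work():
    # Placeholder workload; substitute the real serial or parallel run here.
    sum(i * i for i in range(100_000))


print(f"median of 5 runs: {median_runtime(work):.6f} s")
```

Timing both t1 and tp this way, on the same machine and with the same compiler settings, at least removes run-to-run noise from the ratio.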
Amdahl's Law
Ware's model of Amdahl's law says if
a computation is performed in two modes that run at different
rates of execution, the net speed depends on the percentage of
the computation in each mode. That's not a big surprise.
If f is the fraction of a
program that can be run in parallel (and so 1-f is the serial
or sequential fraction), then
t = f*tp + (1-f)*t1
and the minimum rate is r1 = 1/t1, the maximum rate is
rp = 1/tp, and the net rate of execution is r = 1/t.
Some simple algebra shows that the maximum realizable gain
achieved by parallelism is
g = r/rp = tp/t = 1/[f + (1-f)R]
where R = rp/r1 is the ratio of
execution rates. The figure below shows the gain for values
of R equal to 2, 10, and 20:
[Those curves were generated from the mathematical model, not from
actual experimental timings.]
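A rough sketch of the gain formula g = 1/[f + (1-f)R] follows, evaluated for the same three values of R. The function names are made up here; fraction_for_gain simply solves the formula for f, giving f = (R - 1/g)/(R - 1).

```python
def gain(f: float, R: float) -> float:
    """Fraction of the maximum rate achieved: g = 1 / (f + (1 - f) * R)."""
    return 1.0 / (f + (1.0 - f) * R)


def fraction_for_gain(g: float, R: float) -> float:
    """Parallel fraction f needed to reach gain g (requires R > 1)."""
    return (R - 1.0 / g) / (R - 1.0)


for R in (2.0, 10.0, 20.0):
    row = "  ".join(f"g({f:.2f})={gain(f, R):.3f}" for f in (0.0, 0.5, 0.9, 1.0))
    print(f"R={R:4.0f}: {row}")

# Parallel fraction needed for a gain of 50%.
for R in (2.0, 10.0, 20.0):
    print(f"R={R:4.0f}: f = {fraction_for_gain(0.5, R):.3f}")
```

Under this model the fraction needed for a 50% gain is about 0.889 for R = 10 and about 0.947 for R = 20, and it creeps toward 1 as R grows.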
Making the generous assumption that p processors together run
p times faster than one processor,
to get 50% effective gain you will need to have over 90%
of your code running in parallel. Under those assumptions,
the speedup achievable is
Sp = t1/[(1-f)t1 + f(t1/p)] = p/[(1-f)p + f]
and so the parallel efficiency is
Ep = 1/[1 + (1-f)(p-1)]
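To see how quickly these two formulas bite, here is a small sketch that tabulates them. The choice f = 0.95 and the list of processor counts are arbitrary illustrations, not measurements.

```python
def amdahl_speedup(f: float, p: int) -> float:
    """Sp = p / ((1 - f) * p + f), with f the parallel fraction."""
    return p / ((1.0 - f) * p + f)


def amdahl_efficiency(f: float, p: int) -> float:
    """Ep = 1 / (1 + (1 - f) * (p - 1))."""
    return 1.0 / (1.0 + (1.0 - f) * (p - 1))


f = 0.95  # hypothetical parallel fraction
for p in (1, 2, 4, 16, 64, 256, 1024):
    print(f"p={p:5d}  Sp={amdahl_speedup(f, p):7.3f}  Ep={amdahl_efficiency(f, p):6.3f}")

# However large p gets, Sp stays below 1 / (1 - f), which is 20 for f = 0.95.
```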
This model assumes that the overhead due to synchronization,
interprocessor communication, locks on shared resources, etc.
remains fixed regardless of the number of processors.
In practice that overhead grows with p, so we would have to add
another term to the model, one that
is a monotonically increasing function of p. What is the
minimal big-O growth characteristic that function
must have?
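One way to experiment with such a term is sketched below. The overhead is passed in as a function and expressed in units of t1. The particular choice c*log2(p) with c = 0.01 is purely an assumption for illustration, not an answer to the question above.

```python
import math


def speedup_with_overhead(f: float, p: int, overhead) -> float:
    """Speedup when overhead(p), in units of t1, is added to the parallel time."""
    return 1.0 / ((1.0 - f) + f / p + overhead(p))


def log_overhead(p: int, c: float = 0.01) -> float:
    # Hypothetical overhead model: c * log2(p) times the one-processor time.
    return c * math.log2(p)


f = 0.95  # hypothetical parallel fraction
for p in (1, 4, 16, 64, 256, 1024):
    print(f"p={p:5d}  Sp={speedup_with_overhead(f, p, log_overhead):6.3f}")
```

With these made-up numbers the speedup peaks near p = 64 and then declines, so adding processors eventually makes the run slower.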
Gustafson proposed a different model in the 1980s,
starting with the parallel time as the base unit. That model is
less pessimistic.
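A sketch of the usual statement of Gustafson's model: if f is the fraction of the parallel run time spent in parallel work, the scaled speedup is (1-f) + f*p. Note that f is measured on the p-processor run here, not on the one-processor run as in Amdahl's model, so the side-by-side numbers below only illustrate how differently the two formulas behave.

```python
def amdahl_speedup(f: float, p: int) -> float:
    """Amdahl: f is the parallel fraction of the one-processor run."""
    return p / ((1.0 - f) * p + f)


def gustafson_speedup(f: float, p: int) -> float:
    """Gustafson: f is the parallel fraction of the p-processor run."""
    return (1.0 - f) + f * p


f = 0.95  # hypothetical fraction, interpreted differently by each model
for p in (1, 16, 256, 1024):
    print(f"p={p:5d}  Amdahl={amdahl_speedup(f, p):8.2f}  "
          f"Gustafson={gustafson_speedup(f, p):8.2f}")
```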
Either model is inappropriate
much of the time because it assumes the parallel fraction of
a program remains fixed. Algorithmic research in parallel
computing consists of trying to "break the law". An approach
that is often fruitful uses John Rice's idea of a
polyalgorithm, where the algorithm used is changed
as the problem size or machine used changes.
Scalability
The term "scalability" is the most often abused term in parallel
and distributed computing, and it means different things to different
people. A
"scalable algorithm" is one for which the parallel efficiency remains
uniformly
bounded away from 0 for all values of p: there is a constant a, not
dependent on p, such that
Ep > a > 0,
as p increases. Amdahl's Law basically says there are no
scalable algorithms,
except for perfectly parallel ones - but if an algorithm is
completely parallel, then really it is simply processing
totally decoupled problems, something usually called
multiprogramming.
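The claim about Amdahl's Law follows directly from the efficiency formula given earlier. As a short worked derivation, with a the bound from the definition above and f < 1 the parallel fraction:

```latex
\[
  E_p = \frac{1}{1 + (1-f)(p-1)} \longrightarrow 0
  \quad \text{as } p \to \infty \qquad (f < 1),
\]
\[
  E_p < a
  \iff 1 + (1-f)(p-1) > \frac{1}{a}
  \iff p > 1 + \frac{1/a - 1}{1 - f}.
\]
```

For example, with f = 0.99 and a = 0.5 the efficiency drops below 50% as soon as p exceeds 101, so no fixed positive a can work for all p.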
A problem with this definition of scalability is that it assumes
there will always be machines with more and more processors.
However, there is good reason to believe that no one will ever
build a machine with more than 10^60 processors.
And quantum computers don't count here - they work with a form
of massive parallelism for which general programming methodologies are
not yet available.
Another problem occurs. Consider adding two vectors of length
n. That is a completely parallel operation ... but when the
number of processors p exceeds n, nothing is gained from adding
more processors. So
a variant on scalability is to use "scaled speedup",
where the problem size is allowed to grow with the number of
processors used, normally in such a way that the total memory
required per processor remains fixed.
Larger machines are typically used to give the capability to
solve larger problems, not just to
run the same-sized problem faster. A vector update of length 32 may be
run on a 32 processor machine with perfect parallelism, but
on 64 processors the efficiency is at best 50%. To handle this,
it is handy to let the problem size grow with the number of processors,
so that the algorithm can keep all of them busy. One way is to fix
the amount of memory per processor at M, and
as more processors are brought to bear the problem size n(p) is increased
to keep M constant. This leads to the concept of scaled speedup and
scaled parallel efficiency.
[Sometimes this is called "weak scalability".]
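The vector-update example can be made concrete with a small sketch. It assumes n independent element updates of equal cost, split as evenly as possible over p processors, and ignores all overhead. The function name is made up for illustration.

```python
def best_case_efficiency(n: int, p: int) -> float:
    """Upper bound on efficiency for n equal-cost independent updates on p processors."""
    steps = (n + p - 1) // p   # ceil(n / p): elements handled by the busiest processor
    return n / (p * steps)     # speedup n / steps, divided by p


for p in (8, 32, 64, 128):
    print(f"n=32  p={p:4d}  best-case efficiency={best_case_efficiency(32, p):.3f}")
```

For n = 32 this gives 1.0 up to p = 32, then 0.5 at p = 64 and 0.25 at p = 128, which is the drop-off described above.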
Execution time T(p, n) becomes a function of both p, the number of processors, and
n = n(p), the size of the problem, which is itself a function of p.
Define the scaled speedup as
S(n(p)) = T(1,n(p))/T(p,n(p))
and scaled efficiency as
E(n(p)) = T(1,n(p))/(T(p,n(p))*p)
[Compare this to the definitions at the top of the page.]
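The two scaled definitions can be sketched as functions that take a timing function T(p, n) and a growth rule n(p). Both model_time and grow below are hypothetical stand-ins; in practice T would come from measurements.

```python
import math


def scaled_speedup(T, n_of_p, p: int) -> float:
    """S(n(p)) = T(1, n(p)) / T(p, n(p))."""
    n = n_of_p(p)
    return T(1, n) / T(p, n)


def scaled_efficiency(T, n_of_p, p: int) -> float:
    """E(n(p)) = T(1, n(p)) / (T(p, n(p)) * p)."""
    return scaled_speedup(T, n_of_p, p) / p


def model_time(p: int, n: int) -> float:
    # Hypothetical cost model: perfectly divided work plus a log2(p) overhead.
    return n / p + math.log2(p)


def grow(p: int) -> int:
    # Hypothetical growth rule: 1000 units of work per processor.
    return 1000 * p


for p in (1, 8, 64):
    print(p, scaled_speedup(model_time, grow, p), scaled_efficiency(model_time, grow, p))
```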
For a matrix*matrix multiplication update C = C + A*B, where all
of the matrices are square of order N = N(p), N
should satisfy 3*N^2 = p*M,
where M is the (fixed) amount of memory used per processor.
This means the order of the matrix needs to grow as sqrt(p) as
p increases.
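As a quick check of the sqrt(p) growth, this sketch computes the largest matrix order N that satisfies 3*N^2 <= p*M. It assumes M is counted in matrix elements rather than bytes, and the value M = 3,000,000 is made up so the numbers come out round.

```python
import math


def matrix_order(p: int, M: int) -> int:
    """Largest integer N with 3 * N**2 <= p * M (M in matrix elements per processor)."""
    return math.isqrt(p * M // 3)


M = 3_000_000  # hypothetical memory per processor, in matrix elements
for p in (1, 4, 16, 64):
    print(f"p={p:3d}  N={matrix_order(p, M)}")
```

Each fourfold increase in p doubles N, as expected from N = sqrt(p*M/3).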
The scaled speedup is a good measure whenever it is possible; it helps
factor out the problem of additional memory being brought to bear on
a problem, eliminating one spurious source of superlinear speedups.
However, it is not always possible or practical to (re)define the
problem size in this way. For example, solving a PDE on a mesh
would require refining the mesh. Such refinements, however, may
introduce new physical effects like aliasing with deleterious
(or, oppositely, advantageous!)
effects on the PDE solver.
- Started: Tue Jan 22 13:11:38 EST 2002
- Modified: 26 Jan 2009