Basic Parallel Program Modeling


Suppose that a given program is to be run on a multiprocessor system. The following introduces some basic terminology and ideas related to parallel computing, but at the cost of ignoring many real-world issues. Now some warnings.

Amdahl's Law

Ware's model of Amdahl's law says that if a computation is performed in two modes that run at different rates of execution, the net speed depends on the percentage of the computation in each mode. That's not a big surprise. If f is the fraction of a program that can be run in parallel (and so 1-f is the serial or sequential fraction), then
t = f t_p + (1-f) t_1
where t_1 and t_p are the times to run the whole computation entirely in serial mode and entirely in parallel mode. The minimum rate is r_1 = 1/t_1, the maximum rate is r_p = 1/t_p, and the net rate of execution is r = 1/t.

Some simple algebra shows that the maximum realizable gain achieved by parallelism is

g = r/r_p = t_p/t = 1/[f + (1-f) R]
where R = r_p/r_1 is the ratio of execution rates. The figure below shows the gain for values of R equal to 2, 10, and 20:

[Figure: Amdahl's law gain curves for R = 2, 10, and 20]

[Those curves were generated from the mathematical model, not from actual experimental timings.] Making the generous assumption that p processors together run p times faster than one processor (so R = p), to get 50% effective gain you will need to have roughly 90% of your code running in parallel: about 89% for p = 10 and 95% for p = 20. Under those assumptions, the speedup achievable is

S_p = t_1/[(1-f) t_1 + f (t_1/p)] = p/[(1-f) p + f]
and so the parallel efficiency is
E_p = S_p/p = 1/[1 + (1-f)(p-1)]
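
The formulas above are easy to tabulate. What follows is a minimal Python sketch, assuming as in the text that p processors give a rate ratio R = p; the function names are illustrative and do not come from any library. It also inverts the gain formula to find the parallel fraction needed for a target gain.

```python
def ware_gain(f, R):
    """Fraction of the peak rate achieved: g = 1 / (f + (1 - f) * R)."""
    return 1.0 / (f + (1.0 - f) * R)


def speedup(f, p):
    """Amdahl speedup S_p = p / ((1 - f) * p + f)."""
    return p / ((1.0 - f) * p + f)


def efficiency(f, p):
    """Parallel efficiency E_p = S_p / p = 1 / (1 + (1 - f) * (p - 1))."""
    return 1.0 / (1.0 + (1.0 - f) * (p - 1))


def fraction_for_gain(g, R):
    """Parallel fraction f needed to reach gain g at rate ratio R (requires R > 1)."""
    # Solve g = 1 / (f + (1 - f) * R) for f.
    return (R - 1.0 / g) / (R - 1.0)


# Parallel fraction needed for a 50% gain at the three R values used in the figure.
for R in (2, 10, 20):
    print(R, round(fraction_for_gain(0.5, R), 3))
```

With R = p the gain g and the efficiency E_p are the same quantity, which is why the 50% gain threshold can be read as a 50% efficiency threshold.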

This model assumes that the overhead due to synchronization, interprocessor communication, locks on shared resources, etc. remains fixed regardless of the number of processors. In practice that overhead grows with p, so a more realistic model would need another term that is a monotonically increasing function of p. What is the minimal big-O growth characteristic that function must have?

Gustafson proposed in the 1980s a different model, starting with the parallel time as the base unit. That model is less pessimistic.
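
For comparison, Gustafson's scaled speedup is commonly written as S = (1 - f') + f' p, where f' is the parallel fraction of the run time measured on the p-processor machine. The symbol f' and the function names below are introduced here for illustration; note that f' is not the same quantity as the f used in Amdahl's law, so the side-by-side numbers are only a rough contrast of the two models.

```python
def amdahl_speedup(f, p):
    """Fixed-size speedup: f is the parallel fraction of the one-processor run."""
    return p / ((1.0 - f) * p + f)


def gustafson_speedup(f_par, p):
    """Scaled speedup: f_par is the parallel fraction of the p-processor run."""
    return (1.0 - f_par) + f_par * p


# Same nominal fraction (0.95) plugged into both models.
for p in (10, 100, 1000):
    print(p, round(amdahl_speedup(0.95, p), 2), round(gustafson_speedup(0.95, p), 2))
```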

Either model is inappropriate much of the time because it assumes the parallel fraction of a program remains fixed. Algorithmic research in parallel computing consists of trying to "break the law". An approach that is often fruitful uses John Rice's idea of a polyalgorithm, where the algorithm used is changed as the problem size or machine used changes.


Scalability

The term "scalability" is the most often abused term in parallel and distributed computing, and it means different things to different people. A "scalable algorithm" is one for which the parallel efficiency remains uniformly bounded away from 0 for all values of p: there is a constant a, not dependent on p, such that
E_p > a > 0,
as p increases. Amdahl's Law basically says there are no scalable algorithms, except for perfectly parallel ones - but if an algorithm is completely parallel, then really it is simply processing totally decoupled problems, something usually called multiprogramming. A problem with this definition of scalability is that it assumes there will always be machines with more and more processors. However, there is good reason to believe that no one will ever build a machine with more than 10^60 processors. And quantum computers don't count here - they work with a form of massive parallelism for which general programming methodologies are not yet available.
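
To see why Amdahl's Law rules out scalability in this sense, one can ask how many processors a program with fixed parallel fraction f can use before its efficiency drops below a chosen bound a. From E_p = 1/[1 + (1-f)(p-1)], the condition E_p > a is equivalent to p < 1 + (1/a - 1)/(1-f). The sketch below is illustrative only; the function names are not from any library, and it assumes 0 < a < 1.

```python
import math


def efficiency(f, p):
    """Amdahl parallel efficiency E_p = 1 / (1 + (1 - f) * (p - 1))."""
    return 1.0 / (1.0 + (1.0 - f) * (p - 1))


def max_processors(f, a):
    """Largest p with E_p > a, or None when f == 1 (every p qualifies).

    Assumes 0 < a < 1. When the bound is an exact integer, floating-point
    rounding may shift the result by one.
    """
    if f >= 1.0:
        return None
    bound = 1.0 + (1.0 / a - 1.0) / (1.0 - f)  # E_p > a  <=>  p < bound
    return max(1, math.ceil(bound) - 1)


# The usable processor count stays finite for every f < 1.
for f in (0.9, 0.95, 0.999):
    print(f, max_processors(f, 0.3))
```

For any f < 1 the returned processor count is finite, so no single constant a works for all p.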

Another problem occurs. Consider adding two vectors of length n. That is a completely parallel operation, but once the number of processors p exceeds n, nothing is gained from adding more processors. A vector update of length 32 may run on a 32-processor machine with perfect parallelism, but on 64 processors the efficiency is at best 50%. Larger machines are typically used to give the capability to solve larger problems, not just to run the same-sized problem faster. So a variant on scalability is to let the problem size grow with the number of processors, so that the algorithm can keep all of them busy. One way is to fix the amount of memory per processor at M, and as more processors are brought to bear, increase the problem size n(p) to keep M constant. This leads to the concept of scaled speedup and scaled parallel efficiency. [Sometimes this is called "weak scalability".]
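
The vector example can be made concrete with a small sketch. It assumes an idealized machine with one time unit per element and no communication cost, so the value returned is only an upper bound on efficiency; the function name is illustrative.

```python
def best_case_efficiency(n, p):
    """Upper bound on the efficiency of a length-n elementwise update on p processors."""
    # The busiest processor handles ceil(n / p) elements, one time unit each.
    steps = -(-n // p)  # integer ceil(n / p)
    # Serial time is n units, parallel time is `steps` units on p processors.
    return n / (steps * p)


# Fixed problem size: efficiency falls once p exceeds n.
print(best_case_efficiency(32, 32), best_case_efficiency(32, 64))

# Problem size grown with p (a fixed 1000 elements per processor).
print([best_case_efficiency(1000 * p, p) for p in (32, 64, 128)])
```

In the second case the bound stays at 1 for every p, which is the point of letting n grow with p.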

Execution time T(p, n) becomes a function of both p, the number of processors, and n = n(p), the size of the problem, which is itself a function of p. Define the scaled speedup as

S(n(p)) = T(1,n(p))/T(p,n(p)) 
and scaled efficiency as
E(n(p)) = T(1,n(p))/(T(p,n(p))*p)
[Compare this to the definitions at the top of the page.] For a matrix*matrix multiplication update C = C + A*B, where all of the matrices are square of order N = N(p), the three N-by-N matrices must fill the total memory, so N should satisfy 3*N^2 = p*M, where M is the (fixed) amount of memory used per processor. This means the order of the matrix needs to grow as sqrt(p) as p increases.
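
A minimal sketch of these definitions for the matrix example follows. The sizing rule 3*N^2 = p*M is from the text. The timing function model_time is purely hypothetical (2*N^3/p units of arithmetic plus an invented overhead of 10*N^2 units when p > 1) and stands in for measured timings; all names are illustrative.

```python
import math


def matrix_order(p, M):
    """Largest integer N with 3 * N**2 <= p * M (M = words of memory per processor)."""
    return math.isqrt((p * M) // 3)


def model_time(p, N):
    """HYPOTHETICAL cost model, not a measurement: arithmetic plus parallel overhead."""
    overhead = 10.0 * N**2 if p > 1 else 0.0
    return 2.0 * N**3 / p + overhead


def scaled_speedup(T, p, n_of_p):
    """S(n(p)) = T(1, n(p)) / T(p, n(p))."""
    n = n_of_p(p)
    return T(1, n) / T(p, n)


def scaled_efficiency(T, p, n_of_p):
    """E(n(p)) = T(1, n(p)) / (T(p, n(p)) * p)."""
    return scaled_speedup(T, p, n_of_p) / p


M = 3_000_000  # assumed memory per processor, in matrix entries
for p in (1, 4, 16, 64):
    N = matrix_order(p, M)
    print(p, N, round(scaled_efficiency(model_time, p, lambda q: matrix_order(q, M)), 3))
```

Quadrupling p doubles N, matching the sqrt(p) growth stated above. In practice T(1, n(p)) may have to be estimated, since the scaled problem need not fit in one processor's memory.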

The scaled speedup is a good measure whenever it is possible; it helps factor out the problem of additional memory being brought to bear on a problem, eliminating one spurious source of superlinear speedups. However, it is not always possible or practical to (re)define the problem size in this way. For example, solving a PDE on a mesh would require refining the mesh. Such refinements, however, may introduce new physical effects like aliasing with deleterious (or, conversely, advantageous!) effects on the PDE solver.