Computational Kernels
As matrix-matrix multiply shows, using techniques
to achieve high performance can cause the code to become dense and
opaque. And as pointed out before, optimizing compilers cannot
always do the work for us, either because the complexity of the code
prevents it or because it requires changing the order of floating
point operations, which can lead to different numerical results.
However, many scientific computations can be decomposed
into small kernel operations,
such as matrix-matrix multiplication. Examples include
- Dense linear algebra (dense means vectors and matrices which have
mainly nonzero values)
- Sparse linear algebra (which means vectors and matrices with most
entries equal to zero)
- FFT: fast Fourier transforms
- STL: standard template library, which includes several standard
data structures (stacks, queues, trees, etc) and algorithms on them
Since kernels occur frequently,
computer vendors can then provide highly optimized for their architectures.
In many cases this means assembly language coded versions which
fully utilize the computer's capabilities.
Higher level algorithms can then be expressed
in terms of those kernels, allowing high performance computations.
However, this
relies on your ability to recognize the kernels in your code.
The prototype and original versions of kernel libraries are the BLAS
for dense linear algebra.
The Basic Linear Algebra Subroutines
The BLAS (Basic Linear Algebra Subroutines) are
the paleolithic computational kernel library.
They are classified into three groups, depending on the type of
operands involved.
- BLAS Level 1: vector-vector operations
- Vector copy: x = y
- Vector update (saxpy): y = y + alpha*x
- Inner product: alpha = xTy
- Two norm of a vector: alpha = sqrt{xTx}
- Vector scaling: x = alpha*x
- BLAS Level 2: matrix-vector operations
- Matrix-vector product: y = alpha y + beta A*x
- Transpose matrix-vector product:
y = alpha y + beta AT*x
- Rank-1 update A = A + x*yT.
- Triangular solve: Solve Lx = b or Ux = b for x,
where L is a lower triangular nxn matrix and b is nx1.
- BLAS Level 3: matrix-matrix operation
- Matrix product: C = alpha C + beta*A*B
- Triangular solve with multiple right hand side
vectors: solve LX = B or UX = B for X, where L or U is nxn
triangular and B is nxk. This is mathematically the
same as solving the sequence of ordinary triangular solves
Lxi = bi, i = 1, ..., k, where
xi is the i-th column of X and bi
is the i-th column of B.
Reminder:
the notation above follows the "Householder convention":
- Greek letters indicate scalars,
- small Roman letters indicate vectors,
- and capital Roman letters indicate matrices.
Householder was one of the early founders of numerical linear algebra.
Everyone found the convention he used in his book
Principles of Numerical Analysis so useful it has become a
de facto standard. Since the Greek letter set is not supported by all
browsers, I usually spell them out.
Feeling a little guilty about not recognizing the linear algebra
operations? You should know what all of the ones listed above
do; if not you might check the
Matrix Manual
for any linear algebra terminology you don't recognize.
(Unfortunately that is now a dead link, left here for now to see
if I can track it down elsewhere.)
Choosing Which BLAS to Use
Knowing which BLAS will get good performance harks back to the
basic idea that data movement needs to be minimized:
Whenever possible, use a BLAS routine with lower ratio of memory
references to floating point operations.
That principle requires some explanation.
Of course, there is no way to turn a dotproduct operation into a
saxpy one. Instead, we have to seek opportunities to re-state algorithms
so that in the innermost loop(s), they use a more favorable BLAS function.
Also, clearly defining the number of memory
references could depend upon the particular code. When analyzing the
memory reference to flop count ratio, just use the mathematical statement
of the problem, not the (nearly impossible to determine) memory references
that a particular implementation has. This is like assuming that the
machine has an infinite number of registers and only one level of memory.
In that case, you are just counting up all the data from the vectors and
matrices in the problem (to get the number of reads required), added to
the size of the result (for the number of writes required).
For the flop count, assume the minimal possible to implement the algorithm
without extra memory being used. That you can get by looking at a
particular implementation. So in the ratio of m/f = memory references/flops,
the numerator can be found from a code or algorithm statement, while
the denominator can be found from a bald mathematical statement of the
operation. The intent is to get a lower bound on the ratio, not
an exact count.
This excludes "reduced order"
methods like the Strassen algorithm for matrix-matrix multiplication.
Those are completely different algorithms and
give different results in finite precision arithmetic. More deadly,
for the few scientific applications that involve dense (dense = mostly
nonzero) matrices, the amount of memory is the limiting bottleneck and
a library implementation that increases memory usage is usually unacceptable.
Regardless, reduced order methods for matrix-matrix multiplication have
their number of flops < O(n2.7)
instead of the usual O(n3), and the asymptotic load-store
analysis still holds.
Example: vector update y = y + alpha*x.
for k = 1:n
y(k) = y(k) + alpha*x(k)
end for
Since alpha is likely kept in a register (which even a brain-dead compiler
will do),
each iteration has two reads (x(k) and y(k)) and one write (y(k)) for
three memory references. More correctly, we count the minimal number
of memory references by looking at the mathematical operation. There
are one scalar and two vectors to read, one to write,
again giving the same thing:
3n+1 memory reference. Each iteration has an add and a multiply, giving
a ratio of (mem ref)/(flops) = (3n+1)/2n.
Example: inner product alpha = xTy.
alpha = 0.0
for k = 1:n
alpha = alpha + x(k)*y(k)
end for
This requires reading two vectors of length n and writing one scalar, giving
2n+1 memory references.
Each iteration performs two flops, giving a
ratio of (mem ref)/(flops) = (2n+1)/(2n-1) → 1.
If x = y, that ratio is further reduced to 1/2.
Now consider matrix-vector product y = y + Ax
implemented in two different ways:
Version 1:
for k = 1:n
for i = 1:n
y(k) = y(k) + A(k,i) * x(i)
end for
end for
Version 2:
for i = 1:n
for k = 1:n
y(k) = y(k) + A(k,i) * x(i)
end for
end for
Both versions perform the exact same amount of flops.
Both versions also perform the summations which give
an entry y(k) in the exact same order, so they are identical
even in floating point arithmetic. Which one is preferable?
To answer that, first we need to answer the question
"What are the kernel operations expressed by the innermost loops?".
Ignoring the row versus column storage issue for A, which is going
to be faster? Why? How much faster might it be? The answer to
this last question can tell you if it is worth the effort to
convert the algorithm from one to another.
Actually, the matrix-vector product itself is a BLAS-2
operation, so probably it would not be implemented as a sequence
of calls to a BLAS-1 kernel for reasons to soon be clear.
Counting "Memory References" in the BLAS
Repeating an important point above:
How to get the number of memory references required in the BLAS?
After all, we can always write a code to
artificially increase the number of references, or do something
really dumb, by adding in a lot of useless operations.
This has actually been done in some research projects that managed to
get published, and it can have a valid role in parallel computing where
you want to trade off parallelism against duplicating operations
(in parallel computing, duplicating a computation on all processors
is usually much faster than computing on one and then sending the
result to the other processors).
The idea is to count the minimal number of memory references
required and the minimal number of flops to do the operation,
without extra memory being used.
Example: Matrix-vector product with update: y = y + A*x.
You must
- read A (n2 memory reads)
- read x (n memory reads)
- read y (n memory reads)
- write y (n memory reads)
so the minimal number of memory references is 3n + n2.
Each entry of A*x requires 2n-1 flops (n multiplies, n-1 additions)
and adding it to y requires another flop. So the total flops
needed are 2n2. That
gives (mem ref)/flop = (3n + n2)/2n2, which
asymptotically is mem/flop = limitn → infinity(3n+ n2)/2n2 → 1/2.
Compare this to the two versions
for computing the matrix-vector product above; what does it say
about the optimal way of doing it?
Go through the same exercise for
- the rank-1 update of a matrix: A = A + u*vT where A is a
n x n matrix and u, v are n x 1 vectors.
- solving a lower triangular system Lx = b for vector x, given n x n L
and n x 1 b.
- matrix-matrix multiply: C = C + A*B
For the matrix-matrix multiply, you can assume that the matrices involved
are all n × n. Of all
the BLAS mentioned, which one is the biggest winner in terms
of (theoretically achievable) performance?
BLAS Example: Gaussian elimination
The immediate thought with this analysis is "who cares, I'll never want to multiply
large matrices together". And that is correct; what scientists and engineers really need
to do is to solve differential equations, solve integral equations, etc. It is in the
implementation
of those higher level goals that the BLAS occur.
Being able to identify BLAS operations in a code, and choosing the best one for
that operation, is the single most important skill needed.
The load/store analysis is easiest for the linear algebra operations in the BLAS,
but the same methodology can be used just about anywhere in a code.
Load/store analysis tells you which BLAS are "better" than others, but spotting the
BLAS in a code takes some effort. As a general rule of thumb, when you have
- A loop, use BLAS-1 operations
- Doubly nested loops, use BLAS-2 operations
- Triply nested loops, use BLAS-3 operations
As an example, consider solving a linear system of equations
A * x = b ,
where A is n × n, x and b are n × 1.
A and b are given, and x is what needs to be found.
The hoary old solution is
Gaussian elimination.
The naive version of this (e.g., as found in
Numerical Recipes in C) is:
for r = 1:n-1
for c = r+1:n
A(c,r) = A(c,r)/A(r,r)
end
for c = r+1:n
for i = r+1:n
A(c,i) = A(c,i) - A(c,r)*A(r,i)
end
end
end
What BLAS-1 operation is the innermost loop? Slightly harder to recognize
from looking at the code, what is the BLAS-2 operation expressed
by the two nested inner loops?
- Started: Fri Jan 30 10:33:05 EST 2004
- Updated: Wed Jan 19 08:11:34 EST 2005 to eliminate redundant refs to Strassen
- Updated: Sun 09 Sep 2007, to clarify memory reference counting
- Updated: Tue 11 Sep 2007, to restate how to count mem refs
- Modified: Thu 29 Oct 2009, 12:59 PM to prettify and clarify the
mem ref to flop ratio determination.
- Last Modified: Thu 29 Oct 2009, 01:26 PM