Computational Kernels


As matrix-matrix multiply shows, using techniques to achieve high performance can cause the code to become dense and opaque. And as pointed out before, optimizing compilers cannot always do the work for us, either because the complexity of the code prevents it or because it requires changing the order of floating point operations, which can lead to different numerical results.

However, many scientific computations can be decomposed into small kernel operations, such as matrix-matrix multiplication. Examples include

Since kernels occur frequently, computer vendors can then provide highly optimized for their architectures. In many cases this means assembly language coded versions which fully utilize the computer's capabilities. Higher level algorithms can then be expressed in terms of those kernels, allowing high performance computations. However, this relies on your ability to recognize the kernels in your code. The prototype and original versions of kernel libraries are the BLAS for dense linear algebra.

The Basic Linear Algebra Subroutines

The BLAS (Basic Linear Algebra Subroutines) are the paleolithic computational kernel library. They are classified into three groups, depending on the type of operands involved. Reminder: the notation above follows the "Householder convention": Householder was one of the early founders of numerical linear algebra. Everyone found the convention he used in his book Principles of Numerical Analysis so useful it has become a de facto standard. Since the Greek letter set is not supported by all browsers, I usually spell them out.

Feeling a little guilty about not recognizing the linear algebra operations? You should know what all of the ones listed above do; if not you might check the Matrix Manual for any linear algebra terminology you don't recognize. (Unfortunately that is now a dead link, left here for now to see if I can track it down elsewhere.)


Choosing Which BLAS to Use

Knowing which BLAS will get good performance harks back to the basic idea that data movement needs to be minimized:
Whenever possible, use a BLAS routine with lower ratio of memory references to floating point operations.
That principle requires some explanation. Of course, there is no way to turn a dotproduct operation into a saxpy one. Instead, we have to seek opportunities to re-state algorithms so that in the innermost loop(s), they use a more favorable BLAS function. Also, clearly defining the number of memory references could depend upon the particular code. When analyzing the memory reference to flop count ratio, just use the mathematical statement of the problem, not the (nearly impossible to determine) memory references that a particular implementation has. This is like assuming that the machine has an infinite number of registers and only one level of memory. In that case, you are just counting up all the data from the vectors and matrices in the problem (to get the number of reads required), added to the size of the result (for the number of writes required).

For the flop count, assume the minimal possible to implement the algorithm without extra memory being used. That you can get by looking at a particular implementation. So in the ratio of m/f = memory references/flops, the numerator can be found from a code or algorithm statement, while the denominator can be found from a bald mathematical statement of the operation. The intent is to get a lower bound on the ratio, not an exact count.

This excludes "reduced order" methods like the Strassen algorithm for matrix-matrix multiplication. Those are completely different algorithms and give different results in finite precision arithmetic. More deadly, for the few scientific applications that involve dense (dense = mostly nonzero) matrices, the amount of memory is the limiting bottleneck and a library implementation that increases memory usage is usually unacceptable. Regardless, reduced order methods for matrix-matrix multiplication have their number of flops < O(n2.7) instead of the usual O(n3), and the asymptotic load-store analysis still holds.

Example: vector update y = y + alpha*x.

   for k = 1:n
      y(k) = y(k) + alpha*x(k)
   end for 
Since alpha is likely kept in a register (which even a brain-dead compiler will do), each iteration has two reads (x(k) and y(k)) and one write (y(k)) for three memory references. More correctly, we count the minimal number of memory references by looking at the mathematical operation. There are one scalar and two vectors to read, one to write, again giving the same thing: 3n+1 memory reference. Each iteration has an add and a multiply, giving a ratio of (mem ref)/(flops) = (3n+1)/2n.

Example: inner product alpha = xTy.

   alpha = 0.0
   for k = 1:n
      alpha = alpha + x(k)*y(k)
   end for 
This requires reading two vectors of length n and writing one scalar, giving 2n+1 memory references. Each iteration performs two flops, giving a ratio of (mem ref)/(flops) = (2n+1)/(2n-1) → 1. If x = y, that ratio is further reduced to 1/2.

Now consider matrix-vector product y = y + Ax implemented in two different ways:

Version 1:

for k = 1:n
   for i = 1:n
      y(k) = y(k) + A(k,i) * x(i)
   end for 
end for 

Version 2:

for i = 1:n
   for k = 1:n
       y(k) = y(k) + A(k,i) * x(i)
   end for 
end for 

Both versions perform the exact same amount of flops. Both versions also perform the summations which give an entry y(k) in the exact same order, so they are identical even in floating point arithmetic. Which one is preferable? To answer that, first we need to answer the question "What are the kernel operations expressed by the innermost loops?". Ignoring the row versus column storage issue for A, which is going to be faster? Why? How much faster might it be? The answer to this last question can tell you if it is worth the effort to convert the algorithm from one to another.

Actually, the matrix-vector product itself is a BLAS-2 operation, so probably it would not be implemented as a sequence of calls to a BLAS-1 kernel for reasons to soon be clear.


Counting "Memory References" in the BLAS

Repeating an important point above: How to get the number of memory references required in the BLAS? After all, we can always write a code to artificially increase the number of references, or do something really dumb, by adding in a lot of useless operations. This has actually been done in some research projects that managed to get published, and it can have a valid role in parallel computing where you want to trade off parallelism against duplicating operations (in parallel computing, duplicating a computation on all processors is usually much faster than computing on one and then sending the result to the other processors). The idea is to count the minimal number of memory references required and the minimal number of flops to do the operation, without extra memory being used.

Example: Matrix-vector product with update: y = y + A*x. You must

so the minimal number of memory references is 3n + n2. Each entry of A*x requires 2n-1 flops (n multiplies, n-1 additions) and adding it to y requires another flop. So the total flops needed are 2n2. That gives (mem ref)/flop = (3n + n2)/2n2, which asymptotically is mem/flop = limitn → infinity(3n+ n2)/2n2 → 1/2. Compare this to the two versions for computing the matrix-vector product above; what does it say about the optimal way of doing it?

Go through the same exercise for

For the matrix-matrix multiply, you can assume that the matrices involved are all n × n. Of all the BLAS mentioned, which one is the biggest winner in terms of (theoretically achievable) performance?

BLAS Example: Gaussian elimination

The immediate thought with this analysis is "who cares, I'll never want to multiply large matrices together". And that is correct; what scientists and engineers really need to do is to solve differential equations, solve integral equations, etc. It is in the implementation of those higher level goals that the BLAS occur. Being able to identify BLAS operations in a code, and choosing the best one for that operation, is the single most important skill needed. The load/store analysis is easiest for the linear algebra operations in the BLAS, but the same methodology can be used just about anywhere in a code. Load/store analysis tells you which BLAS are "better" than others, but spotting the BLAS in a code takes some effort. As a general rule of thumb, when you have

As an example, consider solving a linear system of equations A * x = b , where A is n × n, x and b are n × 1. A and b are given, and x is what needs to be found. The hoary old solution is Gaussian elimination. The naive version of this (e.g., as found in Numerical Recipes in C) is:

    for r = 1:n-1
        for c = r+1:n
            A(c,r) = A(c,r)/A(r,r)
        end
        for c = r+1:n
            for i = r+1:n
                A(c,i) = A(c,i) - A(c,r)*A(r,i)
            end 
        end 
    end

What BLAS-1 operation is the innermost loop? Slightly harder to recognize from looking at the code, what is the BLAS-2 operation expressed by the two nested inner loops?