Example of Using Architecture Ideas


The performance benefit of kernels requires that you first identify them in a code, then substitute calls to optimized library versions. The analytical benefit is that you can get a decent estimate of whether or not it is worthwhile to pursue that path. Much more difficult, and much more rewarding, is to change the implementation of an algorithm so that it will use a faster-running kernel operation. First, however, how do the BLAS get high performance? Using assembler language only goes so far - if you're still stuck with many memory references per floating point operation, the code still won't run fast even in assembly.

Enhancing pipelining and data locality in a program actually gives a larger boost than writing in low-level languages. The techniques are relatively easy for dense arrays, as will be shown in detail for matrix-matrix multiply. They are equally important but more difficult for sparse matrix computations, or ones using more general data structures like graphs or trees. These computations typically involve lots of indirect addressing of the form x(col(i)), where col() is a vector of indices. The indirect addressing typically takes the form of multilevel pointer dereferencing in C/C++, such as *x[i]. Then

There is often a tradeoff between pipelining and data locality. Long vector lengths help amortize the start-up and draining costs of pipelines, but long vector lengths can lead to cache overflow, causing more cache misses. If you have to choose between the two, go with data locality. The reason: typically, a cache miss costs 10 times as much as a cache hit, so in the extreme case enhancing locality could make your code run 10 times faster. On the other hand, most pipelines in workstation functional units have a depth of 2-3, so at most they can triple your performance.

Keep in mind also that optimizing compilers don't always do the work for us. Part of this is the danger of reordering operations - changing the order of associative operations can lead to different numerical results, which is undesirable in scientific computing (remember, a key defining property of science is that results are reproducible). Another part is simply the complexity of realistic applications codes prevents many auto-optimizations.


Matrix-matrix Multiplication

The classical example of how to use data locality and pipelining ideas is the matrix-matrix multiplication operation C = A*B, where A, B, and C are n x n matrices. This is implemented on virtually all workstations and high performance architectures in a library routine dgemm, which actually computes the more general operation C = alpha*A*B + beta*C, with alpha and beta scalars and Recall the key optimization idea: when you bring a datum into cache, do as many operations as possible with it before going on to next datum. Simultaneously, try to allow pipelining.

Standard Version of Matrix-Matrix Multiply

Now consider the usual triple nested loop for a matrix-matrix product, with n1 = n2 = n3 = n:

      for i = 1:n
         for j = 1:n
            temp = 0.0d0
            for k = 1:n
               temp = temp + A(i,k)*B(k,j)
            end for
            C(i,j) = temp
         end for
      end for
Each outer (i,j) iteration computes a single entry of C as the inner product of row i of A and column j of B. Suppose that the matrices are stored in 2D arrays, and they are mapped in "column-major" order; that is, they are stored column-by-column in memory. Then each time you cross a blue line below while computing C(i,j), a cache miss is generated:

[matmat1large.png]

For n larger than what fits into the cacheline size, every access to the array A by rows will cause a cache miss.


Matrix-matrix multiply with blocking

Clearly we can do better, because the theoretical ratio of memory references to floating point operations is 2n3/3n2, so there is some implementation guaranteed to get near peak performance. The usual technique is to use blocking: partition the matrices into subblocks, such that at least three small subblocks will fit into cache at the same time. If the subblocks are s x s, that means making sure 3s2 is less than the cache size in double words.

Simply blocking the matrices does not give you faster code. To do that, remove the worst data locality criminal - the accesses to A. The idea is to rearrange the computations so that when a subblock of A is brought into the cache, you keep it there until all computations involving it have completed. By itself that won't help, since accessing across the rows of A's subblock will still cause cache misses. So copy the subblock of A into a temporary array T of size s x s, and transpose the subblock during that copying process. To keep clear that s is a blocksize, use the variable name "blocksize" instead of "s". The indexing convention used here is that

   for ii = 1:blocksize:n
means to go from 1 to n, insteps of blocksize. That gives the code fragment
   for ii = 1:blocksize:n
      for jj = 1:blocksize:n
%        ------------------------------------------------------
%        Copy over block (ii/blocksize,jj/blocksize) of A to T,
%        transposing it as it is copied.
%        ------------------------------------------------------
         for i = 1:min(blocksize, n-ii+1)
            for j = 1:min(blocksize, n-jj+1)
               T(j,i) = A(ii+i-1, jj+j-1)
            end for 
         end for 

%        ------------------------------
%        For each column of C and B ...
%        ------------------------------
         for k = 1:n

%        --------------------------------------------------------
%        ... compute product of block (ii/blocksize,jj/blocksize)
%        of A with the k-th column of B
%        --------------------------------------------------------
            for i = 1:min(blocksize, n-ii+1)
               temp = 0.0d0
               for j = 1:min(blocksize, n-jj+1)
                 temp = temp + T(j,i)*B(jj+j-1, k)
               end for
               C(i+ii-1,k) = temp
            end for
         end for 

      end for  % jj loop
   end for  % ii loop
with the corresponding picture:

[matmat2large.png]

The code above is almost runnable Matlab. It can be expressed more succintly using Matlab index vectors - see the fuller Matlab function. Blocking enhances data locality, but not pipelining. To improve pipelining, the idea is to keep the same kind of blocking (and so retain its benefits), and do loop unrolling within the blocks. We can do this by computing a small 2x2 sub-block of entries within a given block of C:

   for ii = 1:blocksize:n
      for jj = 1:blocksize:n
%        ------------------------------------------------------
%        Copy over block (ii/blocksize,jj/blocksize) of A to T,
%        transposing it as it is copied.
%        ------------------------------------------------------
         for i = 1:min(blocksize, n-ii+1)
            for j = 1:min(blocksize, n-jj+1)
               T(j,i) = a(ii+i-1, jj+j-1)
            end for 
         end for 

%        ---------------------------
%        Compute a 2x2 subblock in C
%        ---------------------------
         for k = 1:2:n
            for i = 1:2:min(blocksize, n-ii+1)
               t11 = 0.0d0
               t21 = 0.0d0
               t12 = 0.0d0
               t22 = 0.0d0
               for j = 1:min(blocksize, n-jj+1)
                   t11 = t11 + T(j,i)*b(jj+j-1, k)
                   t21 = t21 + T(j,i+1)*b(jj+j-1, k)
                   t12 = t12 + T(j,i)*b(jj+j-1, k+1)
                   t22 = t22 + T(j,i+1)*b(jj+j-1, k+1)
               end for
               c(i+ii-1,k)   = t11
               c(i+ii,k)     = t21
               c(i+ii-1,k+1) = t12
               c(i+ii,k+1)   = t22
            end for
         end for 
The code fragment above assumes that the matrix order n is evenly divisible by 2. It requires clean up (AKA preconditioning) loops to be correct. As an informal exercise to help your understanding of the operations going on, you might try writing the version valid for matrices that are not of even size. Just beware that there are several cases to consider for the temporary block T, however when mod(n,blocksize) is not 0.

Also note:

You may be better off to go with the dumb triple-loop version than with the fancy blocked/unrolled one, and then use a high level of compiler optimization. Better yet, use the BLAS.


Results of blocking and unrolling in matrix-matrix multiply

The effects of these differing versions have been tested on several systems, and here are a few examples. The actual implementations are explained in excruciatingly detail, but the one marked "3Loop" is the usual triple-nested loop. First, on an Intel Core 2 Duo, using a high level of optimization in the compiler:

[fours.png]
[This and the other graphs on this page are hugh so that you can extract some info from them].

Performance peters out around n = 360 for the 3Loop version. For that size of n, the data involved is

8*(n2) = 8*(3602) = 1036800 bytes.

which closely matches the 1 Mbyte = 1048576 byte size of the second level cache on the machine used. The second level cache is a combined data and instruction cache, so we would expect the performance to drop slightly before that size is actually reached. There are also the references to C(i,j) in the code that are outside the innermost loop but still would involve some cachelines being brought in.

There is a library function for doing matrix-matrix multiplication on every architecture - including laptops running the Linux operating system. It is called dgemm(), and its performance is shown in black with the other four versions.

[fives.png]

The optimized BLAS library on the system uses multiple cores, even if you specify that only one thread should be run. It does so well that the other four implementations are squashed together down at the bottom. However, using multiple cores doesn't give a fair comparison to implementations using only one.

Note that the blocked version has good performance, sustained well past the cache size (a matrix order of 1500 corresponds to data of about 17 Mbytes). The unrolled version retains those benefits, and gets an extra 15-20 Mflop boost as well.

The blocked, unrolled version gets around 80% of the performance of the highly optimized one from the vendor. This is typical; you can achieve about 80% of the peak performance of the machine you are using from a C/C++ or Fortran code. And subject to one parameter (the blocksize that lets three subblocks fit into cache), that C or Fortran code is portable across machines.

Since the multicore versus single core issue helps cloud the results, here's results from other systems:


Results on Sparc chips running the Solaris OS:

[nations.jpg]

On older Sun machines results differ significantly:

Versions of matrix-matrix multiply on an older Sun workstation

[scrap.jpg]
Notice that the version with blocking but no transpose of the A matrix gives identical results to the dgemm version, even following the jagged dips and rises almost point for point.

Finally, the results on a (slightly) more modern processor, the 64 bit Opteron using the PGI compiler and library:

Versions of matrix-matrix multiply on odin cluster

[odin.jpg]

The feature to note here is that PGI must have used a unoptimized triple-loop version for their dgemm, because it is down near the bottom with 3loop. Most importantly in all of these cases the blocked versions get sustained performance well beyond the cache size on the different machines. This is what the BLAS load-store analysis tells us is possible.

An interesting recent direction of research in this area is for libraries of algorithms like matrix-matrix multiplication which adaptively optimize such parameters for whatever architecture and application they are being used with. Such systems include ATLAS at Tennessee and PhiPAC at Berkeley. Also, because of unique capabilities in C++ some projects target "self-optimizing libraries" which dynamically at run-time find the correct parameters with minimal overhead.

Dense matrix-matrix multiplication is of little research interest. So why on earth did we spend a too-long web page on it? Dense matrix-matrix multiplication is a "best-case" problem, in that if you cannot get high performance with it, then your approach is probably not going help in other, more interesting, computational kernels. More importantly, the techniques used for getting and modeling performance, are useful more widely, not just in linear algebraic operations.


Next page: Using the BLAS in a code