How to Apply Architectural Knowledge in Code

In almost every computation a great variety of arrangements for the succession of the processes is possible, and various considerations must influence the selections amongst them for the purposes of a calculating engine. One essential object is to choose that arrangement which shall tend to reduce to a minimum the time necessary for completing the calculation.

From the world's first programmer, also known as the Countess of Lovelace (*).

Getting the performance benefit of kernels requires first identifying them in a code, then substituting calls to optimized library versions (or, in the worst case, writing optimized versions yourself). The analytical models provide a reasonably predictive estimate of whether or not it is worthwhile to pursue that path. Much more difficult, and much more rewarding, is to change the implementation of an algorithm so that it uses faster-running kernel operations. First, however, how do the BLAS get high performance? A load/store analysis may indicate that an implementation exists which gets good performance regardless of the cache size, but it does not indicate what that implementation would be. Writing in assembly language only goes so far - if an algorithm requires many memory references per floating point operation, the code cannot run fast even in assembly. Enhancing pipelining and data locality in a program can give a larger boost than writing in a low-level language.

Dense and sparse matrices

First, an important distinction is needed. A dense m×n matrix is one where most or all of its m*n elements are nonzero; usually this implies that it is best stored as a 2D array. A sparse matrix is one where the number of known zero entries, or their locations, make it worthwhile to use a different representation. For example, a tridiagonal matrix is better stored as three vectors with about n entries each than as an n×n array with n² entries. In addition to structured matrices like lower triangular or tridiagonal ones, a matrix where at most 9 entries per row are nonzero is also better represented using something other than a full n×n 2D array.
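
As a small, hedged illustration in the Matlab-style notation used later on this page: a tridiagonal matrix-vector product y = T*x needs only the three vectors, not the full n×n array. The vector names dl, d, and du (sub-, main, and super-diagonal) are just illustrative.

    % y = T*x for a tridiagonal T stored as three vectors:
    %   dl = subdiagonal (length n-1), d = diagonal (length n),
    %   du = superdiagonal (length n-1)
    y = zeros(n,1);
    y(1) = d(1)*x(1) + du(1)*x(2);
    for i = 2:n-1
        y(i) = dl(i-1)*x(i-1) + d(i)*x(i) + du(i)*x(i+1);
    end
    y(n) = dl(n-1)*x(n-1) + d(n)*x(n);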

Optimization techniques are relatively easy for dense arrays, as will be shown in detail for matrix-matrix multiply. The methods are equally important but more difficult for sparse matrix computations, or ones using more general data structures like graphs or trees. These computations typically involve indirect addressing of the form x(col(i)), where col() is a vector of indices (integers). In C/C++ the indirect addressing typically takes the form of multilevel pointer dereferencing, such as *x[i]. So for sparse matrices, the memory access pattern depends on the data itself and is not known until run time, which makes these optimizations harder for both the programmer and the compiler to apply.
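
As a hedged sketch of that indirect addressing, here is a sparse matrix-vector product y = A*x with A held in compressed sparse row form; the array names val, col, and rowptr are illustrative, not from any particular library.

    % Sparse y = A*x in compressed sparse row (CSR) form:
    %   val    - the nonzero values, stored row by row
    %   col    - the column index of each nonzero
    %   rowptr - row i's nonzeros sit in positions rowptr(i):rowptr(i+1)-1
    y = zeros(n,1);
    for i = 1:n
        temp = 0.0;
        for j = rowptr(i):rowptr(i+1)-1
            temp = temp + val(j)*x(col(j));   % indirect access x(col(j))
        end
        y(i) = temp;
    end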

In some network and graph analysis areas, a matrix may have widely varying numbers of nonzeros in different rows, and in those cases a hybrid data structure is used which splits the matrix as A = As + Ad, where As stores the sparse rows and Ad contains the dense rows. This technique has long been used in the linear programming community, going back at least to the 1960s.

For the examples of using cache locality and pipelining, only dense matrices are considered here. Extending load-store analysis to sparse matrices is straightforward, but deferred to later.

Optimization tradeoffs

Pipelining and data locality also require a tradeoff. Long vector lengths help amortize the start-up and draining costs of pipelines, but long vector lengths can also overflow the cache, causing more cache misses. When the two conflict, go with data locality. The reason: a cache miss typically costs about 10 times as much as a cache hit, so in the extreme case enhancing locality could make a code run 10 times faster. On the other hand, most pipelines in most CPUs have a depth of 2-3, so at most they can triple performance.

Sometimes optimizing compilers cannot do all of the heavy lifting. Partly this is because of the danger of reordering operations - floating point arithmetic is not exactly associative, so changing the order of operations can lead to different numerical results, which is undesirable in scientific computing (a key property of science is that results are at least theoretically reproducible). Another part is simply that the complexity of realistic application codes prevents many automatic optimizations.


Matrix-matrix Multiplication

The classical example of how to use data locality and pipelining ideas is the dense matrix-matrix multiplication operation C = A*B, where A, B, and C are n×n matrices stored in 2D arrays. This is implemented in the library routine dgemm(), which actually computes the more general operation C = α*A*B + β*C, with α and β scalars and with A, B, and C of sizes n1×n2, n2×n3, and n1×n3. For simplicity, the following assumes n1 = n2 = n3 = n.
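
In Matlab-style notation the operation dgemm() computes is just the line below; Matlab's own * operator typically dispatches to an optimized BLAS, so this one-liner is itself an example of calling a tuned library kernel instead of writing the loops by hand.

    % The generalized matrix-matrix product computed by dgemm:
    C = alpha*(A*B) + beta*C;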

Recall the key optimization idea: when a datum is loaded into the cache, do as many operations as possible with it before going on to the next datum. Secondarily, try to allow and enhance pipelining opportunities.


Standard Version of Matrix-Matrix Multiply

For a matrix-matrix product with n1 = n2 = n3 = n, the standard implementation is the triple-nested loop


      for i = 1:n
         for j = 1:n
            temp = 0.0
            for k = 1:n
               temp = temp + A(i,k)*B(k,j)
            end for
            C(i,j) = temp
         end for
      end for

Each outer (i,j) iteration computes a single entry of C as the inner product of row i of A and column j of B. Suppose that the matrices are stored in 2D arrays mapped in "column-major" order; that is, they are stored column-by-column in memory. Then each time a blue line in the picture below is crossed while computing C(i,j), a cache miss is generated:

[matmat1large.png]

For n larger than the cacheline size c, every access going across a row of the array A causes a cache miss, since column-major storage puts consecutive entries of a row n words apart in memory. For the array B, a cache miss occurs only every c entries while traversing down a column.


Matrix-matrix multiply with blocking

The theoretical ratio of the minimum possible memory references to floating point operations is 3n²/(2n³). A and B must be read and C written for the matrix-matrix product, giving m = 3n² memory references. Computing each element of C is a dot product of length n, which takes about 2n flops, so f ≈ 2n³ flops overall. Asymptotically the ratio r = m/f goes to zero, so some implementation exists which can get near peak performance. The usual technique for enhancing cache re-use is blocking: partition the matrices into sub-blocks, such that at least three small sub-blocks will fit into cache at the same time. If the sub-blocks are s×s, that means making sure 3s² is less than the cache size in double words. [In other computer science settings, blocking is called deferred updates or lazy evaluation.]
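
As a hedged back-of-the-envelope example, using the 1 Mbyte cache size that appears later on this page:

    % 1 Mbyte cache = 1048576/8 = 131072 double words; require 3*s^2 <= 131072
    s = floor(sqrt(131072/3))    % gives s = 209, so a blocksize near 200 fits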

Simply blocking the matrix operations does not give faster code. To do that, remove the worst data locality offender - the accesses to A that cause a cache miss on every consecutive access to elements in a row of A. The idea is to rearrange the computations so that when a sub-block of A is brought into the cache, it is kept there until all computations involving it have completed. By itself that won't help, since accessing across the rows of A's sub-block still causes a cache miss for every element. So copy the sub-block of A into a temporary array T of size s×s, transposing the sub-block during the copying process. To keep clear that s is a blocksize, the variable name "blocksize" is used instead of "s" in the following algorithm pseudo-code. The indexing convention is that

   for ii = 1:blocksize:n
means to go from 1 to n, in steps of size blocksize. A percent sign (%) is used as a comment line indicator. Then a matrix-matrix product code fragment (which assumes C has been initialized to zero, since each pass through the jj loop adds one block's contribution to a set of rows of C) is
    for ii = 1:blocksize:n
        for jj = 1:blocksize:n
            %--------------------------------------------------------
            % Copy over block (ii/blocksize,jj/blocksize) of A to T,
            % transposing it as it is copied.
            %--------------------------------------------------------
            for i = 1:min(blocksize, n-ii+1)
                for j = 1:min(blocksize, n-jj+1)
                    T(j,i) = A(ii+i-1, jj+j-1)
                end for 
            end for 

            %--------------------------------
            % For each column of C and B ...
            %--------------------------------
            for k = 1:n

                %---------------------------------------------------------
                % ... compute product of block (ii/blocksize,jj/blocksize)
                % of A with the k-th column of B
                %---------------------------------------------------------
                for i = 1:min(blocksize, n-ii+1)
                    temp = 0.0
                    for j = 1:min(blocksize, n-jj+1)
                        temp = temp + T(j,i)*B(jj+j-1, k)
                    end for
                    C(i+ii-1,k) = C(i+ii-1,k) + temp   % accumulate over the jj blocks
                end for
            end for 
        end for  % jj loop
    end for  % ii loop

with the corresponding picture:

[matmat2large.png]

The code above is almost runnable Matlab. It can be expressed more succinctly (and clearly) using Matlab index vectors - see the fuller Matlab function.
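
As a hedged sketch (not the full function), the copy-and-transpose step inside the ii and jj loops collapses to the lines below, where i2 and j2 are just shorthand introduced here:

    i2 = min(ii+blocksize-1, n);
    j2 = min(jj+blocksize-1, n);
    T = A(ii:i2, jj:j2)';      % copy and transpose the block in one step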


Adding in pipelining

Blocking enhances data locality, but not pipelining. To improve pipelining, the idea is to keep the same kind of blocking (and so retain its benefits), and do loop unrolling within the blocks. Here this is done by computing a small 2×2 sub-block of entries within a given block of C, but in general the unrolling depth should be greater than 2.
    for ii = 1:blocksize:n
        for jj = 1:blocksize:n
            %--------------------------------------------------------
            % Copy over block (ii/blocksize,jj/blocksize) of A to T,
            % transposing it as it is copied.
            %--------------------------------------------------------
            for i = 1:min(blocksize, n-ii+1)
                for j = 1:min(blocksize, n-jj+1)
                    T(j,i) = A(ii+i-1, jj+j-1)
                end for 
            end for 

            %------------------------------
            % Compute a 2x2 sub-block in C
            %------------------------------
            for k = 1:2:n
                for i = 1:2:min(blocksize, n-ii+1)
                    t11 = 0.0
                    t21 = 0.0
                    t12 = 0.0
                    t22 = 0.0
                    for j = 1:min(blocksize, n-jj+1)
                        t11 = t11 + T(j,i)*B(jj+j-1, k)
                        t21 = t21 + T(j,i+1)*B(jj+j-1, k)
                        t12 = t12 + T(j,i)*B(jj+j-1, k+1)
                        t22 = t22 + T(j,i+1)*B(jj+j-1, k+1)
                    end for
                    C(i+ii-1,k)   = C(i+ii-1,k)   + t11   % accumulate over the jj blocks
                    C(i+ii,k)     = C(i+ii,k)     + t21
                    C(i+ii-1,k+1) = C(i+ii-1,k+1) + t12
                    C(i+ii,k+1)   = C(i+ii,k+1)   + t22
                end for
            end for 
        end for  % jj loop
    end for  % ii loop

This code fragment assumes that the matrix order n and the blocksize are evenly divisible by 2, and (as before) that C has been initialized to zero. For general sizes it requires clean-up (AKA preconditioning) loops to be correct. As an informal exercise to help your understanding of the operations going on, you might try writing the version valid for matrices that are not of even size. Just be aware that there are several cases to consider for the temporary block T when mod(n,blocksize) is not 0.
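
As a hedged illustration of what a clean-up (preconditioning) loop looks like, here is one for a simpler case than the matrix-matrix code: a dot product unrolled to depth 2, where the leftover element of an odd-length vector is handled before the unrolled loop.

    % Dot product temp = x'*y with the loop unrolled by 2.
    temp = 0.0;
    first = 1;
    if mod(n,2) == 1
        temp = x(1)*y(1);    % clean-up (preconditioning) step for odd n
        first = 2;
    end
    for i = first:2:n
        temp = temp + x(i)*y(i) + x(i+1)*y(i+1);
    end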

Also note:

You may be better off going with the dumb triple-loop version plus a high level of compiler optimization than with the fancy blocked/unrolled one. Better yet, use the BLAS.


Performance effects of blocking and unrolling in matrix-matrix multiply

The effects of these different versions have been tested on several systems, and here are a few examples. The actual implementations are explained in excruciating detail elsewhere, but the one marked "3Loop" is the usual triple-nested loop. First, on an Intel Core 2 Duo, using a high level of optimization in the compiler:

[fours.png]

BTW: This and the other graphs on this page are huge so that you can extract more info from them, not because Bramley is older than dirt and doesn't know how to resize images. I actually do know how: I ask a student to do it.

Performance peters out around n = 360 for the 3Loop version. For that size of n, the data involved is

8*n² = 8*360² = 1036800 bytes

which closely matches the 1 Mbyte = 1048576 byte size of the second level cache on the machine used. The second level cache is a combined data and instruction cache, so the performance is expected to drop even before that size is actually reached. Some references to C(i,j) are also required outside the innermost loop, and those references consume a few additional cachelines.

The crucial feature is that the versions with blocking have performance not bound by the cache size, and the rate is sustained until the limits of memory are reached. For a system with 8 Gbytes of memory, the performance does not decrease until the 3 matrices are of order n ≥ 16000 (three matrices of that order take 3*16000²*8 ≈ 6.1 Gbytes).


Performance including BLAS tests

The BLAS library function for doing matrix-matrix multiplication is called dgemm(), and the next graph includes its performance curve in black, with the other four versions the same as plotted above.
[fives.png]

The optimized BLAS library on the system uses multiple cores, even if only one thread is specified during compilation. dgemm() does so well that the other four implementations are squashed together down at the bottom. However, using multiple cores doesn't give a fair comparison to implementations using only one.

The whole reason for modifying the implementation is to sustain good performance well past the cache size. A matrix order of 1500 corresponds to about 17 Mbytes of data per matrix, larger than the cache of the machine. The unrolled version retains those benefits and yields an extra 15-20 Mflop/second.

The blocked and unrolled version gets around 80% of the performance of the highly optimized one from the vendor. This is typical; about 80% of the theoretical peak performance of a machine can be obtained from a C/C++ or Fortran code for matrix-matrix operations. Subject to one parameter (the blocksize that lets three sub-blocks fit into cache), that C++ or Fortran code is portable across machines.

This behaviour is typical:

The best-effort version labeled "Block-Unroll" gets around 225 Mflop/sec, while the vendor-supplied version gets about 275 Mflop/sec. At least Block-Unroll is getting 82% of the performance, which ain't shabby. Again, the critical feature is that performance is sustained even when using more data than will fit into cache. [Can you roughly estimate the cache size from the last plot?]

The triple-nested loop version drops for n > 200, corresponding to a data size of

   (order of matrix)² * 8 bytes = 200² * 8 = 320000 bytes.

On older Sun machines results differ significantly. dgemm()'s performance is shown in black in the next plot.


Versions of matrix-matrix multiply on an older Sun workstation

[scrap.jpg]

The triple-nested loop version was terminated at a smaller value of n because it was taking forever. Trust me, it never got better. The version with blocking but no transpose of the A matrix gives performance identical to the dgemm() version, even following the jagged dips and rises almost point for point.


More modern platforms

The SGI and Sun systems are antiques (but useful for seeing how era-dependent the ideas about architectural features are). On a quad-core Intel i7 running at 2.8 GHz, the next results are for a larger set of matrix-matrix implementations. In addition to the ones described before, three additional ones come from using Fortran. The Fortran language provides intrinsic functions, which are available without linking in additional libraries or include files. Among them are dot_product() for dot products and matmul() for matrix-matrix multiplication; the actual implementation of those is up to the compiler vendor. The full set of methods tested is:
  1. TripleLoop: triple nested loop
  2. block: blocking
  3. block+transpose: blocking and transposing
  4. block+transpose+unroll: blocking, transposing, and unrolling
  5. MKL dgemm: MKL BLAS
  6. Fortran dgemm: calling a Fortran version of dgemm from Univ. Tenn
  7. Fortran dotp: calling Fortran's intrinsic dot_product() function
  8. Fortran matmul: calling Fortran's intrinsic matmul() function
Excluding dgemm() temporarily:

Versions of matrix-matrix multiply on charis

[odin.jpg]

The times were acquired with the Intel version 13.0 compiler with full optimization (the -O3 option).

Plotting the same results but including dgemm() gives:

Versions of matrix-matrix multiply on charis

[odin.jpg]

As usual with vendor-optimized BLAS, dgemm() does significantly better. One feature worth noticing, especially for computer scientists who love powers of 2: the performance of 7 of the 8 methods drops significantly when the matrix order is a power of 2. Both plots have vertical dotted lines at powers of two to show this, and it demonstrates the need to test a range of matrix sizes, not just multiples of 4 or powers of 2.


Finally, results on a 64 bit Opteron using the PGI compiler and library:

Versions of matrix-matrix multiply on odin cluster

[odin.jpg]

PGI must have used an unoptimized triple-nested loop for their dgemm(), because it is down near the bottom with the 3Loop version. Again the blocked versions sustain performance well beyond the cache size on the different machines, which is what the BLAS load-store analysis indicates is possible.

Interesting research in this area has developed libraries for kernels like matrix-matrix multiplication which adaptively tune their parameters for whatever architecture and application they are used with. Such systems include ATLAS at Tennessee and PHiPAC at Berkeley. Also, because of the unique capabilities of C++, some projects provide "self-optimizing libraries" which find the correct parameters dynamically at run time, at the cost of some overhead.

Summary ideas:

Dense matrix-matrix multiplication is used by few modern research codes in scientific computing. So why on earth trudge through this too-long web page? Because dense matrix-matrix multiplication is a best-case problem: if it cannot achieve high performance, then more interesting computational kernels won't either. Most importantly, the techniques used for getting and modeling performance apply far more widely than shown here, but the linear algebraic operations provide the clearest examples of using them.


(*) For more about Ada Lovelace, it's more than worthwhile to read the introduction to her and Charles Babbage documented at the 2Dgoggles site. The story there is true, at least up until she tries to tweet using her fan ....