In almost every computation a great variety of arrangements for the succession of the processes is possible, and various considerations must influence the selections amongst them for the purposes of a calculating engine. One essential object is to choose that arrangement which shall tend to reduce to a minimum the time necessary for completing the calculation.
From the world's first programmer, also known as the Countess of Lovelace(*).
First, an important distinction is needed. A dense m×n matrix is one where most or all of its m×n elements are non-zero. Usually this implies that it is best stored as a 2D array. A sparse matrix is one where the number of known zero entries, or their locations, makes it worthwhile to use a different representation. For example, a tridiagonal matrix is better stored as three vectors with n entries each than as an n×n array with n² entries. In addition to structured matrices like lower triangular or tridiagonal ones, a matrix where at most 9 entries per row are nonzero is also better represented using something other than a full n×n 2D array.
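To make the tridiagonal example concrete, here is a small sketch (not from the original page; the function name and the three-vector layout are illustrative choices) of a matrix-vector product using the three-vector storage, in Python:

```python
# Sketch: tridiagonal matrix-vector product y = A*x, with A stored as
# three vectors (sub-, main, and super-diagonal) instead of an n-by-n
# array. Storage drops from n^2 entries to about 3n.
def tridiag_matvec(lower, diag, upper, x):
    n = len(diag)
    y = [0.0] * n
    for i in range(n):
        y[i] = diag[i] * x[i]
        if i > 0:
            y[i] += lower[i - 1] * x[i - 1]   # sub-diagonal term
        if i < n - 1:
            y[i] += upper[i] * x[i + 1]       # super-diagonal term
    return y
```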
Optimization techniques are relatively easy for dense arrays, as will be shown in detail for matrix-matrix multiply. The methods are equally important but more difficult for sparse matrix computations, or ones using more general data structures like graphs or trees. These computations typically involve indirect addressing of the form x(col(i)), where col() is a vector of indices (integers). In C/C++ the indirect addressing takes the form of multilevel pointer dereferencing, such as *x[i]. So for sparse matrices, the memory access pattern is not known until run time, which makes the optimizations below harder to apply.
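The indirect addressing pattern x(col(i)) can be sketched in Python (the function and array names here are illustrative, not from the page); this is the inner loop of a typical sparse row-times-vector product:

```python
# Sketch of indirect addressing: a sparse dot product where "val" holds
# the nonzero values of a matrix row and "col" holds their column
# indices. The gather x[col[i]] is the indirect access that compilers
# and prefetchers cannot analyze at compile time.
def sparse_row_dot(val, col, x):
    s = 0.0
    for i in range(len(val)):
        s += val[i] * x[col[i]]
    return s
```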
For the examples of using cache locality and pipelining, only dense matrices are considered here. Extending load-store analysis to sparse matrices is straightforward, but deferred to later.
Pipelining and data locality also require a tradeoff. Long vector lengths help amortize the start-up and drain costs of pipelines, but they can also lead to cache overflow, causing more cache misses. In choosing between the two, go with data locality. The reason: typically a cache miss costs 10 times as much as a cache hit, so in the extreme case enhancing locality could make a code run 10 times faster. On the other hand, most pipelines in most CPUs have a depth of 2-3, so at most they can triple performance.
Sometimes optimizing compilers cannot do all of the heavy lifting. Partly this is because of the danger of reordering operations: changing the order of associative operations can lead to different numerical results, which is undesirable in scientific computing (a key property of science is that results are at least theoretically reproducible). Another part is that the sheer complexity of realistic application codes prevents many automatic optimizations.
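The reordering danger is easy to demonstrate (this example is mine, not from the page): floating-point addition is associative in exact arithmetic but not in finite precision, so a compiler that reassociates a sum can change the answer.

```python
# Floating-point addition is not associative: regrouping this sum
# changes the result, which is why compilers are conservative about
# reordering operations unless told otherwise (e.g. -ffast-math).
a, b, c = 1e16, -1e16, 1.0
left = (a + b) + c    # a + b is exactly 0.0, so this gives 1.0
right = a + (b + c)   # b + c rounds back to -1e16, so this gives 0.0
```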
Recall the key optimization idea: when a datum is loaded into the cache, do as many operations as possible with it before going on to the next datum. Secondarily, try to allow and enhance pipelining opportunities.
For a matrix-matrix product with n1 = n2 = n3 = n, the standard implementation gives a triple-nested loop
for i = 1:n
   for j = 1:n
      temp = 0.0
      for k = 1:n
         temp = temp + A(i,k)*B(k,j)
      end for
      C(i,j) = temp
   end for
end for
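The same triple-nested loop, transcribed to plain Python (lists of lists, 0-based indexing) so it can be run directly; square n×n matrices are assumed, as in the pseudo-code:

```python
# Standard triple-nested loop matrix-matrix product C = A*B for
# square n-by-n matrices stored as lists of lists.
def matmul_3loop(A, B):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            temp = 0.0
            for k in range(n):
                temp += A[i][k] * B[k][j]
            C[i][j] = temp
    return C
```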
For n larger than the cacheline size c, every access going across a row of the array A causes a cache miss. For the array B, a cache miss occurs every c entries while traversing down a column.
Simply blocking the matrix operations does not by itself give faster code. To get that, remove the worst data locality criminal: the accesses to A that cause a cache miss on every consecutive access to elements in a row of A. The idea is to rearrange the computations so that when a sub-block of A is brought into the cache, it is kept there until all computations involving it have completed. By itself that won't help, since accessing across the rows of A's sub-block still causes a cache miss for every element. So copy the sub-block of A into a temporary array T of size s×s, transposing the sub-block during the copying process. To keep clear that s is a blocksize, the variable name "blocksize" is used instead of "s" in the following algorithm pseudo-code. The indexing convention is that
for ii = 1:blocksize:n means to go from 1 to n, in steps of size blocksize. A percent sign (%) is used as a comment line indicator. Then a matrix-matrix product code fragment is
for ii = 1:blocksize:n
   for jj = 1:blocksize:n
      %--------------------------------------------------------
      % Copy over block (ii/blocksize,jj/blocksize) of A to T,
      % transposing it as it is copied.
      %--------------------------------------------------------
      for i = 1:min(blocksize, n-ii+1)
         for j = 1:min(blocksize, n-jj+1)
            T(j,i) = A(ii+i-1, jj+j-1)
         end for
      end for
      %--------------------------------
      % For each column of C and B ...
      %--------------------------------
      for k = 1:n
         %---------------------------------------------------------
         % ... compute product of block (ii/blocksize,jj/blocksize)
         % of A with the k-th column of B
         %---------------------------------------------------------
         for i = 1:min(blocksize, n-ii+1)
            temp = 0.0
            for j = 1:min(blocksize, n-jj+1)
               temp = temp + T(j,i)*B(jj+j-1, k)
            end for
            % Accumulate: every jj block contributes to C(i+ii-1,k),
            % so C must be initialized to zero beforehand.
            C(i+ii-1,k) = C(i+ii-1,k) + temp
         end for
      end for
   end for % jj loop
end for % ii loop
The code above is almost runnable Matlab. It can be expressed more succinctly (and clearly) using Matlab index vectors - see the fuller Matlab function.
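A runnable Python sketch of the same blocked algorithm (my transcription, 0-based indexing): each blocksize×blocksize sub-block of A is copied into T transposed, then multiplied against full columns of B, with the partial sums accumulated into C.

```python
# Blocked matrix-matrix product C = A*B with the block of A copied
# into T transposed, so the inner loop walks down columns of T.
def matmul_blocked(A, B, blocksize):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, blocksize):
        for jj in range(0, n, blocksize):
            ib = min(blocksize, n - ii)   # actual block height
            jb = min(blocksize, n - jj)   # actual block width
            # Copy the (ii,jj) block of A into T, transposed:
            # T[j][i] = A[ii+i][jj+j].
            T = [[A[ii + i][jj + j] for i in range(ib)] for j in range(jb)]
            for k in range(n):
                for i in range(ib):
                    temp = 0.0
                    for j in range(jb):
                        temp += T[j][i] * B[jj + j][k]
                    # Accumulate: each jj block contributes to C[ii+i][k].
                    C[ii + i][k] += temp
    return C
```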
for ii = 1:blocksize:n
   for jj = 1:blocksize:n
      %--------------------------------------------------------
      % Copy over block (ii/blocksize,jj/blocksize) of A to T,
      % transposing it as it is copied.
      %--------------------------------------------------------
      for i = 1:min(blocksize, n-ii+1)
         for j = 1:min(blocksize, n-jj+1)
            T(j,i) = a(ii+i-1, jj+j-1)
         end for
      end for
      %------------------------------
      % Compute a 2x2 sub-block in C
      %------------------------------
      for k = 1:2:n
         for i = 1:2:min(blocksize, n-ii+1)
            t11 = 0.0
            t21 = 0.0
            t12 = 0.0
            t22 = 0.0
            for j = 1:min(blocksize, n-jj+1)
               t11 = t11 + T(j,i)*b(jj+j-1, k)
               t21 = t21 + T(j,i+1)*b(jj+j-1, k)
               t12 = t12 + T(j,i)*b(jj+j-1, k+1)
               t22 = t22 + T(j,i+1)*b(jj+j-1, k+1)
            end for
            % Accumulate; c is assumed initialized to zero.
            c(i+ii-1,k)   = c(i+ii-1,k)   + t11
            c(i+ii,k)     = c(i+ii,k)     + t21
            c(i+ii-1,k+1) = c(i+ii-1,k+1) + t12
            c(i+ii,k+1)   = c(i+ii,k+1)   + t22
         end for
      end for
   end for % jj loop
end for % ii loop
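The 2×2-unrolled inner loop can be transcribed to Python the same way (my sketch; for simplicity it assumes n is even and divisible by blocksize, and blocksize is even, so the stride-2 loops need no cleanup code):

```python
# Blocked product with the i and k loops unrolled by 2: four partial
# sums (t11, t21, t12, t22) are kept in scalars, reusing each loaded
# T and B element twice. Assumes n and blocksize even, blocksize | n.
def matmul_unrolled(A, B, blocksize):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, blocksize):
        for jj in range(0, n, blocksize):
            # Copy the (ii,jj) block of A into T, transposed.
            T = [[A[ii + i][jj + j] for i in range(blocksize)]
                 for j in range(blocksize)]
            for k in range(0, n, 2):
                for i in range(0, blocksize, 2):
                    t11 = t21 = t12 = t22 = 0.0
                    for j in range(blocksize):
                        t11 += T[j][i]     * B[jj + j][k]
                        t21 += T[j][i + 1] * B[jj + j][k]
                        t12 += T[j][i]     * B[jj + j][k + 1]
                        t22 += T[j][i + 1] * B[jj + j][k + 1]
                    # Accumulate block contributions into C.
                    C[ii + i][k]         += t11
                    C[ii + i + 1][k]     += t21
                    C[ii + i][k + 1]     += t12
                    C[ii + i + 1][k + 1] += t22
    return C
```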
Also note: this and the other graphs on this page are huge so that you can extract more info from them, not because Bramley is older than dirt and doesn't know how to resize images. I actually do know how: I ask a student to do it.
Performance peters out around n = 360 for the 3Loop version. For that size of n, the data involved is
(order of matrix)² * 8 bytes = 360² * 8 = 1,036,800 bytes,
which closely matches the 1 Mbyte = 1048576 byte size of the second level cache on the machine used. The second level cache is a combined data and instruction cache, so the performance is expected to drop even before that size is actually reached. Some references to C(i,j) are also required outside the innermost loop, and those references consume a few additional cachelines.
The crucial feature is that the versions with blocking have performance not bound by the cachesize, and the rate is sustained until the limits of memory are reached. For a system with 8 Gbytes of memory, the performance does not decrease until the 3 matrices are of order n ≥ 16000.
The optimized BLAS library on the system uses multiple cores, even if only one thread is specified during compilation. dgemm() does so well that the other four implementations are squashed together down at the bottom. However, using multiple cores doesn't give a fair comparison to implementations using only one.
The whole reason for modifying the implementation is to sustain good performance well past the cache size. A matrix order of 1500 corresponds to data of about 17 Mbytes, larger than the cache of the machine. The unrolled version retains those benefits, and yields an extra 15-20 Mflop/second.
The blocked and unrolled version gets around 80% of the performance of the highly optimized one from the vendor. This is typical; about 80% of the theoretical peak performance of a machine can be obtained from a C/C++ or Fortran code for matrix-matrix operations. Subject to one parameter (the blocksize that lets three sub-blocks fit into cache), that C++ or Fortran code is portable across machines.
This behaviour is typical:
The triple-nested loop version drops for n > 200, corresponding to a data size of
(order of matrix)² * 8 bytes = 200² * 8 = 320,000 bytes.
On older Sun machines results differ significantly. dgemm()'s performance is shown in black in the next plot.
The triple-nested loop version was terminated at a smaller value of n because it was taking forever. Trust me, it never got better. The version with blocking but no transpose of the A matrix gives results identical to the dgemm() version, even following the jagged dips and rises almost point for point.
The times were acquired with the Intel version 13.0 compiler with full optimization using the -O3 option.
Plotting the same results but including dgemm() gives:
As usual with vendor-optimized BLAS, dgemm() does significantly better. One feature worth noticing, especially for computer scientists who love powers of 2: The performance of 7 of the 8 methods has significant drops when the matrix order is a power of 2. Both plots have vertical dotted lines at powers of two to demonstrate this, and it shows the need to test a range of values that does not just use multiples of 4 or powers of 2.
Finally, results on a 64 bit Opteron using the PGI compiler and library:
PGI must have used an unoptimized triple-nested loop for their dgemm(), because it is down near the bottom with 3loop. Again the blocked versions get sustained performance well beyond the cache size on the different machines. This is what the BLAS load-store analysis indicates is possible.
Interesting research in this area has developed libraries for operations like matrix-matrix multiplication which adaptively tune their parameters to whatever architecture and application they are being used with. Such systems include ATLAS at Tennessee and PHiPAC at Berkeley. Also, taking advantage of the unique capabilities of C++, some projects provide "self-optimizing libraries" which find the correct parameters dynamically at run time, at the cost of some overhead.
Dense matrix-matrix multiplication is used by few modern research codes in scientific computing. So why on earth trudge through this too-long web page? Dense matrix-matrix multiplication is a best-case problem. If it cannot achieve high performance, then more interesting computational kernels won't either. Most importantly, the techniques used for getting and modeling performance are more widely useful than shown here, but the linear algebraic operations provide the clearest examples of doing so.