Enhancing pipelining and data locality in a program actually gives a larger boost than writing in low-level languages. The techniques are relatively easy for dense arrays, as will be shown in detail for matrix-matrix multiply. They are equally important but more difficult for sparse matrix computations, or ones using more general data structures like graphs or trees. These computations typically involve lots of indirect addressing of the form x(col(i)), where col() is a vector of indices. The indirect addressing typically takes the form of multilevel pointer dereferencing in C/C++, such as *x[i]. Then
There is often a tradeoff between pipelining and data locality. Long vector lengths help amortize the start-up and draining costs of pipelines, but long vector lengths can lead to cache overflow, causing more cache misses. If you have to choose between the two, go with data locality. The reason: typically, a cache miss costs 10 times as much as a cache hit, so in the extreme case enhancing locality could make your code run 10 times faster. On the other hand, most pipelines in workstation functional units have a depth of 2-3, so at most they can triple your performance.
Keep in mind also that optimizing compilers don't always do the work for us. Part of this is the danger of reordering operations - changing the order of associative operations can lead to different numerical results, which is undesirable in scientific computing (remember, a key defining property of science is that results are reproducible). Another part is simply the complexity of realistic applications codes prevents many auto-optimizations.
Now consider the usual triple nested loop for a matrix-matrix product, with n1 = n2 = n3 = n:
for i = 1:n
for j = 1:n
temp = 0.0d0
for k = 1:n
temp = temp + A(i,k)*B(k,j)
end for
C(i,j) = temp
end for
end for
Each outer (i,j) iteration computes a single entry of C as the
inner product of row i of A and column j of B. Suppose that the matrices
are stored in 2D arrays, and they are mapped in "column-major" order; that is,
they are stored column-by-column in memory. Then each time you cross a
blue line below while computing C(i,j), a cache miss is generated:
For n larger than what fits into the cacheline size, every access to the array A by rows will cause a cache miss.
Simply blocking the matrices does not give you faster code. To do that, remove the worst data locality criminal - the accesses to A. The idea is to rearrange the computations so that when a subblock of A is brought into the cache, you keep it there until all computations involving it have completed. By itself that won't help, since accessing across the rows of A's subblock will still cause cache misses. So copy the subblock of A into a temporary array T of size s x s, and transpose the subblock during that copying process. To keep clear that s is a blocksize, use the variable name "blocksize" instead of "s". The indexing convention used here is that
for ii = 1:blocksize:nmeans to go from 1 to n, insteps of blocksize. That gives the code fragment
for ii = 1:blocksize:n
for jj = 1:blocksize:n
% ------------------------------------------------------
% Copy over block (ii/blocksize,jj/blocksize) of A to T,
% transposing it as it is copied.
% ------------------------------------------------------
for i = 1:min(blocksize, n-ii+1)
for j = 1:min(blocksize, n-jj+1)
T(j,i) = A(ii+i-1, jj+j-1)
end for
end for
% ------------------------------
% For each column of C and B ...
% ------------------------------
for k = 1:n
% --------------------------------------------------------
% ... compute product of block (ii/blocksize,jj/blocksize)
% of A with the k-th column of B
% --------------------------------------------------------
for i = 1:min(blocksize, n-ii+1)
temp = 0.0d0
for j = 1:min(blocksize, n-jj+1)
temp = temp + T(j,i)*B(jj+j-1, k)
end for
C(i+ii-1,k) = temp
end for
end for
end for % jj loop
end for % ii loop
with the corresponding picture:
The code above is almost runnable Matlab. It can be expressed more succintly using Matlab index vectors - see the fuller Matlab function. Blocking enhances data locality, but not pipelining. To improve pipelining, the idea is to keep the same kind of blocking (and so retain its benefits), and do loop unrolling within the blocks. We can do this by computing a small 2x2 sub-block of entries within a given block of C:
for ii = 1:blocksize:n
for jj = 1:blocksize:n
% ------------------------------------------------------
% Copy over block (ii/blocksize,jj/blocksize) of A to T,
% transposing it as it is copied.
% ------------------------------------------------------
for i = 1:min(blocksize, n-ii+1)
for j = 1:min(blocksize, n-jj+1)
T(j,i) = a(ii+i-1, jj+j-1)
end for
end for
% ---------------------------
% Compute a 2x2 subblock in C
% ---------------------------
for k = 1:2:n
for i = 1:2:min(blocksize, n-ii+1)
t11 = 0.0d0
t21 = 0.0d0
t12 = 0.0d0
t22 = 0.0d0
for j = 1:min(blocksize, n-jj+1)
t11 = t11 + T(j,i)*b(jj+j-1, k)
t21 = t21 + T(j,i+1)*b(jj+j-1, k)
t12 = t12 + T(j,i)*b(jj+j-1, k+1)
t22 = t22 + T(j,i+1)*b(jj+j-1, k+1)
end for
c(i+ii-1,k) = t11
c(i+ii,k) = t21
c(i+ii-1,k+1) = t12
c(i+ii,k+1) = t22
end for
end for
The code fragment above assumes that the matrix order n is
evenly divisible by 2. It requires clean up
(AKA preconditioning) loops to be correct.
As an informal exercise to help your understanding of the operations
going on, you might try writing the version valid for matrices that
are not of even size.
Just beware that there are several cases to consider for the temporary
block T, however when mod(n,blocksize) is not 0.
Also note:
Performance peters out around n = 360 for the 3Loop version. For that size of n, the data involved is
which closely matches the 1 Mbyte = 1048576 byte size of the second level cache on the machine used. The second level cache is a combined data and instruction cache, so we would expect the performance to drop slightly before that size is actually reached. There are also the references to C(i,j) in the code that are outside the innermost loop but still would involve some cachelines being brought in.
There is a library function for doing matrix-matrix multiplication on every architecture - including laptops running the Linux operating system. It is called dgemm(), and its performance is shown in black with the other four versions.
The optimized BLAS library on the system uses multiple cores, even if you specify that only one thread should be run. It does so well that the other four implementations are squashed together down at the bottom. However, using multiple cores doesn't give a fair comparison to implementations using only one.
Note that the blocked version has good performance, sustained well past the cache size (a matrix order of 1500 corresponds to data of about 17 Mbytes). The unrolled version retains those benefits, and gets an extra 15-20 Mflop boost as well.
The blocked, unrolled version gets around 80% of the performance of the highly optimized one from the vendor. This is typical; you can achieve about 80% of the peak performance of the machine you are using from a C/C++ or Fortran code. And subject to one parameter (the blocksize that lets three subblocks fit into cache), that C or Fortran code is portable across machines.
Since the multicore versus single core issue helps cloud the results, here's results from other systems:
On older Sun machines results differ significantly:
Finally, the results on a (slightly) more modern processor, the 64 bit Opteron using the PGI compiler and library:
The feature to note here is that PGI must have used a unoptimized triple-loop version for their dgemm, because it is down near the bottom with 3loop. Most importantly in all of these cases the blocked versions get sustained performance well beyond the cache size on the different machines. This is what the BLAS load-store analysis tells us is possible.
An interesting recent direction of research in this area is for libraries of algorithms like matrix-matrix multiplication which adaptively optimize such parameters for whatever architecture and application they are being used with. Such systems include ATLAS at Tennessee and PhiPAC at Berkeley. Also, because of unique capabilities in C++ some projects target "self-optimizing libraries" which dynamically at run-time find the correct parameters with minimal overhead.
Dense matrix-matrix multiplication is of little research interest. So why on earth did we spend a too-long web page on it? Dense matrix-matrix multiplication is a "best-case" problem, in that if you cannot get high performance with it, then your approach is probably not going help in other, more interesting, computational kernels. More importantly, the techniques used for getting and modeling performance, are useful more widely, not just in linear algebraic operations.
Next page: Using the BLAS in a code