A BLAS-3 Version of LU Factorization


Changing from the rank-1 update version of LU factorization to a matrix-vector version was done to get a better BLAS-2 kernel (in terms of load/store analysis) as the workhorse. But since the best BLAS kernel is matrix-matrix multiplication, we should try to get a BLAS-3 version. This can be done by blocking.

Block versions of LU factorization are derived in the same way as was done to get the matrix-vector product version. Partition the matrix into block columns, each of width v. We can find v columns of L and v rows of U on each "iteration". Suppose that k columns/rows have already been found. The picture is

[Figure: block matrix-vector picture of LU factorization at step k]

where the unknowns are shown in red and the known values in blue. This is the same as the earlier diagram, but with a block column and block row replacing the single unknown column and row found at each step. In more detail, showing the sizes and shapes of the systems,

[Figure: the same block partitioning, with the sizes and shapes of the block systems shown]

Sequencing the solves in the order the unknowns are found gives:

[Figures: the sequence of block systems to be solved]
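In equations (a plain-text sketch of what the pictures show, ignoring pivoting): partition the unreduced part A(k:n, k:n) into a 2-by-2 block matrix with A11 of size vb-by-vb,

  [ A11  A12 ]   [ L11   0  ] [ U11  U12 ]
  [ A21  A22 ] = [ L21  L22 ] [  0   U22 ]

Multiplying out the blocks gives the sequence: L11*U11 = A11 and L21*U11 = A21, which together are an LU factorization of the panel [A11; A21]; then L11*U12 = A12, a unit lower triangular solve for the block row U12; and finally L22*U22 = A22 - L21*U12, the Schur complement left for the remaining iterations.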

The precise algorithm statement is left as an unassigned exercise. You should implement your resulting algorithm in Matlab to verify its correctness.
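For example, a quick correctness check in Matlab might look like the following, where myBlockLU is a hypothetical name for your routine, assumed to return the packed L and U factors in one array and a LINPACK-style pivot vector:

  n = 100; A = randn(n);
  [F, piv] = myBlockLU(A, 8);            % your block LU, block size 8
  L = eye(n) + tril(F, -1);              % unit lower triangular factor
  U = triu(F);                           % upper triangular factor
  PA = A;
  for i = 1:n                            % apply the recorded row swaps to A
      PA([i piv(i)], :) = PA([piv(i) i], :);
  end
  norm(PA - L*U, 1) / norm(A, 1)         % should be a modest multiple of eps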

The reason for using a matrix-vector product earlier was that it gave a better computational kernel. In a block version, however, the rank-1 update becomes a rank-v update, which is a matrix-matrix multiplication. That kernel has the best ratio of memory references to flops amongst the BLAS, so we can safely use it in a high-performance version of LU factorization instead of the block matrix-vector version above. This also makes a good review question to check that you have understood things so far: write out the block version of the matrix-vector based decomposition.
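To see the point concretely in Matlab terms, with L21 the newly computed block column of L, U12 the newly computed block row of U, and A22 the trailing submatrix (in the notation of the block equations above), the two forms of the trailing update are

  for j = 1:v                      % v separate rank-1 updates (the BLAS-2 way)
      A22 = A22 - L21(:,j) * U12(j,:);
  end

  A22 = A22 - L21 * U12;           % one rank-v update: a matrix-matrix multiply (BLAS-3)

Both compute the same matrix; only the second maps onto the best-performing BLAS kernel.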


Statement of a BLAS-3 version

The BLAS-3 version partitions the matrix into block columns, each of width v. The last block column, however, may be smaller if v does not divide n evenly. For that reason the following algorithm sets up vb as the actual block size of the current block column; for example, with n = 10 and v = 4 the block columns start at k = 1, 5, 9 and have widths vb = 4, 4, 2.

Another point: as with the block matrix-vector version shown above (which is really a matrix-matrix version as well), the step following the scaling of the subdiagonal entries updates the remaining rows of the unreduced part of A using those scaled entries. When converting to a block version, the equivalent is to apply a lower triangular solve to the corresponding block row, as happened above. That solve is indicated below via the Matlab operator tril(·,−1), which returns the strictly lower triangular part of its argument; the identity matrix is added to it since L is unit lower triangular. The triangular solve is applied with overwriting to the entire block row.

LU Factorization: Block Version with Rank-v Updates


for k = 1:n in steps of v
  1. vb = min(v, n−k+1)

  2. Apply LU with pivoting to A(k:n,k:k+vb−1) [Overwrite A(k:n,k:k+vb−1) with its LU factors and return local pivots in array lpiv()]

  3. piv(k:k+vb−1) = lpiv(1:vb) + k − 1 [Adjust local pivots to global numbering]

  4. for i = k:k+vb−1
    1. swap A(i,1:k−1) ↔ A(piv(i),1:k−1)
    2. swap A(i,k+vb:n) ↔ A(piv(i),k+vb:n)
    end for [apply the row swaps to the columns not already swapped by the LU on the current block column]

  5. Solve [I + tril(A(k:k+vb−1, k:k+vb−1),−1)] X = A(k:k+vb−1, k+vb:n) [Triangular solve with matrix right hand side and with overwriting of right hand side]

  6. A(k+vb:n,k+vb:n) = A(k+vb:n,k+vb:n) − A(k+vb:n,k:k+vb−1) * A(k:k+vb−1,k+vb:n) [the BLAS-3 matrix-matrix multiply that this whole damned process was designed to achieve]
end for
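
A minimal Matlab sketch of this algorithm follows. It is an illustration under simplifying assumptions, not the LU8.m mentioned below: in particular, the panel factorization swaps entire rows immediately, which folds steps 3 and 4 into step 2 but produces the same factors.

  function [A, piv] = blockLU(A, v)
  % Block LU with partial pivoting and rank-vb trailing updates.
  % On return A holds L strictly below the diagonal (unit diagonal
  % implicit) and U on and above it; piv holds the row interchanges.
  n = size(A, 1);
  piv = zeros(n, 1);
  for k = 1:v:n
      vb = min(v, n-k+1);
      cols = k:k+vb-1;
      % Step 2 (with the swaps of steps 3-4 folded in): BLAS-2 style
      % LU with partial pivoting on the panel A(k:n, cols)
      for j = cols
          [~, p] = max(abs(A(j:n, j)));          % partial pivot search
          piv(j) = p + j - 1;
          A([j piv(j)], :) = A([piv(j) j], :);   % swap the full rows
          A(j+1:n, j) = A(j+1:n, j) / A(j, j);   % scale the subdiagonal
          A(j+1:n, j+1:cols(end)) = A(j+1:n, j+1:cols(end)) ...
              - A(j+1:n, j) * A(j, j+1:cols(end));   % rank-1 update inside the panel
      end
      if k+vb <= n
          % Step 5: unit lower triangular solve on the block row, overwriting A
          A(cols, k+vb:n) = (eye(vb) + tril(A(cols, cols), -1)) \ A(cols, k+vb:n);
          % Step 6: the rank-vb (matrix-matrix) update of the trailing submatrix
          A(k+vb:n, k+vb:n) = A(k+vb:n, k+vb:n) ...
              - A(k+vb:n, cols) * A(cols, k+vb:n);
      end
  end

The residual check from the earlier snippet applies unchanged to this routine.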

Implementing the algorithm using the BLAS is relatively straightforward. Step 2 is a call to an LU factorization function with pivoting, applied to the rectangular array A(k:n,k:k+vb−1). That in turn should be the matrix-vector product version of LU factorization; this is why it was stated for rectangular arrays earlier, so that it could be used as a subfunction here. Step 4 is a sequence of calls to the BLAS routine dswap() to swap rows in the columns that were not already swapped by the call to the BLAS-2 rectangular LU factorization. Step 5 is a call to dtrsm(), and step 6 is a call to dgemm().

Before diving in and trying to implement the above algorithm, first draw the corresponding block pictures as was done above for the other versions. The Matlab function LU8.m implements this, with the kludge of having to specify the lower triangular solves with multiple right hand sides using (eye(nub)+tril(A(cols,cols),-1)). A C or Fortran version would just call the BLAS routine dtrsm(), passing along the relevant parts of the array A. The Matlab script testLU8.m sets up different arrays and calls LU8.m. It uses a switch/case construct to choose which test case to create and prompts the user to select one, so the script is also useful as an example of how to do those things in Matlab.
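As a rough idea of the shape of such a driver (a sketch of the pattern, not the actual testLU8.m), assuming the blockLU sketch from above:

  choice = input('Which test matrix? (1 = random, 2 = diagonally dominant, 3 = identity): ');
  n = 12;
  switch choice
      case 1
          A = randn(n);
      case 2
          A = randn(n) + n*eye(n);   % diagonal shift keeps the pivots healthy
      case 3
          A = eye(n);
      otherwise
          error('unknown test case %d', choice)
  end
  [F, piv] = blockLU(A, 4);          % factor with block size 4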

The algorithms here just produce the LU factorization. A complete solver will need to also do the triangular solves (trivially done with a BLAS library: two calls to dtrsv() ). What should be done if a zero or very small pivot entry is encountered? Quit, keep going, or switch over to full pivoting? In terms of implementing a scientific computing library, the basic principle is to return to the user enough information so that person can make those difficult decisions.
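A sketch of those completing solves in Matlab, using the output of the blockLU sketch above (backslash applied to a triangular matrix dispatches to a triangular solver, the role dtrsv() plays in a compiled implementation):

  function x = luSolve(A, piv, b)
  % Solve the original system given the packed LU factors and pivot vector.
  n = length(b);
  for i = 1:n
      b([i piv(i)]) = b([piv(i) i]);   % apply the row interchanges to b
  end
  y = (eye(n) + tril(A, -1)) \ b;      % forward solve: L*y = P*b
  x = triu(A) \ y;                     % back solve:    U*x = y

Note that this sketch simply divides by whatever pivots were produced; deciding what to do about tiny ones is left to the caller, in the spirit of the principle above.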

The library LINPACK, and its modern successor LAPACK, separate the factorization from the solves partly to force the user to decide whether to proceed after the factorization finishes. Then the user can call the triangular solves to solve a system Ax = b for a particular right hand side vector b, or to use a different factorization if the LU one was not sufficiently accurate. Almost all good linear algebra libraries will return information like an approximation of the condition number of the matrix to help the user decide what to do next.
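In Matlab, for instance, the same separation is visible with the built-in lu() and condest() (the latter returns a 1-norm condition number estimate, similar in spirit to what LAPACK's expert drivers provide):

  [L, U, P] = lu(A);           % factorization only; no right hand side yet
  c = condest(A);              % 1-norm condition number estimate
  if c < 1/eps                 % a crude acceptance test; the threshold is the user's call
      x = U \ (L \ (P*b));     % two triangular solves for a particular b
  end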


Another Reality/Understanding Checkpoint

A fast one has been pulled on you. The pictures above are for the BLAS-3 matrix-matrix version of LU factorization, based on the BLAS-2 matrix-vector product version. Then the detailed algorithm for a BLAS-3 version of LU factorization based on the BLAS-2 rank-1 update version was given. You should
  1. write the detailed algorithm for the BLAS-3 version based on the matrix-vector product version (that is, based on the pictures),
  2. draw the pictures that correspond to the BLAS-3 version based on the rank-1 update version (that is, based on the last algorithm above).


Summary

Nowadays there is little point in trying to write a better LU factorization for dense linear systems; LAPACK does it better, more accurately, and more prettily than the rest of us can. Libraries like LAPACK also provide at least two interfaces for something like an LU solver: a basic one that can be used easily by non-specialists, and an expert one that lets the user set parameters, obtain various error estimates, and otherwise guide the internal methods in the solver.

The real goal here is the process: using load/store analysis to move from an abysmal vector update or rank-1 update version to a nifty BLAS-3 matrix-matrix one. The important skill is being able to use that analysis to recognize if and when a better implementation exists, and to know how to achieve those better performing versions. More generally:

  1. Your computational kernels for your applications may not include Gaussian elimination, but they can benefit from the same analysis and improvement. The load/store, memory-reference-to-flop-ratio type of analysis is much broader and more important than the particular examples chosen to illustrate it.

  2. Sometimes you will need to dig into the innards of LAPACK to specialize it for some application, or to extract some special info. And you won't be able to do that without first knowing what the principles are.

  3. Sparse systems are ones where the coefficient matrix consists mainly of zeros. For those, more sophisticated data structures are used to store the matrix. Moving the computational kernels up the BLAS food chain to get higher performance is still an area of research for sparse systems.

