Partial Pivoting


Pivoting in LU Factorization

When using either of the two versions (rank-1 update and matrix-vector product) of LU factorization, the division by A(k,k) in the scaling operation can cause problems if A(k,k) is small in absolute value. Pivoting means bringing a larger element into that position by swapping rows or columns in the matrix. Partial pivoting means doing only row swaps; LU factorization is "usually stable" when partial pivoting is used. However, there are classes of problems (some two point boundary value problems in ODE's is one example) for which it can fail. For the rank-1 version this gives

LU Factorization: Rank-1 Update with Pivoting Form

for k = 1:n
  p = index of max element in |A(k:n,k)|
  piv(k) = p
  swap rows k and p of A: A(k,:) <---> A(p,:)
  A(k+1:n,k) = A(k+1:n,k)/A(k,k)
  A(k+1:n,k+1:n) = A(k+1:n,k+1:n) - A(k+1:n,k)* A(k,k+1:n)
end for 
Note that the first step picks out the largest element in the k-th column among the remaining unreduced part of the matrix. Also, the loop index runs to n, so that piv(n) will be defined - of course, that is just n (why?). Also, note that we swap the entire rows k and p of the array A; this will also swap the corresponding parts of L and U since we are overwriting A with those factors. It is another magic feature of LU factorization that this makes the factors come out "in the right order" as well.

Next, here is a matrix-vector product formulation with pivoting, but stated for an m by n matrix B, with m >= n.

LU Factorization: Matrix-Vector Product with Pivoting for mxn Matrix

for k = 1:n-1
   B(k:m,k) = B(k:m,k) - B(k:m,1:k-1)*B(1:k-1,k)
   p = index of max element in |B(k:m,k)|
   piv(k) = p
   swap rows k and p of B: B(k,:) <---> B(p,:)
   B(k,k+1:n) = B(k,k+1:n) - B(k,1:k-1)*B(1:k-1,k+1:n)
   B(k+1:m,k) = B(k+1:m,k)/B(k,k)
end for
B(n:m,n) = B(n:m,n) - B(n:m,1:n-1)*B(1:n-1,n)
B(n+1:m,n) = B(n+1:m,n)/B(n,n)
Note the last two steps are new ingredients - they consist of applying the remaining updates to the rest of the matrix. Also note that we can do another pivot step in case B(n,n) is small.

LU Factorization, Part 2

Moving from the rank-1 update version of LU factorization to a matrix-vector version was done to get a better BLAS-2 kernel (in terms of load/store analysis) as the workhorse. But since the best BLAS kernel is matrix-matrix multiplication, we should try to get a BLAS-3 version. This can be done by blocking.

Blocked versions of LU factorization can be derived in the same way as was done to get the matrix-vector product version. Partition the matrix into block columns, each of width v. We can find v columns of L and v rows of U on each "iteration". Suppose that k columns/rows have already been found. The picture is

LU; Block mat-vec

where the unknowns are shown in red and the known values in blue. This is the same as the last diagram, but with block columns replacing the unknown row/column to find at this step. In more detail showing the sizes and shapes of the systems,

LU; Block mat-vec

Sequencing the known/unknowns in order gives:

LU; Block mat-vec
LU; Block mat-vec

The precise algorithm statement is left as an unassigned exercise. You should implement your resulting algorithm in Matlab to verify its correctness.

The reason for using a matrix-vector product earlier was that it gave better computational kernels. However, when using a blocked version the rank-1 update becomes a rank-v update and so becomes matrix-matrix multiplication. That kernel is the best amongst the BLAS in terms of its ratio of memory references to flops, so we can safely use it in a high-performance version of LU factorization, instead of the version above based on the blocked matrix-vector version. This also gives you a good review question to make sure you have understood things so far: write out the block version of the matrix*vector based decomposition.

So partition the matrix into block columns, each of width v. The last block column, however, may be of smaller width if v does not divide n evenly. For that reason the following algorithm sets up vb as the actual block size of the current block column.

Another point: as with the block matrix-vector (which is really a matrix-matrix version as well) shown above, the step following the scaling of subdiagonal entries consists of multiplying the following rows of the unreduced parts of A using those scaled entries. When we go to a block version, the equivalent is to apply a lower triangular L solve on the corresponding block row, as happened above. That is indicated below by the Matlab tril(*,-1) operator returning the strictly lower triangular part of its argument. That is added to the identity matrix since L is unit lower triangular. The triangular solve is applied with overwriting to the entire block row.

LU Factorization: Block Version with Rank-v Updates

for k = 1:n in steps of v
  1. vb = min(v, n-k+1)
  2. Apply LU with pivoting to A(k:n,k:k+vb-1) [Overwrite A(k:n,k:k+vb-1) with its LU factors, return local pivots lpiv()]
  3. piv(k:k+vb-1) = lpiv(1:vb) + k - 1 [Adjust local pivots to global numbering]
  4. for i = k:k+vb-1
    1. swap A(i,1:k-1) <---> A(piv(i),1:k-1)
    2. swap A(i,k+vb:n) <---> A(piv(i),k+vb:n)
    end for [pivots columns not already done in LU on current block col]
  5. Solve [I + tril(A(k:k+vb-1, k:k+vb-1),-1)] X = A(k:k+vb-1, k+vb:n) [Triangular solve with matrix right hand side, solved with overwriting of rhs]
  6. A(k+vb:n,k+vb:n) = A(k+vb:n,k+vb:n) - A(k+vb:n,k:k+vb-1)* A(k:k+vb-1,k+vb:n)
end for

Note that the second step is simply a call to LU factorization with pivoting, applied to the rectangular matrix A(k:n,k:k+b-1). This is why we stated the matrix-vector product form of LU factorization above for a rectangular matrix - so it could be used as a function here. Step four is two calls to the BLAS routine dswap(), step 5 a call to dtrsm(), and step 6 a call to dgemm().

Before diving in and trying to implement the above algorithm, you should first draw the corresponding block pictures as was done above for the other versions.

Finally, all the above does is to produce the LU factorization. A complete solver will need to also do the triangular solves (trivially done with a BLAS library: two calls to dtrsv() ). And what do we do if there is a zero or very small pivot entry encountered? Quit, keep going, go to full pivoting? In terms of implementing a scientific computing library, you should return to the user enough information so that the user can make that difficult decision - not your code.

Reality/Understanding Checkpoint

Note that I drew pictures above for the BLAS-3 (matrix-matrix) version of LU factorization, based on the BLAS-2 matrix-vector product version. Then I gave the detailed algorithm for a BLAS-3 version of LU factorization based on the BLAS-2 rank-1 update version. You should try to
  1. write the detailed algorithm for the BLAS-3 version based on the matrix-vector product version (that is, based on the pictures)
  2. draw the pictures that correspond to the BLAS-3 version based on the rank-1 update version (that is, based on the last algorithm above).

Summary

Nowadays there is little point in trying to write a better LU factorization for dense linear systems.
LAPACK does it better, more accurately, and prettier than the rest of us can. However,
  1. Sparse matrices are ones where the coefficient matrix consists mainly of zeros. For those, more sophisticated data structures are used to store the matrix than 2D arrays. Moving the computational kernels up the BLAS food chain to get higher performance is still an active area of research.
  2. Your computation kernels for your applications may not have Gaussian elimination, but can benefit from the same analysis and improvement. The load-store, mem ref to flop ratio type of analysis is much broader and important than the actual examples chosen to illustrate it.
  3. Sometimes you will need to dig into the innards of LAPACK to specialize it for some application, or to extract some special info. And you won't be able to do that without first knowing what the principles are.


  • Next page: Error bounds for LU factorization