Gaussian Elimination (LU Factorization) and the BLAS



Gaussian Elimination

The most commonly used direct method for solving general linear systems is Gaussian elimination with partial pivoting, which in modern terms is called LU decomposition with pivoting, or LU factorization with pivoting. You probably learned Gaussian elimination by adjoining the right hand side vector b to the matrix, then performing row combinations that would zero out the subdiagonal entries of the matrix A. LU decomposition (or "LU factorization") does the same operations, but ignores the right hand side vector until after the matrix has been processed. In particular: Let A be an n x n matrix, nonsingular. Gaussian elimination with partial pivoting gives the factorization PA = LU, where

Solving Ax = b then becomes LUx = Pb, since PAx=Pb. Three steps can be taken in solving LUx = Pb.

  1. Set d = Pb, either by shuffling entries of b, or by accessing via indirect addressing using a permutation vector. In setting d = Pb, we can sometimes overwrite b with its permuted entries ... depending on how P is represented.
  2. Solve Ly = d (unit lower triangular system)
  3. Solve Ux = y (upper triangular system)
We look later in some detail at all three stages. However, note that the triangular solves can be implemented by overwriting the right hand side vectors with the solution as we go along, rather than introducing new vectors y and d. In that case the three steps look more parsimonious with space:
  1. Set x = Pb
  2. Solve Lx = x (unit lower triangular system)
  3. Solve Ux = x (upper triangular system)
and if the original right hand side vector b is not needed later on, x can be replaced with b in all three steps.

LU Factorization, Part 1

The classical algorithm you probably learned (and need to unlearn) follows a three phase sequence on each step: find the pivot, interchange rows to bring the pivot to the diagonal position, scale the subdiagonal entries in the column, then update the rest of the matrix. Carrying out these operations with overwriting of the array A gives the L and U factors in the corresponding parts of A (since L is unit lower triangular, we don't store its diagonal, so everything fits tidily). Less obscurely, the algorithm without pivoting is:

LU Factorization: Version 1

for k = 1:n-1
  for i = k+1:n
      A(i,k) = A(i,k)/A(k,k)
  end for
  for i = k+1:n
      for j = k+1:n
        A(i,j) = A(i,j) - A(i,k)* A(k,j)
      end for 
  end for
end for 
We can tighten up the notation a little by using Matlab notation. Recall
  1. A(r:s,i) refers to rows r through s in column i,
  2. A(r,i:j) refers to columns i through j in row r,
  3. A(r:s,i:j) refers to rows r through s in columns i through j.
Note that (1) is a column vector, (2) is a row vector, while (3) is a 2D submatrix. Now the algorithm above can be stated more succintly as

LU Factorization: Rank-1 Update Form

for k = 1:n-1
  A(k+1:n,k) = A(k+1:n,k)/A(k,k)
  A(k+1:n,k+1:n) = A(k+1:n,k+1:n) - A(k+1:n,k)* A(k,k+1:n)
end for 

In fact, we can tighten up the notation even more in Matlab using indexing vectors :

for k = 1:n-1
  K = k+1:n
  A(K,k) = A(K,k)/A(k,k)
  A(K,K) = A(K,K) - A(K,k)* A(k,K)
end for 
Notice that this implies Matlab is case-sensitive (it is), and follows a useful convention: use upper case letters for vectors used for indexing, and lower case letters for scalars used for indexing.

In either of the last two forms, this more clearly shows what the operations involved are. Each iteration is a vector scaling of the subdiagonal part of column k, followed by a rank-1 update of the trailing part of the matrix. Graphically, the LU rank-1 update form procedure is:

LU; Rank-1

The rank-1 update stinks as a computational kernel ... check its memory reference to flop ratio. A more efficient version comes via a common idea in computer science (where it is called lazy evaluation). In numerical linear algebra it has a longer history, and is called "deferred updates". This means apply the updating information to a part of the matrix only when it is actually needed. The algorithm results by equating parts of A, L, and U in a partitioned way. In pictures, we have

LU; Derive Mat-vec

The diagonal entry in column k of L is known to be one, of course. Note that by sequencing the finding of g, u', and l as shown above defines an algorithm which recursively works on the trailing subblock. The resulting algorithm is

LU Factorization: Matrix-Vector Product Form

for k = 1:n-1
   A(k:n,k) = A(k:n,k) - A(k:n,1:k-1)*A(1:k-1,k)
   A(k,k+1:n) = A(k,k+1:n) - A(k,1:k-1)*A(1:k-1,k+1:n)
   A(k+1:n,k) = A(k+1:n,k)/A(k,k)
end for
The computational kernels are (in order) matrix-vector product, vector-matrix product, and vector scaling. So the rank-1 update has been replaced by matrix-vector products, which have half the memory reference to flop ratio of a rank-1 update. This is a big win on a cache-based machine. Minor point performance-wise, but big in terms of correctness: there is a final update of A(n,n) which has been left off above. Figure out what it is and correct the algorithm.

Graphically, the LU matrix-vector product form procedure is:

LU; Mat-vec

A final note here: the algorithms above (rank-1 and matrix-vector product formulations) can be applied to an m by n rectangular matrix. If m > n, then L is lower trapezoidal and U is upper triangular. If m < n, then U is upper trapezoidal, and L is lower triangular.


Next page: LU Factorization and Required BLAS Operations