LU Factorization and Required BLAS Operations


Recall that Gaussian elimination with partial pivoting gives the factorization PA = LU, where

Solving Ax = b then becomes LUx = Pb, since PAx=Pb. Three steps can be taken in solving LUx = Pb.

  1. Set d = Pb, either by shuffling entries of b, or by accessing via indirect addressing using a permutation vector. In setting d = Pb, we can sometimes overwrite b with its permuted entries ... depending on how P is represented.
  2. Solve Ly = d (unit lower triangular system)
  3. Solve Ux = y (upper triangular system)

Permutation Matrices

Never store/manipulate these as matrices, even in Matlab. Instead there are two representations which use just a single integer vector. One is to use a permutation vector the other is to use a pivot vector. The first stores P as a set of integers pi that represent position of xi in y = Px:
i:      1   2   3   4   5   6   7   8
pi :    3   7   5   8   4   1   2   6
gives y = (x3, x7, x5, x8, x4, x1, x2, x6)T. We can compute y by
   for i = 1:n
       y(i) = x(p(i))
   end for i
The corresponding permutation matrix is
       [0  0  1  0  0  0  0  0]
       |0  0  0  0  0  0  1  0|
       |0  0  0  0  1  0  0  0|
  P =  |0  0  0  0  0  0  0  1|
       |0  0  0  1  0  0  0  0|
       |1  0  0  0  0  0  0  0|
       |0  1  0  0  0  0  0  0|
       [0  0  0  0  0  1  0  0]
and clearly no one in their right mind would represent it using an array with n2 = 64 elements! OK, maybe you would for 8x8 matrices ... but for 1000 x 1000 matrices, you are entering territory where institutionalization is recommended.

The second way of representing permutation matrices P is with pivot vectors, and it is these which we use in Gaussian elimination. This is an integer array piv of length n, applied to a vector x in y = Px using

     y = x
     for k = 1:n
         swap y(k) and y(piv(k)) in the vector y
     end for 
Of course, this is better done with overwriting - you don't need to use another vector y.

Now for quick reality check:


Triangular Systems, Part 1

Once we have the LU factors we need to solve two triangular systems. Consider lower triangular systems; upper triangular systems can be handled mutatis mutandi, and it is a great exercise for you to modify all the stuff used on lower triangular systems to the upper triangular case. Also, in Gaussian elimination the lower triangular matrices are always unit lower triangular, meaning that they have ones on the diagonal. That also implies some modifications of the following - modifications which are trivial.

Storage: we could store L by rows in what is called packed storage :

L:  [l11  l21 l22  l31 l32 l33  l41 l42 l43 l44 ... ]
so that the matrix element lij is stored in the array location L(j + i(i-1)/2). However, for Gaussian elimination both L and U need to be stored, and generally A is just overwritten with them (this works since L is unit lower triangular and so its main diagonal need not be stored, allowing U to use the main diagonal storage.) In that case lij is stored in A(i,j). That makes indexing into the array storing L easier.

Now let's solve a particular lower triangular system:

    [ 2    0   0   0 ]  (x1)     ( -2)
    | 1    3   0   0 |  (x2)     (  2)   
    | -3   2   5   0 |  (x3)  =  ( 15)   
    [ 1    1   1   1 ]  (x4)     (  0)  
Everybody would begin with 2 x1 = -2 giving x1 = -1, and then solve in order for x2, x3, .... Two possibilities occur next: Plug in already-known values as you need them in order to solve for xi, or when you find a value, plug it into all the remaining equations before going on to find the next value. Gives two algorithms:

Row-oriented Lower Triangular Solve

for i = 1:n
    for j = 1:i-1
        b(i) = b(i) - l(i,j)*x(j)
    end for j
    x(i) = b(i)/l(i,i)
end for i
This version has an inner product as innermost loop and accesses rows of L. The second version is

Column-oriented Lower Triangular Solve

for j = 1:n
    x(j) = b(j)/l(j,j)
    for i = j+1:n
        b(i) = b(i) - l(i,j)*x(j)
    end for i
end for j
This version has a daxpy as innermost loop (bad), and accesses columns of L.

Algorithm 1 is sometimes called a row sweep algorithm, and Algorithm 2 is called a column sweep algorithm. Both algorithms have a block version possible by simply treating lij as an m x m block Lij. Then

Block Row-oriented Lower Triangular Solve

for i = 1:n
    for j = 1:i-1
        b(i) = b(i) - L(i,j)*x(j)
    end for j
    solve L(i,i) x(i) = b(i) for x(i).
end for i

Block Column-oriented Lower Triangular Solve

for j = 1:n
    solve L(j,j) x(j) = b(j) for x(j)
    for i = j+1:n
        b(i) = b(i) - L(i,j)*x(j)
    end for i
end for j
OK, by now you should know the drill: what kind of computational kernel is a triangular solve? What is the memory reference to flop ratio? Should we expect near peak performance or near basement performance on a high-performance machine?

The minimal number of memory references are to read L and b and write x. This gives n(n+1)/2 + n + n = (n2 + 5n)/2 memory references. The number of flops involved is sumj=1:n[2*(n-j) + 1] = n2. So the ratio of memory references to flops is 1/2 + (5/2)n, a typical BLAS-2 number.

Performing triangular solves is not too hard, and we won't spend much time optimizing them. The reason is simple: computing the LU factorization requires (2/3)n3 flops, while the triangular solves require only 4n2 (2n2 each for L and U). So our efforts should go for the low-hanging fruit first: LU factorization.


Next page: Pivoting