LU: pivoting and triangular solves


Recall that LU factorization with partial pivoting gives PA = LU, where

Given P, L, and U, solving Ax = b then becomes solving LUx = Pb, since PAx=Pb. Three steps are used in solving for the vector x in LUx = Pb.

  1. Set d = Pb, either by shuffling entries of b, or by accessing via indirect addressing using a permutation vector. In setting d = Pb, b can sometimes be overwritten with its permuted entries ... depending on how P is represented.

  2. Solve Ly = d (unit lower triangular system)

  3. Solve Ux = y (upper triangular system)

The most computationally intensive part of solving a linear system via LU is finding the factors. However, handling permuatations and triangular solves is also required so those are analyzed here.


Permutation Matrices

Never store or manipulate permutations as matrices, even in Matlab. Matlab has the idea of a "psychologically lower triangular" matrix which incorporates the row permutation in the matrix L. Do "help lu" in Matlab to see what it is. For non-Matlab codes, use one of the two storage schemes which use just a single integer vector. One is to use a permutation vector the other is to use a pivot vector. The first stores P as a set of integers pi that represent position of xi in y = Px:
i:      1   2   3   4   5   6   7   8
pi :    3   7   5   8   4   1   2   6
gives y = (x3, x7, x5, x8, x4, x1, x2, x6)T. When a permutation vector is used, y can be computed by
   for i = 1:n
       y(i) = x(p(i))
   end for i

The corresponding permutation matrix is
       [0  0  1  0  0  0  0  0]
       |0  0  0  0  0  0  1  0|
       |0  0  0  0  1  0  0  0|
  P =  |0  0  0  0  0  0  0  1|
       |0  0  0  1  0  0  0  0|
       |1  0  0  0  0  0  0  0|
       |0  1  0  0  0  0  0  0|
       [0  0  0  0  0  1  0  0]
and clearly no one in their right mind would represent it using an array with n2 = 64 elements. OK, maybe you would for 8×8 matrices ... but for 12000×12000 matrices, you are entering territory where institutionalization is recommended.

The second way of storing a permutation matrix P is with pivot vectors, and LU factorization uses those. The data structure used is an integer array piv of length n, which can be applied to a vector x in y = Px using


     y = x
     for k = 1:n
         swap y(k) and y(piv(k)) in the vector y
     end for 

Of course, this is better done with overwriting - an additional vector y is not needed.


A quick reality check ...

... or at least as real as computational science can get.


Triangular Systems, Part 1

Given the LU factors and the pivot array, two triangular systems need to be solved. Consider lower triangular systems; upper triangular systems can be handled mutatis mutandi, and it is a great exercise for you to modify all the stuff used on lower triangular systems to the upper triangular case. Also, in Gaussian elimination the lower triangular matrices are always unit lower triangular, meaning that they have ones on the diagonal. That also implies some modifications of the following - modifications which are trivial.

Storage: L can be stored by rows in what is called packed storage :

L:  [l11  l21 l22  l31 l32 l33  l41 l42 l43 l44 ... ]
so that the matrix element lij is stored in the array location L(j + i(i−1)/2). However, for LU factorization both L and U need to be stored, and A is just overwritten with them (this works since L is unit lower triangular and so its main diagonal need not be stored, allowing U to use the main diagonal storage.) In that case lij is stored in A(i,j). That makes indexing into the array that stores L easier.

As an example, a particular lower triangular system is:

    [ 2    0   0   0 ]  (x1)     ( −2)
    | 1    3   0   0 |  (x2)     (  2)   
    | −3   2   5   0 |  (x3)  =  ( 15)   
    [ 1    1   1   1 ]  (x4)     (  0)  
Everybody would begin with 2*x1 = −2 giving x1 = −1, and then solve in order for x2, x3, .... Two possibilities occur next: Plug in already-known values as you need them to solve for xi, or when you find a value, plug it into all the remaining equations before going on to find the next value. Gives two algorithms:

Row-oriented Lower Triangular Solve:


for i = 1:n
    for j = 1:i−1
        b(i) = b(i) − l(i,j)*x(j)
    end for j
    x(i) = b(i)/l(i,i)
end for i

This version has an inner product as innermost loop and accesses rows of L. The second version is

Column-oriented Lower Triangular Solve:


for j = 1:n
    x(j) = b(j)/l(j,j)
    for i = j+1:n
        b(i) = b(i) − l(i,j)*x(j)
    end for i
end for j

This second version has a daxpy as innermost loop (bad), and accesses columns of L.

Algorithm 1 is sometimes called a row sweep, and Algorithm 2 is called a column sweep. Both have block versions possible by simply treating lij as an m × m block Lij. Then

Block Row-oriented Lower Triangular Solve:


for i = 1:n
    for j = 1:i−1
        b(i) = b(i) − L(i,j)*x(j)
    end for j
    solve L(i,i) x(i) = b(i) for x(i).
end for i

Block Column-oriented Lower Triangular Solve:


for j = 1:n
    solve L(j,j) x(j) = b(j) for x(j)
    for i = j+1:n
        b(i) = b(i) − L(i,j)*x(j)
    end for i
end for j

OK, by now you should know the drill: what level of BLAS is a triangular solve? What is the memory reference to flop ratio? Should near peak performance or near basement performance be expected on a modern machine?

The minimal number of memory references are to read L and b and write x. This gives n(n+1)/2 + n + n = (n2 + 5n)/2 memory references. The number of flops involved is sumj=1:n[2*(n−j) + 1] = n2. So the ratio of memory references to flops is 1/2 + (5/2)n, a typical BLAS2 number.

That also means the two block versions shown above are optimal implementations. They both are based on matrix-vector products, which itself has a load/store ratio of 1/2. This is vitally important: load/store analysis tells us there is no point in pursuing a more efficient version since no such version exists.

Performing triangular solves is easy, and not worth spending a lot of time optimizing them. The reason is simple: computing the LU factorization requires ≈ (2/3)n3 flops, while the triangular solves require only ≈ 4n2 (2n2 each for L and U). So optimization efforts should go for the low-hanging fruit first: LU factorization. The final stages in getting a good LU will turn out to require triangular solves, but with multiple right hand sides. That is, it will involve solving L*U = V, where V is an n × p matrix. That operation is a level-3 BLAS and worth optimizing ... but we'll let the BLAS take care of that one.


Next: Pivoting, and a BLAS-3 LU factorization