P573: LU Factorization

LU: pivoting and triangular solves

Recall that LU factorization with partial pivoting gives PA = LU, where

P is a permutation matrix: it has exactly one 1 in each row and column, and zeros elsewhere. Its inverse is P⁻¹ = P^T, and it can be stored as a 1D integer array of length n.
L is a unit lower triangular matrix (ones on the main diagonal, zeros above it)
U is an upper triangular matrix (zeros below the main diagonal)

Given P, L, and U, solving Ax = b then becomes solving LUx = Pb, since PAx=Pb. Three steps are used in solving for the vector x in LUx = Pb.

Set d = Pb, either by shuffling entries of b, or by accessing via indirect addressing using a permutation vector. In setting d = Pb, b can sometimes be overwritten with its permuted entries ... depending on how P is represented.
Solve Ly = d (unit lower triangular system)
Solve Ux = y (upper triangular system)
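The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a tuned solver: indices are 0-based, the matrices are plain lists of rows, P is held as a permutation vector p with (Pb)[i] = b[p[i]], and the function name solve_with_lu and the small 3×3 test data are made up for this example.

```python
def solve_with_lu(p, L, U, b):
    """Solve A x = b given P A = L U, with P stored as a permutation vector p.

    Assumptions (not from the notes): 0-based indices, dense lists of rows,
    and (P b)[i] = b[p[i]].
    """
    n = len(b)
    # Step 1: d = P b, by indirect addressing through p.
    d = [b[p[i]] for i in range(n)]
    # Step 2: forward substitution for L y = d (unit diagonal, so no division).
    y = [0.0] * n
    for i in range(n):
        s = d[i]
        for j in range(i):
            s -= L[i][j] * y[j]
        y[i] = s
    # Step 3: back substitution for U x = y.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = y[i]
        for j in range(i + 1, n):
            s -= U[i][j] * x[j]
        x[i] = s / U[i][i]
    return x


# Made-up example: A = [[2,4,8],[1,2,6],[4,4,8]] satisfies P A = L U with
p = [2, 0, 1]
L = [[1.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.25, 0.5, 1.0]]
U = [[4.0, 4.0, 8.0], [0.0, 2.0, 4.0], [0.0, 0.0, 2.0]]
b = [14.0, 11.0, 16.0]           # b = A x for x = (1, -1, 2)
print(solve_with_lu(p, L, U, b))  # [1.0, -1.0, 2.0]
```

Note that the factors are never multiplied back together; once P, L, and U are known, each new right-hand side costs only a permutation and two triangular solves.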

The most computationally intensive part of solving a linear system via LU is finding the factors. However, handling permutations and triangular solves is also required, so those are analyzed here.

Permutation Matrices

Never store or manipulate permutations as matrices, even in Matlab. Matlab has the idea of a "psychologically lower triangular" matrix, which incorporates the row permutation in the matrix L. Do "help lu" in Matlab to see what it is. For non-Matlab codes, use one of two storage schemes, each of which needs just a single integer vector: a permutation vector or a pivot vector. The first stores P as a set of integers p_i, where p_i is the index of the entry of x that lands in position i of y = Px:

i:      1   2   3   4   5   6   7   8
p_i :    3   7   5   8   4   1   2   6

gives y = (x₃, x₇, x₅, x₈, x₄, x₁, x₂, x₆)^T. When a permutation vector is used, y can be computed by

   for i = 1:n
       y(i) = x(p(i))
   end for i

The corresponding permutation matrix is

       [0  0  1  0  0  0  0  0]
       |0  0  0  0  0  0  1  0|
       |0  0  0  0  1  0  0  0|
  P =  |0  0  0  0  0  0  0  1|
       |0  0  0  1  0  0  0  0|
       |1  0  0  0  0  0  0  0|
       |0  1  0  0  0  0  0  0|
       [0  0  0  0  0  1  0  0]

and clearly no one in their right mind would represent it using an array with n² = 64 elements. OK, maybe you would for 8×8 matrices ... but for 12000×12000 matrices, you are entering territory where institutionalization is recommended.
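For anyone who wants to see the two representations agree, here is a small Python sketch using the 8-entry example above. The function names are invented for this illustration, the entries of p are kept 1-based as in the table (hence the "- 1" when indexing a Python list), and the numbers 10, 20, ..., 80 are just stand-ins for x₁, ..., x₈.

```python
def apply_permutation(p, x):
    """Return y = P x with y[i] = x[p[i]]; p holds 1-based entries."""
    return [x[pi - 1] for pi in p]


def dense_permutation_matrix(p):
    """Build the n-by-n matrix P explicitly (only to check the example)."""
    n = len(p)
    return [[1 if j == pi - 1 else 0 for j in range(n)] for pi in p]


p = [3, 7, 5, 8, 4, 1, 2, 6]
x = [10, 20, 30, 40, 50, 60, 70, 80]   # stand-ins for x1..x8

y = apply_permutation(p, x)            # n memory reads, no arithmetic
P = dense_permutation_matrix(p)        # n*n storage
y_dense = [sum(P[i][j] * x[j] for j in range(len(x))) for i in range(len(x))]

print(y)             # [30, 70, 50, 80, 40, 10, 20, 60]
print(y == y_dense)  # True
```

The vector version touches n integers; the dense version stores n² entries and does n² multiplications by 0 or 1 to get the same answer.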

The second way of storing a permutation matrix P is with pivot vectors, and LU factorization uses those. The data structure used is an integer array piv of length n, which can be applied to a vector x in y = Px using

     y = x
     for k = 1:n
         swap y(k) and y(piv(k)) in the vector y
     end for

Of course, this is better done with overwriting - an additional vector y is not needed.
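A minimal Python sketch of the overwriting version follows. The assumptions here are mine: indices are 0-based, the function name is made up, and the pivot vector piv was worked out by hand so that the sequence of swaps reproduces the 8-entry permutation example from earlier (it is not the output of any particular LU code).

```python
def apply_pivots_in_place(piv, y):
    """Overwrite y with P y by doing the swaps y[k] <-> y[piv[k]] in order."""
    for k in range(len(piv)):
        j = piv[k]
        y[k], y[j] = y[j], y[k]
    return y


# 0-based pivot vector, hand-built to match the permutation (3,7,5,8,4,1,2,6).
piv = [2, 6, 4, 7, 7, 7, 6, 7]
y = [10, 20, 30, 40, 50, 60, 70, 80]   # stand-ins for x1..x8
apply_pivots_in_place(piv, y)
print(y)  # [30, 70, 50, 80, 40, 10, 20, 60]
```

Unlike a permutation vector, a pivot vector can contain repeated entries, and the order in which the swaps are applied matters.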

A quick reality check ...

... or at least as real as computational science can get.

Given y, how should x = P^Ty be computed using either permutation or pivoting vectors?
Given x and a permutation vector, how to permute x "in place", without requiring another array y? [Hint: you can use sign of p_i to mark during permutation whether or not x_i contains its original or permuted entry].
Permutations are no problem in floating point arithmetic - why?

Triangular Systems, Part 1

Given the LU factors and the pivot array, two triangular systems need to be solved. Consider lower triangular systems; upper triangular systems can be handled mutatis mutandis, and it is a great exercise for you to modify all the stuff used on lower triangular systems to the upper triangular case. Also, in Gaussian elimination the lower triangular matrices are always unit lower triangular, meaning that they have ones on the diagonal. That also implies some modifications of the following - modifications which are trivial.

Storage: L can be stored by rows in what is called packed storage:

L:  [l₁₁  l₂₁ l₂₂  l₃₁ l₃₂ l₃₃  l₄₁ l₄₂ l₄₃ l₄₄ ... ]

so that the matrix element l_ij is stored in the array location L(j + i(i−1)/2). However, for LU factorization both L and U need to be stored, and A is just overwritten with them (this works since L is unit lower triangular and so its main diagonal need not be stored, allowing U to use the main diagonal storage.) In that case l_ij is stored in A(i,j). That makes indexing into the array that stores L easier.
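The packed index formula is easy to get off by one, so here is a small Python check. It is a sketch with invented helper names: packed_index takes the 1-based (i, j) used in the formula and returns a 0-based position in a Python list, and the 4×4 matrix is just sample data.

```python
def packed_index(i, j):
    """0-based list position of l_ij (1-based i >= j) in row-packed storage."""
    if not 1 <= j <= i:
        raise ValueError("need 1 <= j <= i for a lower triangular entry")
    return j + i * (i - 1) // 2 - 1


def pack_lower(L):
    """Copy the lower triangle of a full matrix (list of rows) row by row."""
    return [L[i][j] for i in range(len(L)) for j in range(i + 1)]


L_full = [[2, 0, 0, 0],
          [1, 3, 0, 0],
          [-3, 2, 5, 0],
          [1, 1, 1, 1]]
packed = pack_lower(L_full)
print(packed)                      # [2, 1, 3, -3, 2, 5, 1, 1, 1, 1]
print(packed[packed_index(3, 2)])  # 2, which is l_32
```

Packed storage needs n(n+1)/2 locations instead of n², at the price of the index arithmetic on every access.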

As an example, a particular lower triangular system is:

    [ 2    0   0   0 ]  (x₁)     ( −2)
    | 1    3   0   0 |  (x₂)     (  2)   
    | −3   2   5   0 |  (x₃)  =  ( 15)   
    [ 1    1   1   1 ]  (x₄)     (  0)

Everybody would begin with 2*x₁ = −2 giving x₁ = −1, and then solve in order for x₂, x₃, .... Two possibilities occur next: plug in already-known values as you need them to solve for x_i, or, when you find a value, plug it into all the remaining equations before going on to find the next value. This gives two algorithms:

Row-oriented Lower Triangular Solve:

for i = 1:n
    for j = 1:i−1
        b(i) = b(i) − l(i,j)*x(j)
    end for j
    x(i) = b(i)/l(i,i)
end for i

This version has an inner product as innermost loop and accesses rows of L. The second version is

Column-oriented Lower Triangular Solve:

for j = 1:n
    x(j) = b(j)/l(j,j)
    for i = j+1:n
        b(i) = b(i) − l(i,j)*x(j)
    end for i
end for j

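This second version has a daxpy as innermost loop (bad: each pass both loads and stores the entries of b it updates, whereas an inner product accumulates into a single scalar), and accesses columns of L.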
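Both sweeps can be tried on the 4×4 example system. The Python below is a direct transcription of the two loops, with a few choices of my own: 0-based indices, made-up function names, and a copy of b in the column version so the caller's right-hand side is not overwritten.

```python
def lower_solve_row(L, b):
    """Row-oriented solve of L x = b: inner product in the innermost loop."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        s = b[i]
        for j in range(i):
            s -= L[i][j] * x[j]      # uses row i of L
        x[i] = s / L[i][i]
    return x


def lower_solve_col(L, b):
    """Column-oriented solve of L x = b: axpy update in the innermost loop."""
    n = len(b)
    r = list(b)                      # working copy of the right-hand side
    x = [0.0] * n
    for j in range(n):
        x[j] = r[j] / L[j][j]
        for i in range(j + 1, n):
            r[i] -= L[i][j] * x[j]   # uses column j of L
    return x


L = [[2.0, 0.0, 0.0, 0.0],
     [1.0, 3.0, 0.0, 0.0],
     [-3.0, 2.0, 5.0, 0.0],
     [1.0, 1.0, 1.0, 1.0]]
b = [-2.0, 2.0, 15.0, 0.0]
print(lower_solve_row(L, b))  # [-1.0, 1.0, 2.0, -2.0]
print(lower_solve_col(L, b))  # [-1.0, 1.0, 2.0, -2.0]
```

The two functions do exactly the same multiplications and subtractions, just in a different order, so the only real difference is the memory access pattern.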

Algorithm 1 is sometimes called a row sweep, and Algorithm 2 is called a column sweep. Both have block versions possible by simply treating l_ij as an m × m block L_ij. Then

x_j = b_j/ l_jj is replaced by: solve L_jj x_j = b_j, where x_j, b_j are now vectors.
b_i = b_i − L_ij x_j becomes a matrix*vector (BLAS-2) operation.

Block Row-oriented Lower Triangular Solve:

for i = 1:n
    for j = 1:i−1
        b(i) = b(i) − L(i,j)*x(j)
    end for j
    solve L(i,i) x(i) = b(i) for x(i).
end for i

Block Column-oriented Lower Triangular Solve:

for j = 1:n
    solve L(j,j) x(j) = b(j) for x(j)
    for i = j+1:n
        b(i) = b(i) − L(i,j)*x(j)
    end for i
end for j
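As a sanity check, the block row-oriented version can be written out in Python for the same 4×4 example with 2×2 blocks. This is a sketch, not a performance code: the names are invented, indices are 0-based, the block size m is assumed to divide n, and the diagonal blocks are solved with an ordinary scalar forward substitution.

```python
def forward_solve(T, c):
    """Scalar row-oriented solve of the small lower triangular system T z = c."""
    z = []
    for i in range(len(c)):
        s = c[i] - sum(T[i][j] * z[j] for j in range(i))
        z.append(s / T[i][i])
    return z


def block_row_solve(L, b, m):
    """Block row-oriented solve of L x = b with m-by-m blocks (m must divide n)."""
    n = len(b)
    if n % m != 0:
        raise ValueError("block size m must divide n in this sketch")
    x = [0.0] * n
    for I in range(0, n, m):             # block row starting at row I
        rhs = b[I:I + m]                 # slicing copies, so b is untouched
        for J in range(0, I, m):         # blocks to the left of the diagonal
            # rhs = rhs - L_IJ * x_J, a matrix-vector product (BLAS-2)
            for r in range(m):
                rhs[r] -= sum(L[I + r][J + c] * x[J + c] for c in range(m))
        diag = [row[I:I + m] for row in L[I:I + m]]
        x[I:I + m] = forward_solve(diag, rhs)   # solve L_II x_I = rhs
    return x


L = [[2.0, 0.0, 0.0, 0.0],
     [1.0, 3.0, 0.0, 0.0],
     [-3.0, 2.0, 5.0, 0.0],
     [1.0, 1.0, 1.0, 1.0]]
b = [-2.0, 2.0, 15.0, 0.0]
print(block_row_solve(L, b, 2))  # [-1.0, 1.0, 2.0, -2.0]
```

With m = 1 this collapses back to the scalar row sweep, and with m = n it is a single call to the scalar solver.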

OK, by now you should know the drill: what level of BLAS is a triangular solve? What is the memory reference to flop ratio? Should near peak performance or near basement performance be expected on a modern machine?

The minimal number of memory references is what it takes to read L and b and write x. This gives n(n+1)/2 + n + n = (n² + 5n)/2 memory references. The number of flops involved is sum_j=1:n[2*(n−j) + 1] = n². So the ratio of memory references to flops is 1/2 + 5/(2n), a typical BLAS2 number.
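Written out step by step, using the same counts (read the n(n+1)/2 entries of L, read b, write x), the arithmetic is:

```latex
\[
\text{memory references} \;=\; \frac{n(n+1)}{2} + n + n \;=\; \frac{n^2 + 5n}{2},
\qquad
\text{flops} \;=\; \sum_{j=1}^{n} \bigl[\,2(n-j) + 1\,\bigr] \;=\; n(n-1) + n \;=\; n^2,
\]
\[
\frac{\text{memory references}}{\text{flops}}
\;=\; \frac{n^2 + 5n}{2n^2}
\;=\; \frac{1}{2} + \frac{5}{2n}
\;\longrightarrow\; \frac{1}{2}
\quad \text{as } n \to \infty .
\]
```

The second term dies off as n grows, so for large systems the ratio sits essentially at 1/2 no matter how the loops are arranged.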

That also means the two block versions shown above are optimal implementations. Both are based on matrix-vector products, which themselves have a load/store ratio of 1/2. This is vitally important: load/store analysis tells us there is no point in pursuing a more efficient version since no such version exists.

Performing triangular solves is easy, and not worth spending a lot of time optimizing them. The reason is simple: computing the LU factorization requires ≈ (2/3)n³ flops, while the triangular solves require only ≈ 2n² (n² each for L and U). So optimization efforts should go for the low-hanging fruit first: LU factorization. The final stages in getting a good LU will turn out to require triangular solves, but with multiple right-hand sides. That is, it will involve solving L*U = V for U, where V is an n × p matrix. That operation is a level-3 BLAS and worth optimizing ... but we'll let the BLAS take care of that one.
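To make the multiple right-hand side case concrete, here is a naive Python sketch that solves L*X = V for an n × p matrix V stored as a list of rows. The function name and the test data are made up, and this is emphatically not how a tuned BLAS routine such as dtrsm is implemented; it only shows what is being computed. Each entry of L that is read gets reused for all p columns, which is where the better memory reference to flop ratio comes from.

```python
def lower_solve_multi(L, V):
    """Solve L X = V for X, where V is n-by-p (list of rows), by row sweeps."""
    n, p = len(V), len(V[0])
    X = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(p):               # same row of L, reused for every column
            s = V[i][k]
            for j in range(i):
                s -= L[i][j] * X[j][k]
            X[i][k] = s / L[i][i]
    return X


L = [[2.0, 0.0, 0.0, 0.0],
     [1.0, 3.0, 0.0, 0.0],
     [-3.0, 2.0, 5.0, 0.0],
     [1.0, 1.0, 1.0, 1.0]]
# Column 0 is the earlier right-hand side; column 1 is L times (1, 0, 0, 0).
V = [[-2.0, 2.0],
     [2.0, 1.0],
     [15.0, -3.0],
     [0.0, 1.0]]
X = lower_solve_multi(L, V)
print([row[0] for row in X])  # [-1.0, 1.0, 2.0, -2.0]
print([row[1] for row in X])  # [1.0, 0.0, 0.0, 0.0]
```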

Next: Pivoting, and a BLAS-3 LU factorization

Last Modified: Mon 04 Nov 2019, 06:59 AM