Solving Ax = b then becomes LUx = Pb, since PAx=Pb. Three steps can be taken in solving LUx = Pb.
i: 1 2 3 4 5 6 7 8 pi : 3 7 5 8 4 1 2 6gives y = (x3, x7, x5, x8, x4, x1, x2, x6)T. We can compute y by
for i = 1:n
y(i) = x(p(i))
end for i
The corresponding permutation matrix is
[0 0 1 0 0 0 0 0]
|0 0 0 0 0 0 1 0|
|0 0 0 0 1 0 0 0|
P = |0 0 0 0 0 0 0 1|
|0 0 0 1 0 0 0 0|
|1 0 0 0 0 0 0 0|
|0 1 0 0 0 0 0 0|
[0 0 0 0 0 1 0 0]
and clearly no one in their right mind would represent it
using an array with n2 = 64 elements!
OK, maybe you would for 8x8 matrices ... but for 1000 x 1000 matrices,
you are entering territory where institutionalization is
recommended.
The second way of representing permutation matrices P is with pivot vectors, and it is these which we use in Gaussian elimination. This is an integer array piv of length n, applied to a vector x in y = Px using
y = x
for k = 1:n
swap y(k) and y(piv(k)) in the vector y
end for
Of course, this is better done with overwriting - you don't need
to use another vector y.
Now for quick reality check:
Storage: we could store L by rows in what is called packed storage :
L: [l11 l21 l22 l31 l32 l33 l41 l42 l43 l44 ... ]so that the matrix element lij is stored in the array location L(j + i(i-1)/2). However, for Gaussian elimination both L and U need to be stored, and generally A is just overwritten with them (this works since L is unit lower triangular and so its main diagonal need not be stored, allowing U to use the main diagonal storage.) In that case lij is stored in A(i,j). That makes indexing into the array storing L easier.
Now let's solve a particular lower triangular system:
[ 2 0 0 0 ] (x1) ( -2)
| 1 3 0 0 | (x2) ( 2)
| -3 2 5 0 | (x3) = ( 15)
[ 1 1 1 1 ] (x4) ( 0)
Everybody would begin with 2 x1 = -2 giving x1 = -1,
and then solve in order for x2, x3, ....
Two possibilities occur next:
Plug in already-known values as you need them in order to
solve for xi, or when you find a value, plug it into all
the remaining equations before going on to find the next value. Gives
two algorithms:
for i = 1:n
for j = 1:i-1
b(i) = b(i) - l(i,j)*x(j)
end for j
x(i) = b(i)/l(i,i)
end for i
This version has an inner product as innermost loop and
accesses rows of L. The second version is
for j = 1:n
x(j) = b(j)/l(j,j)
for i = j+1:n
b(i) = b(i) - l(i,j)*x(j)
end for i
end for j
This version has a daxpy as innermost loop (bad), and
accesses columns of L.
Algorithm 1 is sometimes called a row sweep algorithm, and Algorithm 2 is called a column sweep algorithm. Both algorithms have a block version possible by simply treating lij as an m x m block Lij. Then
for i = 1:n
for j = 1:i-1
b(i) = b(i) - L(i,j)*x(j)
end for j
solve L(i,i) x(i) = b(i) for x(i).
end for i
for j = 1:n
solve L(j,j) x(j) = b(j) for x(j)
for i = j+1:n
b(i) = b(i) - L(i,j)*x(j)
end for i
end for j
OK, by now you should know the drill: what kind of computational kernel is
a triangular solve? What is the memory reference to flop ratio? Should
we expect near peak performance or near basement performance on a
high-performance machine?
The minimal number of memory references are to read L and b and write x. This gives n(n+1)/2 + n + n = (n2 + 5n)/2 memory references. The number of flops involved is sumj=1:n[2*(n-j) + 1] = n2. So the ratio of memory references to flops is 1/2 + (5/2)n, a typical BLAS-2 number.
Performing triangular solves is not too hard, and we won't spend much time optimizing them. The reason is simple: computing the LU factorization requires (2/3)n3 flops, while the triangular solves require only 4n2 (2n2 each for L and U). So our efforts should go for the low-hanging fruit first: LU factorization.
Next page: Pivoting