Block-Householder QR factorization
As usual, the key to a BLAS-3 version is to use deferred updates, in this
case applied to the Householder QR factorization. Let
Qi be the ith partial product for the H-reflectors and
suppose it can be represented using two matrices (not vectors)
I − VI UIT:
Qi = Hi ... H2 H1
= I − VI UIT
The recursion defining the next Q is
Qi+1 = Hi+1 Qi
= (I − wi+1wi+1T)(I − ViUiT)
= Hi+1 − Hi+1 Vi UiT
= I − wiwiT
− Hi+1 Vi UiT
= I − [Hi+1 Vi wi]
[ Ui wii]T
= I − [Vi wi+1]
[ Ui QiTwi+1]
Let ν (that is a "nu", not "vee") be the chosen block size.
So a block of transforms can be built up for a block column C of A by:
C = [c1, c2, ... , cν]
U = [ ], V = [ ]
for j = 1:ν
Find Householder wj s.t. Hj cj is zero in j+1:n
V = [ Hj*V, wj]
U = [ U, wj]
if (j < ν), cj+1 = (I − VUT)cj+1.
A complete algorithm for the case where the block size ν evenly divides n is:
A = [A1, A2, ... , Ap], where p*ν = n.
for k = 1:p
Set i = (k−1)*ν + 1
Find U, V for C = A(i:m,i:i+ν−1)
A = A(i:m,i+ν:1) = (I − VUT) A(i:m,i+ν:1)
The way the block algorithm is stated, it requires two temporary arrays U and V,
which can be of size up to m×ν. In practice only one of the two needs
to be allocated and the other is stored in the corresponding part of A.
Also, how should the block size ν be chosen? The same as the value that would
give maximal computational rate in matrix*matrix multiply, or something else?
A professional implementation could take into account the decreasing sizes of
the arrays being used as the factorization proceeds and use a small ν at the
start, then increase it as the columns in U and V become shorter.
What is the maximal performance boost that could be expected from doing that?
If small, then why bother? If large, how much time should be allocated to coding
that more dynamic version?
All of those questions can be answered with the tools you now have:
- Identify what different BLAS operations are used in each step above
- Compute the memory reference to flop ratio of each of those kernels
- Determine the effective size of the cache(s) on the target machine
- Find approximately what the optimal block size is for a matrix*matrix
operation on the target machine
The first two require no coding at all, and you should be able to knock those off
in an hour. Their information also will not depend on which normalization
or storage scheme used for the Housholder reflectors.
Next: Singular value factorization
Started: Mon 23 Aug 2010, 06:00 PM
Modified: Tue 09 Nov 2010, 12:48 PM to add in the Q_1 versus U_1 note
Last Modified: Mon 04 Nov 2019, 07:19 AM