Block-Householder QR factorization

As usual, the key to a BLAS-3 version is to use deferred updates, in this case applied to the Householder QR factorization. Let Q_i be the i^th partial product of the H-reflectors, and suppose it can be represented using two matrices (not vectors) V_i and U_i:

Q_i = H_i ... H₂ H₁ = I − V_i U_i^T

The recursion defining the next Q is
Q_i+1 = H_i+1 Q_i
         = (I − w_i+1w_i+1^T)(I − V_iU_i^T)
         = H_i+1 − H_i+1 V_i U_i^T
         = I − w_i+1w_i+1^T − H_i+1 V_i U_i^T
         = I − [H_i+1 V_i, w_i+1] [U_i, w_i+1]^T
         = I − [V_i, w_i+1] [U_i, Q_i^Tw_i+1]^T
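
The two update rules at the end of the recursion can be checked numerically. The following is a minimal sketch, assuming NumPy and assuming each reflector is normalized so that w^T w = 2 (which makes H = I − w w^T orthogonal); the sizes and the random vectors are illustrative only. Rule 1 is V ← [H V, w], U ← [U, w]; rule 2 is V ← [V, w], U ← [U, Q^T w].

```python
import numpy as np

rng = np.random.default_rng(0)
n, steps = 6, 4                                # illustrative sizes only

Q = np.eye(n)                                  # explicit product, kept only for checking
V1, U1 = np.zeros((n, 0)), np.zeros((n, 0))    # accumulated with rule 1
V2, U2 = np.zeros((n, 0)), np.zeros((n, 0))    # accumulated with rule 2

for _ in range(steps):
    w = rng.standard_normal(n)
    w *= np.sqrt(2.0) / np.linalg.norm(w)      # assumed normalization: w^T w = 2
    H = np.eye(n) - np.outer(w, w)

    # Rule 1: V <- [H V, w], U <- [U, w]
    V1 = np.column_stack([H @ V1, w])
    U1 = np.column_stack([U1, w])

    # Rule 2: V <- [V, w], U <- [U, Q^T w], where Q^T w = w - U (V^T w)
    qtw = w - U2 @ (V2.T @ w)
    V2 = np.column_stack([V2, w])
    U2 = np.column_stack([U2, qtw])

    # Both representations must reproduce the explicit product H ... H2 H1.
    Q = H @ Q
    assert np.allclose(Q, np.eye(n) - V1 @ U1.T)
    assert np.allclose(Q, np.eye(n) - V2 @ U2.T)
```

Note that rule 2 never forms Q explicitly: Q^T w is obtained from two matrix*vector products with the current U and V.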

Let ν (that is a "nu", not "vee") be the chosen block size. So a block of transforms can be built up for a block column C of A by:

C = [c₁, c₂, ... , c_ν] 

U = [ ], V = [ ] 

for j = 1:ν
    Find Householder w_j s.t. H_j c_j is zero below row j
    V = [ H_j*V, w_j]
    U = [ U, w_j]
    if (j < ν), c_j+1 = (I − VU^T)c_j+1.
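
The loop above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not a tuned implementation: the names `householder_vector` and `build_block_reflector` are made up here, the reflectors are assumed normalized so that w^T w = 2, and the block column is copied rather than overwritten.

```python
import numpy as np

def householder_vector(c, j):
    """Return w with w[:j] = 0 and w^T w = 2 such that (I - w w^T) c is zero below row j."""
    w = np.zeros(c.shape[0])
    x = np.asarray(c[j:], dtype=float)
    norm_x = np.linalg.norm(x)
    if norm_x == 0.0:
        return w                              # nothing to eliminate, so H = I
    v = x.copy()
    v[0] += np.copysign(norm_x, x[0])         # sign chosen to avoid cancellation
    w[j:] = v * (np.sqrt(2.0) / np.linalg.norm(v))
    return w

def build_block_reflector(C):
    """Return V, U such that (I - V U^T) C is upper triangular."""
    C = np.array(C, dtype=float)              # work on a copy
    m, nu = C.shape
    V, U = np.zeros((m, 0)), np.zeros((m, 0))
    for j in range(nu):
        w = householder_vector(C[:, j], j)
        # V = [H_j V, w_j], with H_j V computed as V - w (w^T V)
        V = np.column_stack([V - np.outer(w, w @ V), w])
        U = np.column_stack([U, w])
        if j + 1 < nu:
            # Deferred update: the next column sees all j reflectors at once.
            C[:, j + 1] -= V @ (U.T @ C[:, j + 1])
    return V, U
```

A quick check on a random 7×3 block: with `V, U = build_block_reflector(C)`, the product `C - V @ (U.T @ C)` should be zero below the diagonal up to rounding.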

A complete algorithm for the case where the block size ν evenly divides n is:

A = [A₁, A₂, ... , A_p], where p*ν = n.

for k = 1:p
    Set i  = (k−1)*ν + 1
    Find U, V for C = A(i:m,i:i+ν−1)
    A(i:m,i+ν:n) = (I − VU^T) A(i:m,i+ν:n)
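
The complete algorithm can be sketched the same way. This is a minimal NumPy illustration under a few assumptions: m ≥ n, ν divides n, reflectors are normalized so that w^T w = 2, and the function names (`householder_vector`, `factor_panel`, `blocked_householder_qr`) are hypothetical. Unlike the pseudocode, it allocates both U and V explicitly and returns only the triangular factor.

```python
import numpy as np

def householder_vector(c, j):
    """Return w with w[:j] = 0 and w^T w = 2 such that (I - w w^T) c is zero below row j."""
    w = np.zeros(c.shape[0])
    x = np.asarray(c[j:], dtype=float)
    norm_x = np.linalg.norm(x)
    if norm_x == 0.0:
        return w                                   # H = I
    v = x.copy()
    v[0] += np.copysign(norm_x, x[0])
    w[j:] = v * (np.sqrt(2.0) / np.linalg.norm(v))
    return w

def factor_panel(panel):
    """Triangularize `panel` in place and return V, U for its block reflector."""
    m, nu = panel.shape
    V, U = np.zeros((m, 0)), np.zeros((m, 0))
    for j in range(nu):
        w = householder_vector(panel[:, j], j)
        panel[:, j] -= w * (w @ panel[:, j])       # finish column j with H_j
        V = np.column_stack([V - np.outer(w, w @ V), w])
        U = np.column_stack([U, w])
        if j + 1 < nu:
            panel[:, j + 1] -= V @ (U.T @ panel[:, j + 1])
    return V, U

def blocked_householder_qr(A, nu):
    """Return R from a blocked Householder QR of A (block size nu must divide n)."""
    R = np.array(A, dtype=float)                   # work on a copy
    m, n = R.shape
    if n % nu != 0:
        raise ValueError("block size must divide the number of columns")
    for k in range(n // nu):
        i = k * nu
        V, U = factor_panel(R[i:, i:i + nu])       # the slice is a view into R
        # Trailing update: two matrix*matrix products (the BLAS-3 part).
        R[i:, i + nu:] -= V @ (U.T @ R[i:, i + nu:])
    return R
```

For example, `R = blocked_householder_qr(A, 2)` on a random 8×6 matrix should give a matrix that is zero below the diagonal up to rounding and satisfies R^T R = A^T A.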

As stated, the block algorithm requires two temporary arrays U and V, each of size up to m×ν. In practice only one of the two needs to be allocated; the other is stored in the corresponding part of A.

Also, how should the block size ν be chosen? Should it be the value that gives the maximal computational rate in matrix*matrix multiply, or something else? A professional implementation could take into account the decreasing sizes of the arrays as the factorization proceeds: use a small ν at the start, then increase it as the columns of U and V become shorter. What is the maximal performance boost that could be expected from doing that? If small, then why bother? If large, how much time should be allocated to coding that more dynamic version?

All of those questions can be answered with the tools you now have:

1. Identify what different BLAS operations are used in each step above
2. Compute the memory reference to flop ratio of each of those kernels
3. Determine the effective size of the cache(s) on the target machine
4. Find approximately what the optimal block size is for a matrix*matrix operation on the target machine

The first two require no coding at all, and you should be able to knock those off in an hour. Their answers also do not depend on which normalization or storage scheme is used for the Householder reflectors.
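
As a starting point for the ratio computation, here is a small sketch that counts flops and memory references for a matrix*vector and a matrix*matrix kernel. It assumes an idealized model in which every operand element is touched in main memory exactly once (so it ignores cache effects entirely); the function names and the sizes are illustrative, not taken from any library.

```python
def gemv_counts(m, n):
    """y <- y + A x with A m-by-n: return (flops, memory references)."""
    flops = 2 * m * n
    refs = m * n + n + 2 * m              # read A, read x, read and write y
    return flops, refs

def gemm_counts(m, n, k):
    """C <- C + A B with A m-by-k and B k-by-n: return (flops, memory references)."""
    flops = 2 * m * n * k
    refs = m * k + k * n + 2 * m * n      # read A, read B, read and write C
    return flops, refs

def ratio(counts):
    """Memory references per flop."""
    flops, refs = counts
    return refs / flops

# Illustrative sizes: m rows, n_trailing trailing columns, block size nu.
m, n_trailing, nu = 2000, 2000, 32
blas2 = ratio(gemv_counts(m, n_trailing))       # one reflector applied at a time
blas3 = ratio(gemm_counts(nu, n_trailing, m))   # W = U^T A for a block of nu reflectors
print(f"BLAS-2 refs/flop: {blas2:.4f}   BLAS-3 refs/flop: {blas3:.4f}")
```

Under this model the matrix*vector ratio stays near 1/2 regardless of size, while the matrix*matrix ratio shrinks roughly in proportion to 1/ν, which is the whole motivation for deferring the updates.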


Started: Mon 23 Aug 2010, 06:00 PM

Modified: Tue 09 Nov 2010, 12:48 PM to add in the Q_1 versus U_1 note

Last Modified: Mon 04 Nov 2019, 07:19 AM