The most space- and computation-efficient ways of solving LLS problems rely on factorization (AKA decomposition) of an m × n coefficient matrix A using orthogonal matrices. The key property of an orthogonal matrix is that it does not change the two-norm when it multiplies a vector or matrix:

||QB||2 = ||B||2

for any orthogonal matrix Q and conformant matrix B with n rows (B can also be an n × 1 vector). This norm-preserving property holds up surprisingly well both in theory and in actual numerical computations. The more efficient method of factoring A using an orthogonal matrix is QR factorization, but the singular value decomposition (SVD) provides useful and more complete analytical information about LLS. Don't forget the warning: the QR factorization is not the same thing as the QR algorithm. The QR algorithm is an iterative method for finding the eigenvalues of the matrix A, which involves computing a QR factorization on each iteration. Of course, when A is rectangular the QR algorithm cannot be what is used. Why is that so?
The singular value decomposition is possibly the single most informative, if not the most important, matrix decomposition for numerical methods and theory. The SVD is a decomposition (factorization) of a matrix A into the product of three matrices:

A = UDVT,

where U is an m × m orthogonal matrix, D is an m × n diagonal matrix with nonnegative diagonal entries, and V is an n × n orthogonal matrix.
As part of the factorization the diagonal entries of D are usually ordered as

σ1 ≥ σ2 ≥ ... ≥ σp ≥ 0,
where p = min(m, n). Question: if A is m × n and m ≥ n, this description of the ordering of singular values implicitly assumes which of the variants mentioned above is being used?
The diagonal entries of D are called the singular values of A. The ith left singular vector of A is the ith column of U, and the ith right singular vector of A is the ith column of V. This numbering assumes the singular values are sorted in descending order, so that the first left singular vector of A corresponds to the largest singular value of A. In all that follows, that ordering is assumed.
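To make the shapes concrete, here is a small NumPy sketch (the text's computations use Matlab; the matrix A below is made up for illustration) that verifies the three factors and the ordering of the singular values:

```python
import numpy as np

# An arbitrary 3 x 2 example, following the m >= n convention used in the text.
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

U, s, Vt = np.linalg.svd(A)   # full SVD: U is 3x3, s holds the sigmas, Vt is V^T
D = np.zeros(A.shape)         # rebuild the 3x2 "diagonal" matrix D
np.fill_diagonal(D, s)

assert np.allclose(U @ U.T, np.eye(3))         # U is orthogonal
assert np.allclose(Vt.T @ Vt, np.eye(2))       # V is orthogonal
assert np.all(np.diff(s) <= 0) and s[-1] >= 0  # descending, nonnegative sigmas
assert np.allclose(U @ D @ Vt, A)              # A = U D V^T
```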
The SVD reveals more information about a matrix than any other decomposition. When A is symmetric and positive definite, the Schur decomposition (eigendecomposition) A = QDQT gives the eigenvalues in the diagonal matrix D and the eigenvectors in the columns of the orthogonal matrix Q. In this case, the Schur decomposition is the same as the SVD (with U = V = Q), and for this reason the SVD can be viewed as a better way to extend the ideas of eigenvalues of square matrices to nonsymmetric and rectangular matrices. Information that the SVD reveals includes:
One of these is the rank-1 expansion

A = σ1u1v1T + σ2u2v2T + ... + σpupvpT.

Each term in the summation is a rank-1 matrix, and since the vectors ui and vi are columns of orthogonal matrices, they are linearly independent. The ith term is the singular value σi times the outer product of an m-vector ui and an n-vector vi. Because ui and vi are columns of orthogonal matrices, their 2-norms are 1, so any "weight" a term has comes solely from its multiplier, the singular value σi. Because the singular values are sorted in decreasing order, heuristically the first terms have more importance than the later terms.
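The rank-1 expansion is easy to verify numerically. This NumPy sketch (with an arbitrary example matrix) rebuilds A from the terms σi ui viT:

```python
import numpy as np

A = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 1.0],
              [1.0, 1.0, 1.0],
              [0.0, 1.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Sum of rank-1 terms sigma_i * u_i * v_i^T reconstructs A exactly.
expansion = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
assert np.allclose(expansion, A)
```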
Another is the (least squares) condition number of A,

κ(A) = σ1/σr,

where σr is the smallest nonzero singular value of A. Recall, a "condition number" for a matrix can have different definitions in different contexts. For a nonsingular (and hence square) matrix A, it was κ(A) = ||A|| * ||A-1||. Is that the same number you get in least squares, for the special case of the matrix A being square and nonsingular? You have enough information at this point to be able to answer the question, with either a proof or a small counterexample. So consider this as quiz question 3. Don't bother reading further without doing this - the rest of this material won't be worth dirt if you don't understand or are unable to figure this out.
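As a numerical illustration of σ1/σr (this does not answer the quiz question), here is a sketch with a made-up matrix whose singular values can be read off by inspection:

```python
import numpy as np

# The columns are orthogonal with norms 3 and 0.5, so those are the sigmas.
A = np.array([[3.0, 0.0],
              [0.0, 0.5],
              [0.0, 0.0]])

s = np.linalg.svd(A, compute_uv=False)
kappa = s[0] / s[s > 1e-12][-1]   # largest over smallest *nonzero* sigma
assert np.isclose(kappa, 6.0)     # 3 / 0.5
```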
Another revealed fact concerns the transpose:

AT = VDTUT,

where U, D, and V are the matrices given by the SVD of A. This is one reason assuming that A is m × n with m ≥ n is reasonable: the SVD of the transpose is the transpose of the SVD.
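A quick numerical check of the transpose relationship, with an arbitrary example matrix:

```python
import numpy as np

A = np.arange(12.0).reshape(4, 3)

# Transposing A swaps the roles of U and V but keeps the singular values.
sA = np.linalg.svd(A, compute_uv=False)
sAT = np.linalg.svd(A.T, compute_uv=False)
assert np.allclose(sA, sAT)

# And A^T is reconstructed from the transposed factors of A's SVD.
U, s, Vt = np.linalg.svd(A)
D = np.zeros(A.shape)
np.fill_diagonal(D, s)
assert np.allclose(Vt.T @ D.T @ U.T, A.T)
```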
Make sure you have the idea down. The SVD is well-defined for any matrix, and you should be able to quickly state what it is for
Truncating the rank-1 expansion after k terms defines

Ak = σ1u1v1T + σ2u2v2T + ... + σkukvkT.

Then Ak is the best rank-k approximation to A in the Frobenius norm. This is of great importance in data reduction: it means that a huge matrix A containing data can be approximated by a much smaller one. To pin down this advantage, some problems have m ≈ 100M rows and about n ≈ 0.5M columns. In those applications, using k = 30 is sometimes adequate to capture the important information from the matrix, requiring over 16,000 times less storage and processing. The most important performance engineering principle in scientific computing is that numerical computation is not the bottleneck or limiting factor; moving data around is. So the ability to represent or use information with significantly less data can be a major win. This "best approximation" property of the SVD gives a mathematically rigorous way of doing this systematically.
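The truncated sum Ak and its error are easy to check numerically. In this NumPy sketch (random data, purely for illustration), the Frobenius-norm error of the rank-k truncation is exactly the square root of the sum of squares of the discarded singular values:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
Ak = (U[:, :k] * s[:k]) @ Vt[:k, :]   # truncated SVD: sum of first k terms

err = np.linalg.norm(A - Ak, 'fro')
assert np.isclose(err, np.sqrt(np.sum(s[k:]**2)))
assert np.linalg.matrix_rank(Ak) == k
```

Note the storage win already visible here: Ak is represented by k columns of U, k singular values, and k rows of VT, not by a full m × n array.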
One example application of this "low-rank approximation property" is a document-text matrix from analyzing a large set of documents (e.g., all of the tens of thousands of published research articles about cancer). The document-text matrix A is defined such that A(i,j) = the number of times word j occurs in document i. Another example is a large database of pictures of faces, with a set of 30-40 characteristics measured for each face (nose length, eye width/height ratio, etc.). Then the entry A(i,j) is the jth characteristic value for the ith face. A recent application of interest is the ability to identify people from video cameras mounted in public places, and to track someone who moves out of the range of view of one camera and later enters the range of view of another. The SVD can capture the most important features by saving just the first few singular values and corresponding singular vectors, with big savings in memory, computation, and time. The SVD also allows stating mathematically exactly how accurate that approximation is (at least, in a particular norm).
The SVD not only gives a solution to LLS, but also provides the minimum norm solution when null(A) is non-trivial. One heuristic for the formula that gives the min-norm LLS x* is to consider what a "solution" to Ax = b is when A = UDVT and all of A, U, D, and V are nonsingular. In that case,

x = A-1b = (UDVT)-1b,
and since the inverse of a product is the product of inverses in reverse order,

x = (VT)-1D-1U-1b = V-TD-1U-1b.
Since U and V are orthogonal, U-1 = UT and V-T = V. That means

x = VD-1UTb.
For a more general case, notice everything in the last expression is ok, except possibly D-1 since some of the singular values (the diagonal elements of D) might be zero. So we'll just do what is possible: invert the diagonal entries in D that are nonzero, but leave the other diagonal entries alone as zeros. That gives ...
The minimum norm least squares solution x* to minx || Ax − b ||2 is

x* = VD†UTb,
where D† is the n × m diagonal matrix with the diagonal entries σi† = σi−1 if σi ≠ 0, and σi† = 0 if σi = 0. [The symbol † is "dagger", so D† is pronounced "D dagger".]
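Putting the formula to work: a NumPy sketch (the rank-deficient matrix and right-hand side below are made up for illustration) that builds D† exactly as described and compares against NumPy's built-in Moore-Penrose pseudoinverse:

```python
import numpy as np

# A deliberately rank-deficient problem: the third column is the sum of
# the first two, so rank(A) = 2 < n = 3 and null(A) is non-trivial.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [2.0, 0.0, 2.0]])
b = np.array([1.0, 2.0, 0.0, 1.0])

U, s, Vt = np.linalg.svd(A)                      # full SVD
tol = max(A.shape) * np.finfo(float).eps * s[0]  # declare tiny sigmas zero
sdag = np.array([1.0/si if si > tol else 0.0 for si in s])

Ddag = np.zeros((A.shape[1], A.shape[0]))        # D-dagger is n x m
np.fill_diagonal(Ddag, sdag)

x_star = Vt.T @ Ddag @ U.T @ b                   # x* = V D-dagger U^T b

# Agrees with NumPy's Moore-Penrose pseudoinverse applied to b.
assert np.allclose(x_star, np.linalg.pinv(A) @ b)
```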
The name suggests other pseudoinverses exist, and indeed there are bucket-loads of them. In general they invert what can be inverted, and differ mainly in what they do on null(A). Most often, the term "pseudoinverse" without qualification means the Moore-Penrose one, but beware that some older works, especially in engineering, may use a different pseudoinverse.
Almost none of the useful properties above require U2 in the partitioning of U, because those columns are being multiplied by zeros in D. Discarding those yields the space-saver version of the SVD: for m ≥ n, A = UDVT as before, but now D is n × n and U has just the first n columns of the full SVD matrix U (called U1 above). In Matlab, [U,D,V] = svd(A) gives the full SVD of A, and [U,D,V] = svd(A,0) is the space-saver version.
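The Matlab calls above have a direct NumPy analogue via the full_matrices flag; a small sketch with a made-up matrix:

```python
import numpy as np

A = np.arange(1.0, 13.0).reshape(4, 3)   # m = 4 >= n = 3

# Full SVD vs the "space-saver" (economy) SVD.  In Matlab these are
# svd(A) and svd(A,0); in NumPy the switch is full_matrices.
Uf, sf, Vtf = np.linalg.svd(A, full_matrices=True)
Ue, se, Vte = np.linalg.svd(A, full_matrices=False)

assert Uf.shape == (4, 4) and Ue.shape == (4, 3)   # U1 has only n columns
assert np.allclose(sf, se)                         # same singular values
assert np.allclose((Ue * se) @ Vte, A)             # still reconstructs A
```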
The SVD is the most numerically reliable method, is the only completely reliable one for rank-deficient A, and works for any matrix A. The problem is that the SVD is expensive to compute, often requiring an iterative process that can take an unknown number of steps. More difficult is the "rank determination" problem: when the singular values start dribbling off to small values, how do you decide when to declare them effectively zero? Nevertheless, for small problems the SVD is the best approach. To get an idea of what "small" means, a 2.8 GHz quad-core Intel with 12 Gbytes of memory takes less than 1.5 seconds to compute the space-saver version of the SVD of a 5000 × 1000 matrix in Matlab.
Want to know if you understand this material? Easy: consider using the space-saver SVD in the Moore-Penrose formula for the min-norm LLS solution x*. What needs to change to avoid using any of U2? What if rank(A) = r < n? Work through those cases, and keep track of just what size each (sub)matrix is as you go along. Beware: U1T U1 = I (and just what size is the identity matrix I in the last equation?), but U1U1T ≠ I (and just what size is matrix U1U1T?). In class, U1U1T was stated to be an orthogonal projector; what famous subspace related to the matrix A does it project onto? What about I - U1U1T?
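Without giving away the in-class questions, the size claims about U1TU1 and U1U1T are easy to check numerically (the example matrix is made up for illustration):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [1.0, 0.0],
              [2.0, 2.0]])          # m = 4, n = 2, full column rank

U1, s, Vt = np.linalg.svd(A, full_matrices=False)  # U1: first n columns of U
P = U1 @ U1.T

assert np.allclose(U1.T @ U1, np.eye(2))   # n x n identity
assert not np.allclose(P, np.eye(4))       # but U1 U1^T (m x m) is NOT I
assert np.allclose(P @ P, P)               # P is a projector ...
assert np.allclose(P @ A, A)               # ... that leaves the columns of A fixed
```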
For larger problems, use QR. Two major ways (Givens and Householder) are available and each gives different computational kernels. Load/store analysis indicates Householder reflectors are a better choice because they use level 2 instead of level 1 BLAS. As usual, deferred updates then provide a way to boost the BLAS-2 version to a BLAS-3 one.
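For comparison with the SVD route, here is a NumPy sketch of solving an LLS problem by QR (NumPy's qr calls a Householder-based LAPACK routine); the data is made up for illustration:

```python
import numpy as np

# A full-rank 4 x 2 problem: fit a line to four points.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([1.0, 2.0, 2.0, 4.0])

Q, R = np.linalg.qr(A)            # "reduced" QR: Q is 4x2, R is 2x2
x = np.linalg.solve(R, Q.T @ b)   # solve R x = Q^T b

# Same answer as a library least squares solver.
assert np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0])
```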
For extreme-scale problems, an iterative method is often a better choice. Those have not been covered in these pages. However, entire books have been written on least squares. Topics not treated here include: