The most space- and computation-efficient ways of solving LLS problems rely on factorization (AKA decomposition) of an m × n coefficient matrix A using orthogonal matrices. The key property of an orthogonal matrix is that it does not change the two-norm when it multiplies a vector or matrix:

||QB||2 = ||B||2

for any orthogonal matrix Q and conformant matrix B with n rows (B can also be an n × 1 vector). This norm-preserving property holds up surprisingly well both in theory and in actual numerical computations. The more efficient method of factoring A using an orthogonal matrix is QR factorization, but the singular value decomposition (SVD) provides useful and more complete analytical information about LLS. Don't forget the warning: the QR factorization is not the same thing as the QR algorithm. The QR algorithm is an iterative method for finding the eigenvalues of the matrix A, which involves computing a QR factorization on each iteration. Of course, when A is rectangular the QR algorithm cannot be what is used. Why is that so?
The singular value decomposition is possibly the single most informative, if not the most important, matrix decomposition for numerical methods and theory. The SVD is a decomposition (factorization) of a matrix A into the product of three matrices:

A = UDVT,

where U is an m × m orthogonal matrix, D is an m × n diagonal matrix with nonnegative diagonal entries, and V is an n × n orthogonal matrix.
As part of the factorization the diagonal entries of D are usually ordered as

σ1 ≥ σ2 ≥ ... ≥ σp ≥ 0,
where p = min(m, n). Question: if A is m × n and m ≥ n, this description of the ordering of singular values implicitly assumes which of the variants mentioned above is being used?
The diagonal entries of D are called the singular values of A. The ith left singular vector of A is the ith column of U, and the ith right singular vector of A is the ith column of V. This numbering assumes the singular values are sorted in descending order, so that the first left singular vector of A corresponds to the largest singular value of A. In all that follows, that ordering is assumed.
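To make the shapes concrete, here is a small NumPy sketch (the text's computations use Matlab; the matrix A below is made up for illustration) that verifies the three factors and the ordering of the singular values:

```python
import numpy as np

# An arbitrary 3 x 2 example, following the m >= n convention used in the text.
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

U, s, Vt = np.linalg.svd(A)   # full SVD: U is 3x3, s holds the sigmas, Vt is V^T
D = np.zeros(A.shape)         # rebuild the 3x2 "diagonal" matrix D
np.fill_diagonal(D, s)

assert np.allclose(U @ U.T, np.eye(3))         # U is orthogonal
assert np.allclose(Vt.T @ Vt, np.eye(2))       # V is orthogonal
assert np.all(np.diff(s) <= 0) and s[-1] >= 0  # descending, nonnegative sigmas
assert np.allclose(U @ D @ Vt, A)              # A = U D V^T
```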
The SVD reveals more information about a matrix than any other decomposition. When A is symmetric and positive definite, the Schur decomposition (eigendecomposition) A = QDQT gives the eigenvalues in the diagonal matrix D and the eigenvectors in the columns of the orthogonal matrix Q. In this case, the Schur decomposition is the same as the SVD (with U = V = Q), and for this reason the SVD can be viewed as a better way to extend the ideas of eigenvalues of square matrices to nonsymmetric and rectangular matrices. Information that the SVD reveals includes:
One of these is the rank-1 expansion

A = σ1u1v1T + σ2u2v2T + ... + σpupvpT.

Each term in the summation is a rank-1 matrix, and since the vectors ui and vi are columns of orthogonal matrices, they are linearly independent. The ith term is the singular value σi times the outer product of an m-vector ui and an n-vector vi. Because ui and vi are columns of orthogonal matrices, their 2-norms are 1, so any "weight" a term has comes solely from its multiplier, the singular value σi. Because the singular values are sorted in decreasing order, heuristically the first terms have more importance than the later terms.
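The rank-1 expansion is easy to verify numerically. This NumPy sketch (with an arbitrary example matrix) rebuilds A from the terms σi ui viT:

```python
import numpy as np

A = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 1.0],
              [1.0, 1.0, 1.0],
              [0.0, 1.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Sum of rank-1 terms sigma_i * u_i * v_i^T reconstructs A exactly.
expansion = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
assert np.allclose(expansion, A)
```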
Another is the (least squares) condition number of A,

κ(A) = σ1/σr,

where σr is the smallest nonzero singular value of A. Recall, a "condition number" for a matrix can have different definitions in different contexts. For a nonsingular (and hence square) matrix A, it was κ(A) = ||A|| * ||A-1||. Is that the same number you get in least squares, for the special case of the matrix A being square and nonsingular? You have enough information at this point to be able to answer the question, with either a proof or a small counterexample. So consider this as quiz question 3. Don't bother reading further without doing this - the rest of this material won't be worth dirt if you don't understand or are unable to figure this out.
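As a numerical illustration of σ1/σr (this does not answer the quiz question), here is a sketch with a made-up matrix whose singular values can be read off by inspection:

```python
import numpy as np

# The columns are orthogonal with norms 3 and 0.5, so those are the sigmas.
A = np.array([[3.0, 0.0],
              [0.0, 0.5],
              [0.0, 0.0]])

s = np.linalg.svd(A, compute_uv=False)
kappa = s[0] / s[s > 1e-12][-1]   # largest over smallest *nonzero* sigma
assert np.isclose(kappa, 6.0)     # 3 / 0.5
```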
Another revealed fact concerns the transpose:

AT = VDTUT,

where U, D, and V are the matrices given by the SVD of A. This is one reason assuming that A is m × n with m ≥ n is reasonable: the SVD of the transpose is the transpose of the SVD.
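A quick numerical check of the transpose relationship, with an arbitrary example matrix:

```python
import numpy as np

A = np.arange(12.0).reshape(4, 3)

# Transposing A swaps the roles of U and V but keeps the singular values.
sA = np.linalg.svd(A, compute_uv=False)
sAT = np.linalg.svd(A.T, compute_uv=False)
assert np.allclose(sA, sAT)

# And A^T is reconstructed from the transposed factors of A's SVD.
U, s, Vt = np.linalg.svd(A)
D = np.zeros(A.shape)
np.fill_diagonal(D, s)
assert np.allclose(Vt.T @ D.T @ U.T, A.T)
```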
Make sure you have the idea down. The SVD is well-defined for any matrix, and you should be able to quickly state what it is for
Truncating the rank-1 expansion after k terms defines

Ak = σ1u1v1T + σ2u2v2T + ... + σkukvkT.

Then Ak is the best rank-k approximation to A in the Frobenius norm. This is of great importance in data reduction: it means that a huge matrix A containing data can be approximated by a much smaller one. To pin down this advantage, some problems have m ≈ 100M rows and about n ≈ 0.5M columns. In those applications, using k = 30 is sometimes adequate to capture the important information from the matrix, requiring over 16,000 times less storage and processing. The most important performance engineering principle in scientific computing is that numerical computation is not the bottleneck or limiting factor; moving data around is. So the ability to represent or use information with significantly less data can be a major win. This "best approximation" property of the SVD gives a mathematically rigorous way of doing this systematically.
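The truncated sum Ak and its error are easy to check numerically. In this NumPy sketch (random data, purely for illustration), the Frobenius-norm error of the rank-k truncation is exactly the square root of the sum of squares of the discarded singular values:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
Ak = (U[:, :k] * s[:k]) @ Vt[:k, :]   # truncated SVD: sum of first k terms

err = np.linalg.norm(A - Ak, 'fro')
assert np.isclose(err, np.sqrt(np.sum(s[k:]**2)))
assert np.linalg.matrix_rank(Ak) == k
```

Note the storage win already visible here: Ak is represented by k columns of U, k singular values, and k rows of VT, not by a full m × n array.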
One example application of this "low-rank approximation property" is a document-text matrix from analyzing a large set of documents (e.g., all of the tens of thousands of published research articles about cancer). The document-text matrix A is defined such that A(i,j) = the number of times word j occurs in document i. Another example is a large database of pictures of faces, with a set of 30-40 characteristics measured for each face (nose length, eye width/height ratio, etc.). Then the entry A(i,j) is the jth characteristic value for the ith face. A recent application of interest is the ability to identify people from video cameras mounted in public places, and to track someone who moves out of the range of view of one camera and later enters the range of view of another. The SVD can capture the most important features by saving just the first few singular values and corresponding singular vectors, with big savings in memory, computation, and time. The SVD also allows stating mathematically exactly how accurate that approximation is (at least, in a particular norm).
The SVD not only gives a solution to LLS, but also provides the minimum norm solution when null(A) is non-trivial. One heuristic for the formula that gives the min-norm LLS x* is to consider what a "solution" to Ax = b is when A = UDVT and all of A, U, D, and V are nonsingular. In that case,

x = A-1b = (UDVT)-1b,
and since the inverse of a product is the product of inverses in reverse order,

x = (VT)-1D-1U-1b = V-TD-1U-1b.
Since U and V are orthogonal, U-1 = UT and V-T = V. That means

x = VD-1UTb.
For a more general case, notice everything in the last expression is ok, except possibly D-1 since some of the singular values (the diagonal elements of D) might be zero. So we'll just do what is possible: invert the diagonal entries in D that are nonzero, but leave the other diagonal entries alone as zeros. That gives ...
The minimum norm least squares solution x* to minx || Ax − b ||2 is

x* = VD†UTb,
where D† is the n × m diagonal matrix with the diagonal entries σi† = σi−1 if σi ≠ 0, and σi† = 0 if σi = 0. [The symbol † is "dagger", so D† is pronounced "D dagger".]
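Putting the formula to work: a NumPy sketch (the rank-deficient matrix and right-hand side below are made up for illustration) that builds D† exactly as described and compares against NumPy's built-in Moore-Penrose pseudoinverse:

```python
import numpy as np

# A deliberately rank-deficient problem: the third column is the sum of
# the first two, so rank(A) = 2 < n = 3 and null(A) is non-trivial.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [2.0, 0.0, 2.0]])
b = np.array([1.0, 2.0, 0.0, 1.0])

U, s, Vt = np.linalg.svd(A)                      # full SVD
tol = max(A.shape) * np.finfo(float).eps * s[0]  # declare tiny sigmas zero
sdag = np.array([1.0/si if si > tol else 0.0 for si in s])

Ddag = np.zeros((A.shape[1], A.shape[0]))        # D-dagger is n x m
np.fill_diagonal(Ddag, sdag)

x_star = Vt.T @ Ddag @ U.T @ b                   # x* = V D-dagger U^T b

# Agrees with NumPy's Moore-Penrose pseudoinverse applied to b.
assert np.allclose(x_star, np.linalg.pinv(A) @ b)
```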
The name suggests other pseudoinverses exist, and indeed there are bucket-loads of them. In general they invert what can be inverted, and differ mainly in what they do on null(A). Most often, the term "pseudoinverse" without qualification means the Moore-Penrose one, but beware that some older works, especially in engineering, may use a different pseudoinverse.
Almost none of the useful properties above require U2 in the partitioning of U, because those columns are being multiplied by zeros in D. Discarding those yields the space-saver version of the SVD: for m ≥ n, A = UDVT as before, but now D is n × n and U has just the first n columns of the full SVD matrix U (called U1 above). In Matlab, [U,D,V] = svd(A) gives the full SVD of A, and [U,D,V] = svd(A,0) is the space-saver version.
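The Matlab calls above have a direct NumPy analogue via the full_matrices flag; a small sketch with a made-up matrix:

```python
import numpy as np

A = np.arange(1.0, 13.0).reshape(4, 3)   # m = 4 >= n = 3

# Full SVD vs the "space-saver" (economy) SVD.  In Matlab these are
# svd(A) and svd(A,0); in NumPy the switch is full_matrices.
Uf, sf, Vtf = np.linalg.svd(A, full_matrices=True)
Ue, se, Vte = np.linalg.svd(A, full_matrices=False)

assert Uf.shape == (4, 4) and Ue.shape == (4, 3)   # U1 has only n columns
assert np.allclose(sf, se)                         # same singular values
assert np.allclose((Ue * se) @ Vte, A)             # still reconstructs A
```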
The SVD is the most numerically reliable method, is the only completely reliable one for rank-deficient A, and works for any matrix A. The problem is that the SVD is expensive to compute, often requiring an iterative process that can take an unknown number of steps. More difficult is the "rank determination" problem: when the singular values start dribbling off to small values, how do you decide when to declare them effectively zero? Nevertheless, for small problems the SVD is the best approach. To get an idea of what "small" means, a 2.8 GHz quad-core Intel with 12 Gbytes of memory takes less than 1.5 seconds to compute the space-saver version of the SVD of a 5000 × 1000 matrix in Matlab.
Want to know if you understand this material? Easy: consider using the space-saver SVD in the Moore-Penrose formula for the min-norm LLS solution x*. What needs to change to avoid using any of U2? What if rank(A) = r < n? Work through those cases, and keep track of just what size each (sub)matrix is as you go along. Beware: U1T U1 = I (and just what size is the identity matrix I in the last equation?), but U1U1T ≠ I (and just what size is matrix U1U1T?). In class, U1U1T was stated to be an orthogonal projector; what famous subspace related to the matrix A does it project onto? What about I - U1U1T?
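Without giving away the in-class questions, the size claims about U1TU1 and U1U1T are easy to check numerically (the example matrix is made up for illustration):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [1.0, 0.0],
              [2.0, 2.0]])          # m = 4, n = 2, full column rank

U1, s, Vt = np.linalg.svd(A, full_matrices=False)  # U1: first n columns of U
P = U1 @ U1.T

assert np.allclose(U1.T @ U1, np.eye(2))   # n x n identity
assert not np.allclose(P, np.eye(4))       # but U1 U1^T (m x m) is NOT I
assert np.allclose(P @ P, P)               # P is a projector ...
assert np.allclose(P @ A, A)               # ... that leaves the columns of A fixed
```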
For larger problems, use QR. Two major ways (Givens and Householder) are available and each gives different computational kernels. Load/store analysis indicates Householder reflectors are a better choice because they use level 2 instead of level 1 BLAS. As usual, deferred updates then provide a way to boost the BLAS-2 version to a BLAS-3 one.
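For comparison with the SVD route, here is a NumPy sketch of solving an LLS problem by QR (NumPy's qr calls a Householder-based LAPACK routine); the data is made up for illustration:

```python
import numpy as np

# A full-rank 4 x 2 problem: fit a line to four points.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([1.0, 2.0, 2.0, 4.0])

Q, R = np.linalg.qr(A)            # "reduced" QR: Q is 4x2, R is 2x2
x = np.linalg.solve(R, Q.T @ b)   # solve R x = Q^T b

# Same answer as a library least squares solver.
assert np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0])
```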
For extreme-scale problems, an iterative method is often a better choice. Those have not been covered in these pages. However, entire books have been written on least squares. Topics not treated here include: