Singular Value Decomposition and Linear Least Squares

The most space- and computation-efficient ways of solving LLS problems rely on a factorization (AKA decomposition) of the m × n coefficient matrix A using orthogonal matrices. The key property of an orthogonal matrix is that multiplying it times a vector or matrix does not change the two-norm:

||B||_2 = ||Q*B||_2,

for any orthogonal matrix Q and conformant matrix B with n rows (B can also be an n × 1 vector). This norm-preserving property holds up surprisingly well both in theory and in actual numerical computations. The more efficient way of factoring A using an orthogonal matrix is the QR factorization, but the singular value decomposition (SVD) provides more complete and useful analytical information about LLS. Don't forget the warning: the QR factorization is not the same thing as the QR algorithm. The QR algorithm is an iterative method for finding the eigenvalues of the matrix A, and it computes a QR factorization on each iteration. Of course, when A is rectangular the QR algorithm cannot be what is used. Why is that so?
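
Here is a quick numerical check of this property in Matlab/Octave. The orthogonal Q is generated from a QR factorization of a random matrix; that choice is purely for illustration.

  % Check that multiplying by an orthogonal matrix preserves the 2-norm.
  n = 100;
  [Q, ~] = qr(randn(n));            % n x n orthogonal matrix
  B = randn(n, 7);                  % conformant matrix with n rows (an n x 1 vector also works)
  abs(norm(B, 2) - norm(Q*B, 2))    % zero up to roundoff, typically around 1e-14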


The Singular Value Decomposition

The singular value decomposition is possibly the single most informative, and arguably the most important, matrix decomposition for numerical methods and theory. The SVD is a decomposition (factorization) of a matrix A into the product of three matrices:

A = U D V^T,

where U is an m × m orthogonal matrix, D is an m × n diagonal matrix with nonnegative diagonal entries, and V is an n × n orthogonal matrix.

Beware: the dimensions above are for the "full SVD". Some textbooks instead use a "space-saver" version of the SVD, where U is m × r with r = rank(A). Another space-saver variant has U of size m × n when n ≤ m. When r < n, this second space-saver variant has n − r extra columns in U, and n − r singular values that are exactly zero on the diagonal of D. Furthermore, some sources do not use the terminology of "full" or "space-saver" versions, so you need to check which variant is being used in a given work. In particular, be careful when calling library routines that compute the SVD, and make sure you know what you will get back.

As part of the factorization the diagonal entries of D are usually ordered as

σ_1 ≥ σ_2 ≥ σ_3 ≥ … ≥ σ_p ≥ 0,

where p = min(m, n). Question: if A is m × n and m ≥ n, this description of the ordering of singular values implicitly assumes which of the variants mentioned above is being used?

The diagonal entries of D are called the singular values of A. The ith left singular vector of A is the ith column of U, and the ith right singular vector of A is column i of V. This numbering assumes the singular values are ordered in descending order, so that the first left singular vector of A corresponds to the largest singular value of A. In all that follows, that ordering is assumed.
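
Here is a small Matlab/Octave illustration of the full SVD and the naming conventions just described; the 6 × 4 size is arbitrary.

  % Full SVD of a random rectangular matrix, and the first singular pair.
  m = 6; n = 4;
  A = randn(m, n);
  [U, D, V] = svd(A);          % full SVD: U is m x m, D is m x n, V is n x n
  sigma = diag(D);             % singular values, already sorted in descending order
  u1 = U(:, 1);                % first left singular vector
  v1 = V(:, 1);                % first right singular vector
  norm(A*v1 - sigma(1)*u1)     % A*v_i = sigma_i*u_i, so this is ~1e-15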


Properties of the SVD

The SVD reveals more information about a matrix than any other decomposition. When A is symmetric and positive definite, the Schur decomposition (eigendecomposition) A = Q D Q^T gives the eigenvalues in the diagonal matrix D and the eigenvectors in the columns of the orthogonal matrix Q. In this case, the Schur decomposition is the same as the SVD (with U = V = Q), and for this reason the SVD can be viewed as a better way of extending the ideas of eigenvalues from square matrices to nonsymmetric and rectangular matrices. Information that the SVD reveals includes the following (a numerical check of several of these properties appears right after the list):

  1. If r = rank(A), then σ_{r+1} = σ_{r+2} = … = σ_p = 0, where p = min(m, n).

  2. σ_1 = ||A||_2

  3. σ_1^2 + σ_2^2 + … + σ_p^2 = ||A||_F^2

  4. σ_i^2 is the ith largest eigenvalue of A^T A
  5. σ_i^2 is the ith largest eigenvalue of A A^T
  6. Partition U by columns with U = [ U1 U2 ], where U1 has r columns. Then the columns of U1 form an orthonormal basis for range(A)

  7. Partition V by columns with V = [ V1 V2 ], where V1 has r columns. Then the columns of V2 form an orthonormal basis for null(A)

  8. The relation A = U D V^T can be written as a sum of rank-1 matrices:
    A = ∑_{i=1:n} σ_i u_i v_i^T

    Each term in the summation is a rank-1 matrix: the ith term is the singular value σ_i times the outer product of an m-vector u_i and an n-vector v_i. Since the vectors u_i and v_i are columns of orthogonal matrices, they are linearly independent and their 2-norms are 1. So any "weight" a term has comes solely from its multiplier, the singular value σ_i. Because the singular values are sorted into decreasing order, heuristically that means the first terms have more importance than the later terms.

  9. Now the condition number of a matrix can be defined, even if the matrix is not square and nonsingular. For least squares problems,
    κ(A) = σ_1 / σ_r

    where σ_r is the smallest nonzero singular value of A. Recall, a "condition number" for a matrix can have different definitions in different contexts. For a nonsingular (and hence square) matrix A, it was κ(A) = ||A|| * ||A^-1||. Is that the same number you get in least squares, for the special case of the matrix A being square and nonsingular? You have enough information at this point to be able to answer the question, with either a proof or a small counterexample. So consider this as quiz question 3. Don't bother reading further without doing this - the rest of this material won't be worth dirt if you don't understand or are unable to figure this out.

  10. The SVD of the transpose of a matrix A is

    A^T = V D^T U^T,

    where U, D, and V are the matrices given by the SVD of A. This is one reason assuming that A is m × n with m ≥ n is reasonable: the SVD of the transpose is the transpose of the SVD.
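
Here is a Matlab/Octave sketch that checks properties 2, 3, 4, and the condition number from property 9 on a random full-rank matrix. Every printed value should be at roundoff level.

  % Numerical check of several SVD properties (uses only standard builtins).
  m = 80; n = 30;
  A = randn(m, n);                          % full rank with probability 1
  sigma = svd(A);                           % the p = min(m,n) singular values
  norm(A, 2) - sigma(1)                     % property 2: sigma_1 = ||A||_2
  norm(A, 'fro')^2 - sum(sigma.^2)          % property 3: sum of sigma_i^2 = ||A||_F^2
  B = A'*A;  B = (B + B')/2;                % symmetrize to guard against roundoff asymmetry
  lam = sort(eig(B), 'descend');            % property 4: eigenvalues of A^T A
  max(abs(lam - sigma.^2))
  sigma(1)/sigma(end) - cond(A)             % property 9, full-rank case: kappa(A) = sigma_1/sigma_n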


Quick Questions

Make sure you have the idea down. The SVD is well-defined for any matrix, and you should be able to quickly state what it is for some simple special cases of A.

Even without having seen any algorithms for obtaining the SVD, you should be able to figure out what the SVD is for those special cases. Answering those is best done by drawing a picture of just what shapes and sizes the matrices U, D, and V will be in each case.


Low Rank Approximation

The most important SVD property for modern applications is based on the summation of rank-1 matrices property above. Define the partial sum matrix
A_k = ∑_{i=1:k} σ_i u_i v_i^T.

Then A_k is the best rank-k approximation to A in the Frobenius norm. This is of great importance in data reduction. It means that a huge matrix A containing data can be approximated by a much smaller one. To pin down this advantage better, some problems have m ≈ 100M rows and about n ≈ 0.5M columns. In those applications, using k = 30 is sometimes adequate to capture the important information from the matrix, requiring over 16,000 times less storage and processing. The most important performance engineering principle in scientific computing is that numerical computation is not the bottleneck or limiting factor; moving data around is. So the ability to represent or use information with significantly less data can be a major win. This "best approximation" property of the SVD gives a mathematically rigorous approach to doing this systematically.
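
Here is a Matlab/Octave sketch of forming A_k and verifying the Frobenius-norm error; the sizes are tiny stand-ins for the huge matrices just described.

  % Best rank-k approximation via the SVD, and its Frobenius-norm error.
  m = 500; n = 200; k = 10;
  A = randn(m, n);
  [U, D, V] = svd(A, 0);                        % space-saver SVD is enough here
  Ak = U(:, 1:k) * D(1:k, 1:k) * V(:, 1:k)';    % the rank-k partial sum A_k
  sigma = diag(D);
  norm(A - Ak, 'fro') - sqrt(sum(sigma(k+1:end).^2))   % error = sqrt(sum of the dropped sigma_i^2)
  [m*n, k*(m + n + 1)]                          % storage for A versus the rank-k factors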

One example application of this "low-rank approximation property" is a document-text matrix from analyzing a large set of documents (e.g., all of the tens of thousands of published research articles about cancer). The document-text matrix A is defined such that A(i,j) = the number of times word j occurs in document i. Another example is a large database of pictures of faces, with a set of 30-40 characteristics measured for each face (nose length, eye width/height ratio, etc.). Then the entry A(i,j) is the jth characteristic value for the ith face. A recent application of interest is identifying people from video cameras mounted in public places, and tracking someone when they move out of the range of view of one camera and later enter the range of view of another. The SVD can capture the most important features by saving just the first few singular values and corresponding singular vectors, with big savings in memory, computation, and time. The SVD also allows stating mathematically exactly how accurate that approximation is (at least, in a particular norm).


Solving LLS Problems with the SVD

The SVD gives not just a solution to LLS, but also provides the minimum norm solution when null(A) is non-trivial. One heuristic for the formula that gives the min-norm LLS solution x* is to consider what a "solution" to Ax = b is when A = U D V^T and all of A, U, D, and V are nonsingular. In that case,

x* = A^-1 b = (U D V^T)^-1 b,

and since the inverse of a product is the product of inverses in reverse order,

(U D V^T)^-1 = V^-T D^-1 U^-1

Since U and V are orthogonal, U^-1 = U^T and V^-T = V. That means

x* = A^-1 b = V D^-1 U^T b

For a more general case, notice everything in the last expression is OK, except possibly D^-1, since some of the singular values (the diagonal elements of D) might be zero. So we'll just do what is possible: invert the diagonal entries in D that are nonzero, but leave the other diagonal entries alone as zeros. That gives ...
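
Here is a quick Matlab/Octave check of the square, nonsingular case; the test matrix is arbitrary, just nudged away from singularity.

  % In the nonsingular case, x = V*D^{-1}*U^T*b solves A*x = b.
  n = 50;
  A = randn(n) + n*eye(n);          % a safely nonsingular square matrix
  b = randn(n, 1);
  [U, D, V] = svd(A);
  x_svd = V * (D \ (U' * b));       % x = V * D^{-1} * U^T * b (D is diagonal, so D\ is cheap)
  norm(x_svd - A\b) / norm(A\b)     % relative difference from the direct solve, roundoff level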

The SVD min-norm least squares solution

The minimum norm least squares solution x* to min_x || Ax − b ||_2 is

x* = V D† U^T b,

where D† is the n × m diagonal matrix with diagonal entries σ_i† = σ_i^-1 if σ_i ≠ 0, and σ_i† = 0 if σ_i = 0. [The symbol † is "dagger", so D† is pronounced "D dagger".]

Moore-Penrose pseudoinverse

The matrix A† := V D† U^T in the formula for the min-norm least squares solution is called the Moore-Penrose pseudoinverse, and the Matlab function pinv() returns it for any matrix A. It is returned as a single matrix, not in the SVD factored form, so for most applications it is better to compute and keep the U, D, and V from the SVD instead.

The name suggests other pseudoinverses exist, and indeed there are bucket-loads of them. In general they invert what can be inverted, and differ mainly in what they do on null(A). Most often, the term "pseudoinverse" without qualification means the Moore-Penrose one, but beware that some older works, especially in engineering, may use a different p-inverse.
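
Here is a Matlab/Octave sketch that builds D† explicitly from the SVD of a rank-deficient matrix and compares the resulting min-norm solution with pinv(). The tolerance used to declare singular values "effectively zero" mimics the default used by rank() and pinv(), but it is a choice, not a law.

  % Min-norm least squares solution x* = V*Ddag*U'*b for a rank-deficient A.
  m = 40; n = 15; r = 10;
  A = randn(m, r) * randn(r, n);               % rank(A) = r < n, so null(A) is non-trivial
  b = randn(m, 1);
  [U, D, V] = svd(A);                          % full SVD: D is m x n
  sigma = diag(D);                             % the min(m,n) singular values
  tol = max(m, n) * eps(sigma(1));             % "effectively zero" threshold (same spirit as rank/pinv)
  sdag = zeros(n, 1);
  sdag(sigma > tol) = 1 ./ sigma(sigma > tol); % invert only the nonzero singular values
  Ddag = [diag(sdag), zeros(n, m - n)];        % D-dagger is n x m
  x_star = V * (Ddag * (U' * b));              % the min-norm least squares solution
  norm(x_star - pinv(A)*b)                     % matches Matlab's pseudoinverse, up to roundoff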


Space-saver SVD

Almost none of the useful properties above require U2 in the partitioning of U, because those columns are multiplied by zeros in D. Discarding them yields the space-saver version of the SVD: for m ≥ n, A = U D V^T as before, but now D is n × n and U has just the first n columns of the full SVD matrix U (called U1 above). In Matlab, [U,D,V] = svd(A) gives the full SVD of A, and [U,D,V] = svd(A,0) gives the space-saver version.
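
A quick Matlab/Octave comparison of the two forms, with sizes chosen to make the storage difference obvious:

  % Full versus space-saver SVD of a tall, skinny matrix.
  m = 1000; n = 50;
  A = randn(m, n);
  [Uf, Df, Vf] = svd(A);       % full:        Uf is 1000 x 1000, Df is 1000 x 50
  [Us, Ds, Vs] = svd(A, 0);    % space-saver: Us is 1000 x 50,   Ds is 50 x 50
  whos Uf Us                   % the full Uf takes m/n = 20 times the storage of Us
  norm(Uf*Df*Vf' - Us*Ds*Vs', 'fro')   % both reproduce A; difference is roundoff level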

The SVD is the most numerically reliable method, is the only completely reliable one for rank-deficient A, and works for any matrix A. The problem is that the SVD is expensive to compute, often requiring an iterative process that can take an unknown number of steps. More difficult is the "rank determination" problem: when the singular values start dribbling off to small values, how do you decide when to declare them effectively zero? Nevertheless, for small problems the SVD is the best approach. To get an idea of what "small" means, a 2.8 GHz quad-core Intel machine with 12 GB of memory takes less than 1.5 seconds to compute the space-saver SVD of a 5000 × 1000 matrix in Matlab.


Hunh?

Want to know if you understand this material? Easy: consider using the space-saver SVD in the Moore-Penrose formula for the min-norm LLS solution x*. What needs to change to avoid using any of U2? What if rank(A) = r < n? Work through those cases, and keep track of just what size each (sub)matrix is as you go along. Beware: U1^T U1 = I (and just what size is the identity matrix I in that equation?), but U1 U1^T ≠ I (and just what size is the matrix U1 U1^T?). In class, U1 U1^T was stated to be an orthogonal projector; what famous subspace related to the matrix A does it project onto? What about I − U1 U1^T?
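
If you want to experiment, here is a small Matlab/Octave setup for those questions; it only builds the pieces and prints sizes, and deliberately does not answer them.

  % Sandbox for the U1 questions above, with a rank-deficient A.
  m = 8; n = 5;
  A = randn(m, 3) * randn(3, n);   % rank(A) = r = 3 < n
  r = rank(A);
  [U, D, V] = svd(A);
  U1 = U(:, 1:r);                  % the first r columns of U
  size(U1'*U1)                     % which identity is this?
  size(U1*U1')                     % and what size (and rank) is this one?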


Summary and General Observations about Least Squares

Linear least squares problems have a slew of methods for solving them, which means a lot of flexibility is available for choosing numerically stable and efficient ones. If size and time are not a problem, use the SVD. It will give the minimum-norm least squares solution, and all the auxiliary information needed to determine just how reliable the solution is and what good or bad properties the original matrix A has.

For larger problems, use QR. Two major ways (Givens and Householder) are available and each gives different computational kernels. Load/store analysis indicates Householder reflectors are a better choice because they use level 2 instead of level 1 BLAS. As usual, deferred updates then provide a way to boost the BLAS-2 version to a BLAS-3 one.

For extreme-scale problems, an iterative method is often a better choice. Those have not been covered in these pages. However, entire books have been written on least squares, and many further topics are not treated here.

