Linear Least Squares: Definitions and Basics
Preliminary conventions and references:
- Here "minx" means
to "minimize over all values of the vector x", and is usually
written beneath the word "min". The closest typographical analogue
in HTML is a subscript. The terminology "argminx" makes
explicit the idea that the vector x is what is required,
not just the minimum value of ||Ax-b||2
- The full name of the problem is "linear least squares", but to
save typing I'll use "least squares" or the acronym "LLS".
Yes, Virginia, there are also nonlinear least squares problems,
but many of them involve solving a sequence of linear least squares
problems and all of them rely analytically and numerically
upon LLS ideas.
- For a mathematical operation, concatenation means multiplication,
e.g., y = Ax. For coding, either an asterisk (*) or a times symbol (×)
is used, e.g., y = A*x.
- The single best reference for most numerical linear algebra algorithms and
their implementation is Golub and Van Loan, Matrix Computations, and most of
the material below can be found in it. You don't need it for this course, but
if you do end up using numerical linear algebra for other things, it's worth
the money. Currently (2019) the latest edition is the 4th, but earlier
editions suffice and are much cheaper.
- A standard reference book is Solving Least Squares Problems by Lawson and Hanson.
[The same two authors also have a later book, Solving Least Squares Problems
with Linear Inequality Constraints.]
- Numerical Methods for Least Squares Problems by Åke Björck
provides many details about specific subproblems within the area of LLS.
Linear algebra and LLS notation and ideas:
Linear least squares (LLS) problems have the form
min_x || Ax − b ||_2
where A is a given m×n matrix, b is a given m-vector, and the minimum
is taken over all n-vectors x.
In applications LLS corresponds to a linear model of some quantity that
depends on n parameters (the entries in the vector x), and for which m
observations or experiments have been carried out. A single observation
gives a single value (which goes into the corresponding entry of the vector
b) as a linear combination of the underlying parameters.
Almost always the number of observations/experiments carried out is larger
than the number of parameters, i.e., m > n. The observed values are
stacked up in the m-vector b, and the coefficients of the linear combination
of parameters are placed as the corresponding row of A.
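To make this concrete, here is a minimal sketch in Python/NumPy with made-up
data (the names t and y_obs are mine, not from any particular application):
m = 5 observations of a quantity modeled by n = 2 parameters, a straight-line
fit. The library routine numpy.linalg.lstsq solves min_x ||Ax − b||_2.

    import numpy as np

    # m = 5 observations of a quantity modeled with n = 2 parameters:
    # y is modeled as x[0] + x[1]*t (a straight-line fit), so m > n.
    t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y_obs = np.array([1.1, 1.9, 3.2, 3.9, 5.1])   # the m-vector b

    # Each row of A holds the coefficients of the linear combination of
    # the parameters for the corresponding observation: [1, t_i].
    A = np.column_stack([np.ones_like(t), t])     # A is 5x2

    # Solve min_x ||Ax - b||_2.
    x, _, _, _ = np.linalg.lstsq(A, y_obs, rcond=None)
    print(x)   # about [1.04, 1.0]: intercept and slope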
The basics of linear algebra (like 'linear combination', etc.) have
already been posted,
but review the info about subspaces and orthogonal matrices if needed.
Some additional facts and notation for an m×n matrix A:
- range(A) = {y : y = Ax for some x ∈ ℜ^n}
- null(A) = {x : Ax = 0}. Note that for an m×n matrix A, range(A) consists of
m-vectors, while null(A) consists of n-vectors. So they live in completely
different worlds, and in general you cannot add vectors from range(A) to ones
from null(A).
- The rank of A is the maximal number of linearly independent rows
(or columns) in A. A square matrix A is nonsingular if and only if it is
full rank, i.e., rank(A) = n. When A is m×n with m > n and A is full rank,
it has a trivial null space: null(A) = {0}, where 0 is the n-vector of all
zeros.
- The residual vector r = b − Ax is used heavily in least squares
(and in iterative methods for nonsingular linear systems). Technically the
residual is a function of x: r(x) = b − Ax, but that dependence is rarely
explicitly stated in functional form. If r = 0, then the LLS problem has been
solved, because then b = Ax. For LLS problems the residual is almost never
zero, and instead a different characterization is used to determine
optimality of a given x.
- An overdetermined system has m > n.
- An underdetermined system has m < n.
- When m < n or null(A) is nontrivial,
sometimes the minimum norm least squares solution is desired.
An underdetermined system has an infinite number of solutions of the form
x_0 + z, where x_0 is any particular vector that minimizes
|| Ax − b ||_2 and z ∈ null(A). This makes sense because
A*z = 0, so adding z to x_0 has no effect on the residual:
r_0 = b − A*x_0 = b − A*(x_0 + z)
So when null(A) contains a nonzero vector z, x_0 + z,
x_0 + (10^100)*z, and x_0 − (21526766112879430287954321)!*z
also solve the LLS problem.
Since for underdetermined systems a least squares solution vector x could be
arbitrarily large in size (and you have to admit,
(21526766112879430287954321)! is a pretty good example of "arbitrarily
large"), to prevent overflow the least squares solution of smallest two-norm
among all possible solutions x_0 + z, z ∈ null(A), is sought. That particular
x is unique; a short NumPy sketch of this follows.
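Here is a minimal NumPy sketch of the underdetermined case (the matrix and
right-hand side are made up purely for illustration). numpy.linalg.lstsq
returns the minimum two-norm solution when null(A) is nontrivial, and adding
a multiple of a null-space vector z leaves the residual unchanged while
growing the norm of the solution.

    import numpy as np

    # Underdetermined: m = 2 equations, n = 3 unknowns, so null(A) is
    # nontrivial.
    A = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])
    b = np.array([1.0, 1.0])

    # lstsq returns the minimum two-norm least squares solution x_0.
    x0 = np.linalg.lstsq(A, b, rcond=None)[0]

    # For this A, z = [1, -2, 1] lies in null(A): check that A*z = 0.
    z = np.array([1.0, -2.0, 1.0])
    print(np.allclose(A @ z, 0))                  # True

    # Adding a multiple of z to x0 leaves the residual unchanged ...
    r0 = b - A @ x0
    r1 = b - A @ (x0 + 100.0 * z)
    print(np.allclose(r0, r1))                    # True
    # ... but the solution norm grows. With a truly huge multiple (10^100,
    # say) floating point precision, and eventually overflow, become issues.
    print(np.linalg.norm(x0), np.linalg.norm(x0 + 100.0 * z))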
Some not-necessarily mathematical observations:
- As already noted, most least squares problems in applications are overdetermined (i.e., m > n).
- For underdetermined systems, in computational practice all that is
really needed is a solution x that is bounded in size, and not necessarily the
unique one of minimal norm. The issue is loss of precision in a floating point
representation, and if a solution x has a norm 100 times larger than the
minimum norm solution, that is almost always acceptable with 64-bit double
precision numbers.
On the other hand, a solution x with norm
(21526766112879430287954321)!
times as large as the minimal one is probably not going to fly. A rough rule
is that the norm can be a factor of 100*m*n larger than the minimal one. This
rule has no basis whatsoever in mathematical analysis, only a lifetime of
observing what happens in least squares problems from a broad range of
scientific applications, so don't take it as divine revelation (or even as a
Satanic revelation).
- Choosing to minimize the two-norm of the residual vector r = b − Ax is
sometimes an arbitrary choice. For mathematicians, one advantage of the
two-norm is that when it is squared, the resulting function of x is
differentiable, so the math works out more easily. In many applications
minimizing the 1-norm or the ∞-norm is a significantly better choice; a
sketch of 1-norm minimization via linear programming appears after this list.
- Geometrically, a least squares problem corresponds to finding a vector x so
that the corresponding residual r = b − Ax is orthogonal to the columns of
A. That characterization gives the normal equations:
A^T r = A^T(b − Ax) = 0
or equivalently, A^T A x = A^T b.
- The normal equations give the "different characterization" for the residual
vector mentioned above: a vector x solves LLS if and only if it satisfies the
normal equations (and so, iff A^T r = 0). A short NumPy check of this appears
below.
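To make the orthogonality characterization concrete, here is a short NumPy
check on a made-up overdetermined system (the data is invented for
illustration). Note that forming A^T A explicitly is generally a numerically
inferior way to solve LLS; it appears here only as a sanity check.

    import numpy as np

    # A made-up overdetermined system: m = 4, n = 2.
    A = np.array([[1.0, 1.0],
                  [1.0, 2.0],
                  [1.0, 3.0],
                  [1.0, 4.0]])
    b = np.array([1.0, 2.0, 2.0, 3.0])

    x = np.linalg.lstsq(A, b, rcond=None)[0]

    # The residual is nonzero (b is not in range(A)) ...
    r = b - A @ x
    print(np.linalg.norm(r))            # about 0.45, not zero

    # ... but r is orthogonal to the columns of A: A^T r = 0 to roundoff.
    print(A.T @ r)                      # roughly [0, 0]

    # The normal equations A^T A x = A^T b reproduce the same solution.
    x_ne = np.linalg.solve(A.T @ A, A.T @ b)
    print(np.allclose(x, x_ne))         # True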
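And here is a sketch of 1-norm minimization, reformulated as a linear program
and solved with scipy.optimize.linprog (again, the data is made up). The
standard reformulation introduces auxiliary variables u with
-u <= Ax - b <= u and minimizes the sum of the u_i.

    import numpy as np
    from scipy.optimize import linprog

    # A made-up overdetermined system: m = 4, n = 2.
    A = np.array([[1.0, 1.0],
                  [1.0, 2.0],
                  [1.0, 3.0],
                  [1.0, 4.0]])
    b = np.array([1.0, 2.0, 2.0, 3.0])
    m, n = A.shape

    # Minimize ||Ax - b||_1 = sum(u) subject to A x - u <= b and
    # -A x - u <= -b; the unknowns are stacked as [x, u].
    c = np.concatenate([np.zeros(n), np.ones(m)])
    A_ub = np.block([[ A, -np.eye(m)],
                     [-A, -np.eye(m)]])
    b_ub = np.concatenate([b, -b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * n + [(0, None)] * m)
    x_l1 = res.x[:n]
    print(x_l1)    # the 1-norm minimizer; compare with the 2-norm fit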
Next: ways of solving LLS
Last Modified: Mon 04 Nov 2019, 07:20 AM