Linear Least Squares: Definitions and Basics
Preliminary conventions and references:
- Here "minx" means
to "minimize over all values of the vector x", and is usually
written beneath the word "min". The closest typographical analogue
in HTML is a subscript. The terminology "argminx" makes
explicit the idea that the vector x is what is required,
not just the minimum value of ||Ax-b||2
- The full name of the problem is "linear least squares", but to
save typing I'll use "least squares" or the acronym "LLS".
Yes, Virginia, there are also nonlinear least squares problems,
but many of them involve solving a sequence of linear least squares
problems and all of them rely analytically and numerically
upon LLS ideas.
- For a mathematical operation, concatenation means multiplication,
e.g., y = Ax. For coding, either an asterisk (*) or a times symbol (×)
is used, e.g., y = A*x.
- The single best reference for most numerical linear algebra algorithms and
their implementation is Golub and Van Loan, Matrix Computations, and most of
the material below can be found in it. You don't need it for this course, but
if you do end up using numerical linear algebra for other things, it's worth
the money. Currently (2019) the latest edition is the 4th, but earlier
editions suffice and are much cheaper.
- A standard reference book is Solving Least Squares Problems by Lawson and Hanson.
[The same two authors also have a later book, Solving Least Squares Problems
with Linear Inequality Constraints.]
- Numerical Methods for Least Squares Problems by Åke Björck
provides many details about specific subproblems within the area of LLS.
Linear algebra and LLS notation and ideas:
Linear least squares (LLS) problems have the form
min_x || Ax − b ||_2
where A is a given m×n matrix, b is a given m-vector, and the minimum
is taken over all n-vectors x.
In applications LLS corresponds to a linear model of some quantity that
depends on n parameters (the entries in the vector x), and for which m
observations or experiments have been carried out. A single observation
gives a single value (which goes into the corresponding entry of the vector
b) as a linear combination of the underlying parameters.
Almost always the number of observations/experiments carried out is larger
than the number of parameters, i.e., m > n. The observed values are
stacked up in the m-vector b, and the coefficients of the linear combination
of parameters are placed as the corresponding row of A.
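To make this concrete, here is a minimal sketch in Python/NumPy with made-up
data (the names t and y_obs are mine, not from any particular application):
m = 5 observations of a quantity modeled by n = 2 parameters, a straight-line
fit. The library routine numpy.linalg.lstsq solves min_x ||Ax − b||_2.

    import numpy as np

    # m = 5 observations of a quantity modeled with n = 2 parameters:
    # y is modeled as x[0] + x[1]*t (a straight-line fit), so m > n.
    t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y_obs = np.array([1.1, 1.9, 3.2, 3.9, 5.1])   # the m-vector b

    # Each row of A holds the coefficients of the linear combination of
    # the parameters for the corresponding observation: [1, t_i].
    A = np.column_stack([np.ones_like(t), t])     # A is 5x2

    # Solve min_x ||Ax - b||_2.
    x, _, _, _ = np.linalg.lstsq(A, y_obs, rcond=None)
    print(x)   # about [1.04, 1.0]: intercept and slope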
The basics of linear algebra (like 'linear combination', etc.) have
already been posted,
but review the info about subspaces and orthogonal matrices if needed.
Some additional facts and notation for an m×n matrix A:
- range(A) = {y : y = Ax for some x ∈ ℜ^n}
- null(A) = {x : Ax = 0}. Note that for an m×n matrix A, range(A) consists of
m-vectors, while null(A) consists of n-vectors. So they live in completely
different worlds, and in general you cannot add vectors from range(A) to ones
from null(A).
- The rank of A is the maximal number of linearly independent rows
(or columns) in A. A square matrix A is nonsingular if and only if it is
full rank, i.e., rank(A) = n. When A is m×n with m > n and A is full rank,
it has a trivial null space: null(A) = {0}, where 0 is the n-vector of all
zeros.
- The residual vector r = b − Ax is used heavily in least squares
(and in iterative methods for nonsingular linear systems). Technically the
residual is a function of x: r(x) = b − Ax, but that dependence is rarely
explicitly stated in functional form. If r = 0, then the LLS problem has been
solved, because then b = Ax. For LLS problems the residual is almost never
zero, and instead a different characterization is used to determine
optimality of a given x.
- An overdetermined system has m > n.
- An underdetermined system has m < n.
- When m < n or null(A) is nontrivial,
sometimes the minimum norm least squares solution is desired.
An underdetermined system has an infinite number of solutions of the form
x_0 + z, where x_0 is any particular vector that minimizes
|| Ax − b ||_2 and z ∈ null(A). This makes sense because
A*z = 0, so adding z to x_0 has no effect on the residual:
r_0 = b − A*x_0 = b − A*(x_0 + z)
So when null(A) contains a nonzero vector z, x_0 + z,
x_0 + (10^100)*z, and x_0 − (21526766112879430287954321)!*z
also solve the LLS problem.
Since for underdetermined systems a least squares solution vector x could be
arbitrarily large in size (and you have to admit,
(21526766112879430287954321)! is a pretty good example of "arbitrarily
large"), to prevent overflow the least squares solution of smallest two-norm
among all possible solutions x_0 + z, z ∈ null(A), is sought. That particular
x is unique; a short NumPy sketch of this follows.
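Here is a minimal NumPy sketch of the underdetermined case (the matrix and
right-hand side are made up purely for illustration). numpy.linalg.lstsq
returns the minimum two-norm solution when null(A) is nontrivial, and adding
a multiple of a null-space vector z leaves the residual unchanged while
growing the norm of the solution.

    import numpy as np

    # Underdetermined: m = 2 equations, n = 3 unknowns, so null(A) is
    # nontrivial.
    A = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])
    b = np.array([1.0, 1.0])

    # lstsq returns the minimum two-norm least squares solution x_0.
    x0 = np.linalg.lstsq(A, b, rcond=None)[0]

    # For this A, z = [1, -2, 1] lies in null(A): check that A*z = 0.
    z = np.array([1.0, -2.0, 1.0])
    print(np.allclose(A @ z, 0))                  # True

    # Adding a multiple of z to x0 leaves the residual unchanged ...
    r0 = b - A @ x0
    r1 = b - A @ (x0 + 100.0 * z)
    print(np.allclose(r0, r1))                    # True
    # ... but the solution norm grows. With a truly huge multiple (10^100,
    # say) floating point precision, and eventually overflow, become issues.
    print(np.linalg.norm(x0), np.linalg.norm(x0 + 100.0 * z))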
Some not-necessarily mathematical observations:
- As already noted, most least squares problems in applications are overdetermined (i.e., m > n).
- For underdetermined systems, in computational practice all that is
really needed is a solution x that is bounded in size, and not necessarily the
unique one of minimal norm. The issue is loss of precision in a floating point
representation, and if a solution x has a norm 100 times larger than the
minimum norm solution, that is almost always acceptable with 64-bit double
precision numbers.
On the other hand, a solution x with norm
(21526766112879430287954321)!
times as large as the minimal one is probably not going to fly. A rough rule
is that the norm can be a factor of 100*m*n larger than the minimal one. This
rule has no basis whatsoever in mathematical analysis, only a lifetime of
observing what happens in least squares problems from a broad range of
scientific applications, so don't take it as divine revelation (or even as a
Satanic revelation).
- Choosing to minimize the two-norm of the residual vector r = b − Ax is
sometimes an arbitrary choice. For mathematicians, one advantage of the
two-norm is that when it is squared, the resulting function of x is
differentiable, so the math works out more easily. In many applications
minimizing the 1-norm or the ∞-norm is a significantly better choice; a
sketch of 1-norm minimization via linear programming appears after this list.
- Geometrically, a least squares problem corresponds to finding a vector x so
that the corresponding residual r = b − Ax is orthogonal to the columns of
A. That characterization gives the normal equations:
A^T r = A^T(b − Ax) = 0
or equivalently, A^T A x = A^T b.
- The normal equations give the "different characterization" for the residual
vector mentioned above: a vector x solves LLS if and only if it satisfies the
normal equations (and so, iff A^T r = 0). A short NumPy check of this appears
below.
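To make the orthogonality characterization concrete, here is a short NumPy
check on a made-up overdetermined system (the data is invented for
illustration). Note that forming A^T A explicitly is generally a numerically
inferior way to solve LLS; it appears here only as a sanity check.

    import numpy as np

    # A made-up overdetermined system: m = 4, n = 2.
    A = np.array([[1.0, 1.0],
                  [1.0, 2.0],
                  [1.0, 3.0],
                  [1.0, 4.0]])
    b = np.array([1.0, 2.0, 2.0, 3.0])

    x = np.linalg.lstsq(A, b, rcond=None)[0]

    # The residual is nonzero (b is not in range(A)) ...
    r = b - A @ x
    print(np.linalg.norm(r))            # about 0.45, not zero

    # ... but r is orthogonal to the columns of A: A^T r = 0 to roundoff.
    print(A.T @ r)                      # roughly [0, 0]

    # The normal equations A^T A x = A^T b reproduce the same solution.
    x_ne = np.linalg.solve(A.T @ A, A.T @ b)
    print(np.allclose(x, x_ne))         # True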
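And here is a sketch of 1-norm minimization, reformulated as a linear program
and solved with scipy.optimize.linprog (again, the data is made up). The
standard reformulation introduces auxiliary variables u with
-u <= Ax - b <= u and minimizes the sum of the u_i.

    import numpy as np
    from scipy.optimize import linprog

    # A made-up overdetermined system: m = 4, n = 2.
    A = np.array([[1.0, 1.0],
                  [1.0, 2.0],
                  [1.0, 3.0],
                  [1.0, 4.0]])
    b = np.array([1.0, 2.0, 2.0, 3.0])
    m, n = A.shape

    # Minimize ||Ax - b||_1 = sum(u) subject to A x - u <= b and
    # -A x - u <= -b; the unknowns are stacked as [x, u].
    c = np.concatenate([np.zeros(n), np.ones(m)])
    A_ub = np.block([[ A, -np.eye(m)],
                     [-A, -np.eye(m)]])
    b_ub = np.concatenate([b, -b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * n + [(0, None)] * m)
    x_l1 = res.x[:n]
    print(x_l1)    # the 1-norm minimizer; compare with the 2-norm fit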
Next: ways of solving LLS
Last Modified: Mon 04 Nov 2019, 07:20 AM