Latent Semantic Indexing


This is intended as a brief set of key words; as indicated in class, the example below is from the Berry and Browne book Understanding Search Engines. If you are interested in pursuing this material further, buy the book: it's cheap like most SIAM books, and brief enough to give you a good survey of the field as it stood when the book appeared. Also, Mike Berry maintains an extensive set of web pages on latent semantic indexing, including links to his software. Going beyond just LSI, you might check the BOW Project and its set of software.

Intro

Because of the range of skills and needs in this course, I'm using the singular value decomposition (SVD) applied to text retrieval as an "application". It can (and does, at least for Google) have value in that context. But the material is presented here mainly as motivation. If the SVD itself doesn't seem crystal clear, that's all right. Hundreds of researchers are merrily using it to good effect with no better understanding. If Google really does use the SVD, then millions of people are using it daily without a clue that it's an underlying technology for web search. You mainly just need to be clear on how to extract the first k columns of an mxn matrix, or the diagonal entries of an array, to make things work here.

Example

The example in class had as documents the titles of books from a query run on baby-proofing a house:
D1: Infant and Toddler First Aid
D2: Babies and Children's Room For Your Home
D3: Child Safety at Home
D4: Your Baby's Health and Safety: from Infant to Toddler
D5: Baby Proofing Basics
D6: Your Guide to Easy Rust Proofing
D7: Beanie Babies Collectors Guide

The terms extracted from the documents were:
T1: bab(y,ies,y's)
T2: child(ren's)
T3: guide
T4: health
T5: home
T6: infant
T7: proofing
T8: safety
T9: toddler

Those comprise the "dictionary" that the example uses. The term-document matrix is formed by setting entry (i,j) to the number of times term i occurs in document j:
A = [...
0	1	0	1	1	0	1
0	1	1	0	0	0	0
0	0	0	0	0	1	1
0	0	0	1	0	0	0
0	1	1	0	0	0	0
1	0	0	1	0	0	0
0	0	0	0	1	1	0
0	0	1	1	0	0	0
1	0	0 	1	0	0	0 ];

Normalize each column to have 2-norm equal to 1.0, so that a lengthy document does not have more weight than a shorter one does; it is the relative frequency of terms that is needed:
A = [...
         0   5.7735e-01            0   4.4721e-01   7.0711e-01            0   7.0711e-01
         0   5.7735e-01   5.7735e-01            0            0            0            0
         0            0            0            0            0   7.0711e-01   7.0711e-01
         0            0            0   4.4721e-01            0            0            0
         0   5.7735e-01   5.7735e-01            0            0            0            0
7.0711e-01            0            0   4.4721e-01            0            0            0
         0            0            0            0   7.0711e-01   7.0711e-01            0
         0            0   5.7735e-01   4.4721e-01            0            0            0
7.0711e-01            0            0   4.4721e-01            0            0            0];
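
The normalized matrix above can be reproduced from the raw counts in a couple of lines. This is only a sketch: the names colnorms and An are mine, not from the book, and An is used so that the raw counts in A are not overwritten.

```matlab
% Raw term-document counts: rows are terms T1..T9, columns are documents D1..D7.
A = [0 1 0 1 1 0 1
     0 1 1 0 0 0 0
     0 0 0 0 0 1 1
     0 0 0 1 0 0 0
     0 1 1 0 0 0 0
     1 0 0 1 0 0 0
     0 0 0 0 1 1 0
     0 0 1 1 0 0 0
     1 0 0 1 0 0 0];

colnorms = sqrt(sum(A.^2, 1));     % 1-by-7 row holding the 2-norm of each column
An = A * diag(1 ./ colnorms);      % scale column j by 1/colnorms(j)

disp(sqrt(sum(An.^2, 1)));         % every entry should be 1, up to rounding
```

Multiplying by a diagonal matrix on the right scales the columns; it is used here instead of element-wise division by a row vector because it works the same way in every version of Matlab.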

Given a new title (document), we want to find out how "close" it is to the documents defined by the columns of the term-document matrix. Form a vector of length equal to the number of terms, with entry i equal to 1 if term number i of the dictionary is present in the new title, and 0 otherwise. Suppose the title is "Child Proofing". The corresponding term vector for that document is
   q1 = [0	1	0	0	0	0	1	0	0]';
For "Child Home Safety", it is
   q2 = [0  1   0   0   1   0   0   1   0]';
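Then compute the cosines of the angles between the document vectors and the new query vector. In matrix terms, this means looking at the values of A'*q, where q is the query vector (several queries can be handled at once by making them the columns of a matrix). Because the columns of A already have unit length, A'*q gives the cosines exactly once q is also scaled to unit length; without that scaling the values are all off by the same factor norm(q), so the ranking of the documents is unchanged. Geometrically, this is finding the document vectors most closely aligned with the query vector, by choosing the ones with the largest cosines.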
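
As a sketch of that computation for the two sample queries (the names An, Q, and cosines are mine, chosen for illustration), both queries are stacked as the columns of one matrix and scaled to unit length so that the products are true cosines:

```matlab
% Raw term-document counts: rows are terms T1..T9, columns are documents D1..D7.
A = [0 1 0 1 1 0 1
     0 1 1 0 0 0 0
     0 0 0 0 0 1 1
     0 0 0 1 0 0 0
     0 1 1 0 0 0 0
     1 0 0 1 0 0 0
     0 0 0 0 1 1 0
     0 0 1 1 0 0 0
     1 0 0 1 0 0 0];
An = A * diag(1 ./ sqrt(sum(A.^2, 1)));   % unit-length document vectors

q1 = [0 1 0 0 0 0 1 0 0]';   % "Child Proofing"
q2 = [0 1 0 0 1 0 0 1 0]';   % "Child Home Safety"
Q  = [q1 q2];
Q  = Q * diag(1 ./ sqrt(sum(Q.^2, 1)));   % unit-length query vectors

cosines = An' * Q;   % 7-by-2: entry (j,i) is the cosine between document j and query i
disp(cosines);
```

For "Child Proofing", D5 and D6 tie for the largest cosine at 0.5, which is the false hit on the rust proofing title. For "Child Home Safety", D3 (Child Safety at Home) has cosine 1 because it contains exactly the same three terms.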

Where does the SVD fit? Note that if you perform the query with the two terms "child" and "proofing", the top two hits you get would include D6: Your Guide to Easy Rust Proofing. That is obviously a false hit. The problem is that we need to find documents whose terms appear in close association with the query terms, not necessarily documents containing exactly the same terms. The SVD accomplishes this. It can be written as

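A = sum_i (si * ui * vi')

where the u's and v's are the columns of the SVD matrices U and V, respectively, and the prime denotes transpose, so each term si*ui*vi' is a rank-one matrix of the same size as A. The singular values si are sorted in decreasing order, so the first few terms carry the most weight.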

Using Matlab, the decomposition is given by

[U, S, V] = svd(A);

where S is a diagonal matrix with the singular values on the diagonal in decreasing order, U is an mxm orthogonal matrix formed by putting the ui's together as columns into an array, and V is an nxn orthogonal matrix formed similarly by stacking the vi's as the columns of V. [You should immediately check the sizes of everything. If A is mxn, U is mxm, and V is nxn, what size is the diagonal matrix S?]
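
One way to convince yourself that the summation form and the matrix form agree is to rebuild the matrix one rank-one term at a time. This is a small check, not part of the book's example; the names An, B, and r are mine.

```matlab
% Raw term-document counts: rows are terms T1..T9, columns are documents D1..D7.
A = [0 1 0 1 1 0 1
     0 1 1 0 0 0 0
     0 0 0 0 0 1 1
     0 0 0 1 0 0 0
     0 1 1 0 0 0 0
     1 0 0 1 0 0 0
     0 0 0 0 1 1 0
     0 0 1 1 0 0 0
     1 0 0 1 0 0 0];
An = A * diag(1 ./ sqrt(sum(A.^2, 1)));   % unit-length document vectors

[U, S, V] = svd(An);

r = min(size(An));           % number of singular values
B = zeros(size(An));
for i = 1:r
    % add the rank-one term si * ui * vi'
    B = B + S(i,i) * U(:,i) * V(:,i)';
end

disp(norm(An - B, 'fro'));   % should be at rounding-error level
```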

Optimality properties of the SVD

The SVD gives an optimal approximation to A if you cut off the summation above after a few terms. Define
Ak = U(:,1:k)*S(1:k,1:k)*[V(:,1:k)]' for some k < min(m,n).

Then Ak is the matrix X that minimizes the Frobenius norm ||X - A|| over all matrices of rank k or less. The SVD also tells us the size of the error: the square of the error norm equals the sum of the squares of the neglected singular values. This means that when performing queries we can replace A with the much smaller factors U(:,1:k), S(1:k,1:k), and V(:,1:k), making the queries much faster. For example, the number of documents could be in the millions, while the SVD can give a good approximation for k less than a thousand.
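
The error statement can be checked numerically on the small example. The sketch below (variable names are mine) forms Ak explicitly, which is fine for a 9-by-7 matrix, and compares the squared Frobenius error with the sum of squares of the neglected singular values for each k:

```matlab
% Raw term-document counts: rows are terms T1..T9, columns are documents D1..D7.
A = [0 1 0 1 1 0 1
     0 1 1 0 0 0 0
     0 0 0 0 0 1 1
     0 0 0 1 0 0 0
     0 1 1 0 0 0 0
     1 0 0 1 0 0 0
     0 0 0 0 1 1 0
     0 0 1 1 0 0 0
     1 0 0 1 0 0 0];
An = A * diag(1 ./ sqrt(sum(A.^2, 1)));   % unit-length document vectors

[U, S, V] = svd(An);
s = diag(S);                              % singular values, largest first

for k = 1:length(s)-1
    Ak    = U(:,1:k) * S(1:k,1:k) * V(:,1:k)';   % rank-k approximation
    err2  = norm(An - Ak, 'fro')^2;              % squared approximation error
    tail2 = sum(s(k+1:end).^2);                  % neglected singular values, squared and summed
    fprintf('k = %d   error^2 = %.6f   tail sum = %.6f\n', k, err2, tail2);
end
```

The two printed columns should agree to rounding error for every k. Since each of the seven columns has unit length, the squares of all the singular values add up to 7, which gives a scale for judging how much of the matrix a given k captures.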

The reason this is good news is that often a relatively small number of singular values/vectors are needed to capture most of the features in the matrix A. For some problems in fusion energy computations, k = 30 works well on a problem with m = n = 16384. That means the relevant information is compressed into (30 + 2*30*16384)/(16384*16384) = 0.37% as much space.
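
The storage figure is simple arithmetic: k singular values plus k columns each of U and V, compared with all m*n entries of the full matrix. The variable names below are mine.

```matlab
m = 16384;  n = 16384;  k = 30;

full_entries    = m * n;             % entries in the original matrix
reduced_entries = k + k*m + k*n;     % k singular values, k columns of U, k columns of V

ratio = reduced_entries / full_entries;
fprintf('%.2f%%\n', 100 * ratio);    % prints 0.37%
```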

Notice that if an SVD has been created, people don't actually set up the reduced rank matrix Ak. If you do, then you've thrown away all of the data space savings since Ak is the same size as the original A. Instead the approximation is needed for things like finding the cosines of angles that its columns form to a given query vector q. In that case,

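cosines = (Ak)' * q
is computed in three stages, working from right to left so that only vectors are ever formed (in Matlab notation):
    cosines = V(:,1:k)*(S(1:k,1:k)*(U(:,1:k)'*q))
The parentheses matter: Matlab evaluates a chain of products from left to right, so without them it would first build the full nxm matrix (Ak)' and lose the savings.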
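
Here is a sketch of the staged computation on the small example, with a check against the explicitly formed Ak. The choice k = 4 and the names t1, t2, and scores are assumptions made for illustration. Note also that the columns of Ak are no longer unit length, so the resulting values are scores proportional to the projections rather than exact cosines unless you also divide by the column norms of Ak and by norm(q).

```matlab
% Raw term-document counts: rows are terms T1..T9, columns are documents D1..D7.
A = [0 1 0 1 1 0 1
     0 1 1 0 0 0 0
     0 0 0 0 0 1 1
     0 0 0 1 0 0 0
     0 1 1 0 0 0 0
     1 0 0 1 0 0 0
     0 0 0 0 1 1 0
     0 0 1 1 0 0 0
     1 0 0 1 0 0 0];
An = A * diag(1 ./ sqrt(sum(A.^2, 1)));   % unit-length document vectors

[U, S, V] = svd(An);
k = 4;                        % assumed value, chosen only for illustration
q = [0 1 0 0 0 0 1 0 0]';     % "Child Proofing"

t1     = U(:,1:k)' * q;       % stage 1: k-by-1, query expressed in the first k left singular vectors
t2     = S(1:k,1:k) * t1;     % stage 2: k-by-1, scaled by the singular values
scores = V(:,1:k) * t2;       % stage 3: 7-by-1, one score per document

% Check against the explicit rank-k matrix (only sensible for a tiny example).
Ak = U(:,1:k) * S(1:k,1:k) * V(:,1:k)';
disp(norm(scores - Ak' * q)); % should be at rounding-error level
```

Each stage is a matrix-vector product, so the work grows with k*(m+n) rather than with m*n.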

The left singular vectors ui are generally dense vectors: (almost) entirely nonzeros. The first singular vector u1 is a linear combination of all the document vectors that best captures the information in all the columns of A. This density means that it entails (to greater or lesser extent) all of the terms.

Another way of looking at this is that the first terms in the summation form of the SVD give the "signal", while the others (corresponding to the smaller singular values) are "noise": happenstance juxtapositions of terms, misleading misspellings, and so on. The full matrix A gives more or less equal weight to both, while using the SVD separates the important information (signal) from the misleading and irrelevant (noise).

You should explore just how good a "reduced-rank" approximation like Ak is to the original matrix for some sample term-document matrices.


  • First started: Tue Mar 9 14:31:43 EST 2004
  • Modified: Mon Mar 3 14:50:58 EST 2008