D1: Infant and Toddler First Aid
D2: Babies and Children's Room For Your Home
D3: Child Safety at Home
D4: Your Baby's Health and Safety: From Infant to Toddler
D5: Baby Proofing Basics
D6: Your Guide to Easy Rust Proofing
D7: Beanie Babies Collector's Guide
T1: bab(y,ies,y's)
T2: child(ren's)
T3: guide
T4: health
T5: home
T6: infant
T7: proofing
T8: safety
T9: toddler
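A = [...
0 1 0 1 1 0 1
0 1 1 0 0 0 0
0 0 0 0 0 1 1
0 0 0 1 0 0 0
0 1 1 0 0 0 0
1 0 0 1 0 0 0
0 0 0 0 1 1 0
0 0 1 1 0 0 0
1 0 0 1 0 0 0];
Each row corresponds to a term T1 to T9 and each column to a document D1 to D7. Scaling every column to unit Euclidean length gives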
A = [...
0 5.7735e-01 0 4.4721e-01 7.0711e-01 0 7.0711e-01
0 5.7735e-01 5.7735e-01 0 0 0 0
0 0 0 0 0 7.0711e-01 7.0711e-01
0 0 0 4.4721e-01 0 0 0
0 5.7735e-01 5.7735e-01 0 0 0 0
7.0711e-01 0 0 4.4721e-01 0 0 0
0 0 0 0 7.0711e-01 7.0711e-01 0
0 0 5.7735e-01 4.4721e-01 0 0 0
7.0711e-01 0 0 4.4721e-01 0 0 0];
q1 = [0 1 0 0 0 0 1 0 0]';
For "Child Home Safety", it is
q2 = [0 1 0 0 1 0 0 1 0]';
Then compute the cosines of the angles between the document vectors (the columns of A) and the query vectors. In matrix terms, this means looking at the values of A'*q, where q is a query vector (or a matrix whose columns are the query vectors). Because the columns of A have unit length, A'*q equals those cosines scaled by norm(q), so the ranking of the documents for a given query is unchanged. Geometrically, this finds the document vectors most closely aligned with the query vector by choosing the ones with the largest cosines.
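The cosine ranking just described can be tried out with a short Python sketch. It is only an illustration under a few assumptions: the helper names `cosines` and `ranked` are made up for this example, the matrix is the 0/1 term-by-document table above, and both the columns and the query are normalized inside the function, so the values are true cosines rather than A'*q itself.

```python
import math

# Term-by-document counts from the section: rows T1..T9, columns D1..D7.
A = [
    [0, 1, 0, 1, 1, 0, 1],  # T1 bab(y,ies,y's)
    [0, 1, 1, 0, 0, 0, 0],  # T2 child(ren's)
    [0, 0, 0, 0, 0, 1, 1],  # T3 guide
    [0, 0, 0, 1, 0, 0, 0],  # T4 health
    [0, 1, 1, 0, 0, 0, 0],  # T5 home
    [1, 0, 0, 1, 0, 0, 0],  # T6 infant
    [0, 0, 0, 0, 1, 1, 0],  # T7 proofing
    [0, 0, 1, 1, 0, 0, 0],  # T8 safety
    [1, 0, 0, 1, 0, 0, 0],  # T9 toddler
]

def cosines(A, q):
    """Cosine of the angle between q and each column (document) of A."""
    q_norm = math.sqrt(sum(x * x for x in q))
    result = []
    for col in zip(*A):  # zip(*A) walks over the document columns
        col_norm = math.sqrt(sum(x * x for x in col))
        dot = sum(a * b for a, b in zip(col, q))
        result.append(dot / (col_norm * q_norm))
    return result

def ranked(A, q):
    """Document labels with their cosines, largest cosine first."""
    cos = cosines(A, q)
    order = sorted(range(len(cos)), key=lambda j: -cos[j])
    return [(f"D{j + 1}", round(cos[j], 4)) for j in order]

q1 = [0, 1, 0, 0, 0, 0, 1, 0, 0]  # terms T2 and T7
q2 = [0, 1, 0, 0, 1, 0, 0, 1, 0]  # terms T2, T5 and T8

for name, q in (("q1", q1), ("q2", q2)):
    print(name, ranked(A, q))
```

For q1 the two largest cosines are a tie at 0.5 between D5 and D6, and for q2 the best match is D3 with cosine 1, since D3 contains exactly the three query terms.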
Where does SVD fit?
Note that if you query on the two terms "child" and "proofing", the
top two hits include D6: Your Guide to Easy Rust Proofing. That
is obviously a false hit. The problem is that we need to find documents
whose terms appear in close association with the query terms, not
necessarily documents containing exactly the same terms. The SVD accomplishes this.
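It can be written as
A = U*S*V' = s1*u1*v1' + s2*u2*v2' + ... + sr*ur*vr'
where the columns ui of U and vi of V are the orthonormal left and right singular vectors, S is diagonal with the singular values s1 >= s2 >= ... >= sr > 0, and r is the rank of A.
Using Matlab, the decomposition is given by
[U,S,V] = svd(A);
Keeping only the first k terms of the sum gives the rank-k approximation Ak = U(:,1:k)*S(1:k,1:k)*V(:,1:k)'.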
The reason this is good news is that often a relatively small number of singular values and vectors is needed to capture most of the features in the matrix A. For some problems in fusion energy computations, k = 30 works well on a problem with m = n = 16384. Storing k singular values plus k left and k right singular vectors then takes (30 + 2*30*16384)/(16384*16384) = 0.37% as much space as the full matrix.
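The storage figure is easy to check. The Python sketch below uses a made-up helper name, `truncated_svd_storage_fraction`, and assumes that only the k singular values and the first k left and right singular vectors are kept.

```python
def truncated_svd_storage_fraction(m, n, k):
    """Fraction of the m*n entries of A needed to keep k singular triplets."""
    # k singular values, k left vectors of length m, k right vectors of length n
    factored = k + k * (m + n)
    return factored / (m * n)

frac = truncated_svd_storage_fraction(16384, 16384, 30)
print(f"{100 * frac:.2f}% of the original storage")  # prints 0.37% of the original storage
```

The fraction grows linearly in k, so doubling k to 60 on the same problem would still need under 1% of the original space.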
Notice that once an SVD has been computed, people don't actually form the reduced-rank matrix Ak. Doing so would throw away all of the storage savings, since Ak is the same size as the original A. Instead, the approximation is applied in factored form for things like finding the cosines of the angles between its columns and a given query vector q. In that case,
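cosines = (Ak)' * q
is computed in three stages, using (in Matlab notation)
cosines = V(:,1:k)*(S(1:k,1:k)*(U(:,1:k)'*q))
The parentheses matter: Matlab evaluates a chain of products from left to right, so without them the full n-by-m matrix (Ak)' would be formed before it is applied to q.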
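As a rough illustration of the staged evaluation, the Python sketch below builds the leading k singular triplets of the normalized term-by-document matrix and applies them to q1 one factor at a time. Everything named here is an assumption made for the example: `top_triplets` is a simple power iteration with deflation, used only as a stand-in for Matlab's svd and not as a robust algorithm, and k = 2 is an arbitrary choice. The script also forms Ak explicitly once, purely to confirm that the two routes agree.

```python
import math

# Term-by-document counts from the section (rows T1..T9, columns D1..D7).
A = [
    [0, 1, 0, 1, 1, 0, 1],
    [0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0],
]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def matvec(M, x):
    return [dot(row, x) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def normalize_columns(M):
    """Scale every column of M to unit Euclidean length."""
    cols = [[x / math.sqrt(dot(c, c)) for x in c] for c in zip(*M)]
    return transpose(cols)

def top_triplets(M, k, iters=1000):
    """Leading k singular triplets via power iteration on M'M with deflation.
    Hypothetical teaching helper, not a replacement for a library SVD."""
    Mt = transpose(M)
    n = len(M[0])
    us, sigmas, vs = [], [], []
    for j in range(k):
        v = [1.0 + 0.1 * (j + 1) * i for i in range(n)]  # arbitrary start vector
        for _ in range(iters):
            for vp in vs:  # remove components along vectors already found
                c = dot(v, vp)
                v = [a - c * b for a, b in zip(v, vp)]
            v = matvec(Mt, matvec(M, v))
            nrm = math.sqrt(dot(v, v))
            v = [a / nrm for a in v]
        Mv = matvec(M, v)
        sigma = math.sqrt(dot(Mv, Mv))
        us.append([a / sigma for a in Mv])  # u_j = M v_j / sigma_j
        sigmas.append(sigma)
        vs.append(v)
    return us, sigmas, vs

def staged_cosines(us, sigmas, vs, q):
    """Compute V_k * (S_k * (U_k' * q)) without ever forming Ak."""
    t = [dot(u, q) for u in us]             # stage 1: U_k' * q, k numbers
    t = [s * x for s, x in zip(sigmas, t)]  # stage 2: S_k * t, k numbers
    n = len(vs[0])
    return [sum(t[i] * vs[i][j] for i in range(len(vs))) for j in range(n)]  # stage 3

def explicit_cosines(us, sigmas, vs, q):
    """Form the full m-by-n matrix Ak first (wasteful), then apply Ak' to q."""
    m, n = len(us[0]), len(vs[0])
    Ak = [[sum(s * u[r] * v[c] for s, u, v in zip(sigmas, us, vs))
           for c in range(n)] for r in range(m)]
    return matvec(transpose(Ak), q)

An = normalize_columns(A)
q1 = [0, 1, 0, 0, 0, 0, 1, 0, 0]  # terms T2 and T7
us, sigmas, vs = top_triplets(An, k=2)
staged = staged_cosines(us, sigmas, vs, q1)
explicit = explicit_cosines(us, sigmas, vs, q1)
print([round(x, 4) for x in staged])
print(max(abs(a - b) for a, b in zip(staged, explicit)))  # rounding-level difference
```

The staged route only ever touches vectors of length k, m, or n, which is where the savings come from. Changing k in the call to `top_triplets` is one way to see how the scores for D5 and D6 move as more or fewer singular triplets are kept.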
The left singular vectors ui are generally dense vectors: (almost) entirely nonzeros. The first singular vector u1 is a linear combination of all the document vectors that best captures the information in all the columns of A. This density means that each ui involves (to a greater or lesser extent) all of the terms.
Another way of looking at this is that the first terms in the summation form of the SVD give the "signal", while the others (corresponding to the smaller singular values) are "noise": happenstance juxtapositions of terms, misleading misspellings, etc. The full matrix A gives more or less equal weight to both, while using the SVD separates the important information (signal) from the misleading and irrelevant (noise).
You should explore just how well a "reduced-rank" approximation like Ak matches the original matrix for some sample term-document matrices.