Matlab makes computing the singular value decomposition (SVD) of a text-document matrix easy, but that just moves the hard work elsewhere: in this case, into the text preparation and file handling. This assignment is really meant to teach those two aspects.
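To see how little code the SVD step itself takes, here is a minimal sketch. The matrix entries and the choice k = 2 are made up for illustration and are not taken from the manpage example.

```matlab
% Toy matrix: rows stand for words, columns for documents.
% The counts are invented purely for illustration.
A = [2 0 1 0;
     1 1 0 0;
     0 3 0 1;
     0 0 2 2;
     1 0 0 3];

[U, S, V] = svd(A, 'econ');      % economy-size SVD, so that A = U*S*V'

k  = 2;                          % number of singular values kept (arbitrary here)
Ak = U(:, 1:k) * S(1:k, 1:k) * V(:, 1:k)';   % rank-k approximation of A

fprintf('singular values: %s\n', mat2str(diag(S)', 4));
fprintf('relative error of the rank-%d approximation: %.4f\n', k, ...
        norm(A - Ak, 'fro') / norm(A, 'fro'));
```

Everything that feeds a matrix like A (reading files, splitting text, counting words) is the part this assignment concentrates on.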
The first task is the preprocessing needed before a latent semantic analysis can be done on a collection of text documents. For this LSI problem, that means creating the two m-files "docs.m" and "dictionary.m" that are used in the example given. We'll push out from that task to actually using LSI/SVD to analyze the documents as a second part later (as Exercise 4); for now, let's concentrate on learning how to manipulate text and files from within Matlab. Go ahead and download the tar-ball manpage_example.tar, and unpack it somewhere so you can look through it while reading this page.
`-- manpage_example
|-- LSI
| |-- blank_split.m
| |-- dictionary.m
| |-- docs.m
| |-- findrows.m
| |-- getdoc.m
| `-- lsi.m
|-- create_dict
|-- create_docs
|-- dictionary
|-- docs
| |-- arch.def
| |-- ash.def
| |-- awk.def
| |-- basename.def
. . .
. . .
. . .
| |-- vimtutor.def
| |-- ypdomainname.def
| `-- zcat.def
|-- manpages
| |-- arch
| |-- ash
| |-- awk
| |-- basename
. . .
. . .
. . .
| |-- vimtutor
| |-- ypdomainname
| `-- zcat
|-- setup_manpages
| |-- create_manpage_files
| `-- list_of_manpages
`-- stop_words
This way, when the analysis code LSI/lsi.m is run, the word-count documents themselves are known to be "one directory up, over in subdirectory docs".
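As a rough sketch of what "one directory up, over in subdirectory docs" looks like from inside Matlab, the snippet below builds that relative path and lists the .def files in it. It assumes the current directory is manpage_example/LSI; the variable names are only for illustration.

```matlab
% Assumes the current directory is manpage_example/LSI (where lsi.m lives).
docdir = fullfile('..', 'docs');             % one directory up, then into docs
files  = dir(fullfile(docdir, '*.def'));     % the word-count files, if any

if isempty(files)
    fprintf('No .def files found in %s (are you in the LSI directory?)\n', docdir);
else
    for k = 1:numel(files)
        fprintf('%s\n', fullfile(docdir, files(k).name));
    end
end
```

Using fullfile rather than gluing strings together with '/' keeps the path correct on systems that use a different separator.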
You can do this in one single step, but you may find it easier to break it down into substeps and have preprocess.m simply invoke those in turn. For example, my preprocess.m basically looks like
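% Each step returns 0 on failure; stop at the first one that fails.
cf = check_files;
if (cf == 0)
    error('Something bad happened in check_files.m');
end
wc = create_wordcounts;
if (wc == 0)
    error('Something bad happened in create_wordcounts.m');
end
% Named dict_ok rather than cd, which would shadow Matlab's built-in cd command.
dict_ok = create_dictionary;
if (dict_ok == 0)
    error('Something bad happened in create_dictionary.m');
end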
although I do have a bit more error checking than shown here. I have also split create_wordcounts.m into a series of calls to subfunctions, pretty much corresponding to the process described above. All I require from you is a preprocess.m script that can be invoked and will automatically set up the required files. You are free to do it all in one big honking step or in many smaller ones (which is what I strongly recommend).
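As one illustration of the kind of small substep that a word-count stage might delegate to, here is a sketch that counts the words in a string after dropping stop words. The input text and the stop-word list are invented, and the actual format of the .def files is not assumed here; this shows only the counting idea, not my create_wordcounts.m.

```matlab
% Hypothetical input text and stop-word list, for illustration only.
txt  = 'the cat sat on the mat and the dog sat too';
stop = {'the', 'on', 'and'};

words = regexp(lower(txt), '[a-z]+', 'match');   % cell array of lowercase words
words = words(~ismember(words, stop));           % drop the stop words

[uniq, ~, idx] = unique(words);                  % sorted list of distinct words
counts = accumarray(idx(:), 1);                  % how often each one occurs

for k = 1:numel(uniq)
    fprintf('%-6s %d\n', uniq{k}, counts(k));
end
```

Writing and testing a piece like this on a short string first makes it much easier to trust once it is run over a whole directory of files.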
Before you dive in, however, try some smaller steps. For example, write a small script that opens a file, reads in everything in it, and then writes it back out to a different file.
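A minimal sketch of such a warm-up script follows. The file names are placeholders, and the script first creates its own small input file so that it can be run anywhere.

```matlab
% Placeholder file names, chosen only for this illustration.
infile  = 'copy_in.txt';
outfile = 'copy_out.txt';

% Create a small input file so the script is self-contained.
fid = fopen(infile, 'w');
if fid == -1, error('Could not create %s', infile); end
fprintf(fid, 'first line\nsecond line\n');
fclose(fid);

% Read the whole file into one row vector of characters.
fid = fopen(infile, 'r');
if fid == -1, error('Could not open %s for reading', infile); end
txt = fread(fid, Inf, '*char')';
fclose(fid);

% Write the same characters back out to a different file.
fid = fopen(outfile, 'w');
if fid == -1, error('Could not open %s for writing', outfile); end
fwrite(fid, txt, 'char');
fclose(fid);

fprintf('Copied %d characters from %s to %s\n', numel(txt), infile, outfile);
```

Checking the value returned by fopen every time is worth the extra lines: a file identifier of -1 is the usual sign of a wrong path or a missing file.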
Notice that in the diary files I periodically would enter the command "ccc". This is a home-brewed one that I have in my Matlab startup.m file. It basically cleans up everything, including compiled functions, etc.:
warning('off', 'all');
clear all;
clear global;
clear functions;
clear java;
clear classes;
warning('on', 'all');
hold off;
close all;
clc;
I turn off warnings because if you try to clear something like global variables and there aren't any, Matlab prints a warning. If you do the same, always be sure to turn them back on, as I do above. You can use the same trick for places in your code where you get an ignorable warning message.
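A slightly more careful variant of the same trick, sketched below, saves the current warning settings and restores exactly those afterwards instead of switching everything back on. The message identifier 'demo:ignorable' is made up for this illustration.

```matlab
orig_state = warning;            % remember the current warning settings
warning('off', 'all');           % silence warnings for the noisy step

% Stand-in for a step that issues an ignorable warning.
warning('demo:ignorable', 'This message is suppressed.');

warning(orig_state);             % put the settings back as they were
```

This way, any warnings you had deliberately turned off before the noisy step stay off after it.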
Here are a diary file and the script files used in class on 17 and 19 March. The names are admittedly cryptic, but I wanted ones that I could not foul up when typing while standing up. I also added a few comments to them to help explain what they are for.
Another script, e.m, I wrote while helping someone from class debug things, and it might be helpful to others. If you want to download these, here is a single tarball of the scripts. It is named "19March2008.tar", so if your OS doesn't like file names that start with a number, rename it on download.