Exercise 3, Computer Science A321


Latent semantic indexing (LSI) in Matlab

Matlab makes computing the singular value decomposition (SVD) of a text-document matrix easy, but that just moves the hard work elsewhere: in this case, into the text preparation and file handling. This assignment is really meant to teach those two aspects.

The first task is the preprocessing needed before a latent semantic analysis can be done on a collection of text documents. For this LSI problem, that means creating the two m-files "docs.m" and "dictionary.m" that are used in the example given. We'll push on from that task to actually using LSI/SVD to analyze the documents as a second part later (as Exercise 4); for now, let's concentrate on learning how to manipulate text and files from within Matlab. Go ahead and download the tarball manpage_example.tar, and unpack it somewhere so you can look through it while reading this page.

The preprocessing procedure to create the LSI example is:

  1. Extract the help pages ("man pages" in Unix-speak) from the system and copy them to a local directory. That is already done and you can use the resulting documents in the subdirectory "manpages". The extraction step depends too much on a particular OS (Mac, Windows, Unix) to be generalized, and in any real case you'll want to use this procedure on data from something completely different.
  2. For each file "fname" in subdirectory "manpages",
    1. filter out all non-alphabetic characters
    2. convert all upper-case letters to lower-case, since capitalization has little to do with the semantic meaning of a word
    3. remove some common "stop words", listed in file stop_words
    4. get the number of times each unique word appears. Do a "help unique" in Matlab to see how that can be done.
    5. extract the 40 most frequently appearing words (by sorting on the counts)
    6. print to a file "fname.def" in the subdirectory "docs" the counts and words, sorted alphabetically by word
  3. Create the file "dictionary", which contains all of the words that appear in any of the fname.def files. Again use "sort" and "unique" commands to have the dictionary in alphabetical order and with no repetitions.
  4. This stage can be combined with the last step, or done separately: Create an m-file "dictionary.m" that sets up the dictionary as in the example provided here. So you can either start with the list of words given in file "dictionary", and then create "dictionary.m", or as part of step 3 directly create "dictionary.m". [My advice: break it down into steps.]
  5. Create the m-file docs.m, which has the list of documents and the location of their .def files. You can do this using the "ls" command in Matlab, probably in some form like sl = ls('../docs') or sl = ls('..\docs')
Notice that the last item assumes a certain directory structure. We can generalize this for other directory structures as needed, but to make getting started easier, go ahead and assume something like the one in the example given:
`-- manpage_example
    |-- LSI
    |   |-- blank_split.m
    |   |-- dictionary.m
    |   |-- docs.m
    |   |-- findrows.m
    |   |-- getdoc.m
    |   `-- lsi.m
    |-- create_dict
    |-- create_docs
    |-- dictionary
    |-- docs
    |   |-- arch.def
    |   |-- ash.def
    |   |-- awk.def
    |   |-- basename.def
    .   .   .         
    .   .   .         
    .   .   .         
    |   |-- vimtutor.def
    |   |-- ypdomainname.def
    |   `-- zcat.def
    |-- manpages
    |   |-- arch
    |   |-- ash
    |   |-- awk
    |   |-- basename
    .   .   .   
    .   .   .   
    .   .   .   
    |   |-- vimtutor
    |   |-- ypdomainname
    |   `-- zcat
    |-- setup_manpages
    |   |-- create_manpage_files
    |   `-- list_of_manpages
    `-- stop_words
This way, when the analysis code LSI/lsi.m is run, the word-count documents themselves are known to be "one directory up, over in subdirectory docs".
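
To make step 2 of the procedure concrete, here is a rough sketch of the word counting for a single document. It is only one way to do it, not the code behind the example. The sample text, the tiny stop-word list, the variable names, and the "count word" output layout are all made up for illustration; check the .def files in the tarball for the real layout. The assignment keeps the 40 most frequent words; the sketch keeps 3 so the result is easy to check by eye.

```matlab
% Hypothetical stand-ins for the text of one man page and for the
% contents of the stop_words file.
txt   = 'The awk utility scans each input FILE; awk(1) reads the file line by line.';
stopw = {'the', 'each', 'by'};
nkeep = 3;                      % the assignment uses 40

% Steps 2.1 and 2.2: lower-case, then keep only runs of letters.
words = regexp(lower(txt), '[a-z]+', 'match');

% Step 2.3: drop the stop words.
words = words(~ismember(words, stopw));

% Step 2.4: unique() returns the distinct words in alphabetical order;
% idx says which distinct word each entry of "words" is.
[uwords, ~, idx] = unique(words);
counts = accumarray(idx(:), 1)';

% Step 2.5: positions of the nkeep largest counts (fewer if the
% document has fewer distinct words).
[~, order] = sort(counts, 'descend');
keep = order(1:min(nkeep, numel(order)));

% Step 2.6: uwords is already alphabetical, so sorting the kept
% positions puts the kept words back in alphabetical order.
keep = sort(keep);

% Print to the screen (file id 1); swap in an fopen'ed id to write a file.
% The "count word" layout is an assumption.
for k = keep
    fprintf(1, '%d %s\n', counts(k), uwords{k});
end
```

Nothing here depends on how long the document is or how many distinct words it has, which is the habit you want for the real thing.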

What you need

You should create a Matlab script called "preprocess.m" that carries out all of the steps that create docs.m and dictionary.m, assuming only that those are to be created for all of the files in subdirectory "manpages".
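
For the dictionary part (step 3 of the procedure), the core idea fits in a few lines. The sketch below assumes the words from each fname.def file have already been read into a cell array of strings; the variable names and sample words are hypothetical, and it does not attempt to reproduce the layout of dictionary.m, which you should take from the example provided.

```matlab
% Hypothetical per-document word lists, standing in for the words read
% back from the fname.def files.
doc_words = { {'awk', 'file', 'line'}, ...
              {'file', 'name', 'path'}, ...
              {'archive', 'file', 'zcat'} };

% Pool every list into one long cell array, however many docs there are.
all_words = [doc_words{:}];

% unique() on a cell array of strings sorts alphabetically and removes
% the repeats, which is exactly what the dictionary needs.
dict   = unique(all_words);
nwords = numel(dict);

% Print one word per line to the screen (file id 1).
for k = 1:nwords
    fprintf(1, '%s\n', dict{k});
end
```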

You can do this in one single step, but you may find it easier to break it down into substeps, and have preprocess.m simply invoke those in turn. E.g., my preprocess.m basically looks like

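% Each helper returns 0 on failure and something nonzero on success.
cf = check_files;
if (cf == 0)
    error('Something bad happened in check_files.m');
end

wc = create_wordcounts;
if (wc == 0)
    error('Something bad happened in create_wordcounts.m');
end

% Named cdict rather than cd, so the built-in cd command is not shadowed.
cdict = create_dictionary;
if (cdict == 0)
    error('Something bad happened in create_dictionary.m');
end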

although I do have a bit more error-checking than shown here. I also have split create_wordcounts.m into a series of calls to subfunctions, pretty much corresponding to the process described above. All I require from you is the preprocess.m script that can be invoked and will automatically set up the required files. You are free to either do it all in one big honking step or in many smaller ones (which is what I strongly recommend).

Before you dive in, however, try some smaller steps. For example, just write a small script that will open a file, read in all of the stuff in it, and then write it back out to a different file.
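
A warm-up script of that kind might look something like the following. The file names are hypothetical, and the sketch first writes its own scratch input file in the system temp directory so that it runs anywhere; for real use, point infile at one of the files in "manpages".

```matlab
% Hypothetical scratch files in the system temp directory.
infile  = fullfile(tempdir, 'copy_in.txt');
outfile = fullfile(tempdir, 'copy_out.txt');

% Create a small input file so the sketch is self-contained.
fid = fopen(infile, 'w');
fprintf(fid, 'first line\nsecond line\n');
fclose(fid);

% Read the whole file into one row vector of characters.
fid = fopen(infile, 'r');
if fid == -1
    error('Could not open %s for reading.', infile);
end
txt = fread(fid, '*char')';
fclose(fid);

% Write the same characters back out to a different file.
fid = fopen(outfile, 'w');
if fid == -1
    error('Could not open %s for writing.', outfile);
end
fprintf(fid, '%s', txt);
fclose(fid);
```

Checking the value returned by fopen is worth the extra lines: a file id of -1 means the open failed, and every later call on it would fail in a less obvious way.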

Handin

Hand in all of the m-files needed to run your preprocess.m. I'll test it first by using the same data set in the glob of files for the assignment, then will test it again with a few other files added. So be sure to not hardwire in any assumptions about the number of docs or the number of distinct words!
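
One way to avoid hardwiring the number of documents is to ask Matlab for the directory listing and size everything from that. The sketch below uses "dir" rather than the "ls" command mentioned earlier, because dir returns a struct array that is easy to filter; that swap is my choice for the illustration, not something the example code requires. The scratch directory and file names are hypothetical stand-ins for "manpages".

```matlab
% Hypothetical scratch directory standing in for "manpages".
demo_dir = tempname;
mkdir(demo_dir);
demo_files = {'arch', 'awk', 'zcat'};
for k = 1:numel(demo_files)
    fid = fopen(fullfile(demo_dir, demo_files{k}), 'w');
    fprintf(fid, 'placeholder text\n');
    fclose(fid);
end

% List the directory and drop '.', '..' and any subdirectories.
listing = dir(demo_dir);
listing = listing(~[listing.isdir]);

% The number of documents comes from the listing, never from a constant.
names = sort({listing.name});
ndocs = numel(names);

for k = 1:ndocs
    fprintf(1, '%d: %s\n', k, fullfile(demo_dir, names{k}));
end

% Clean up the scratch directory and its contents.
rmdir(demo_dir, 's');
```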

Addendum

Here are a few files quasi-related to this assignment.

Notice that in the diary files I periodically would enter the command "ccc". This is a home-brewed one that I have in my Matlab startup.m file. It basically cleans up everything, including compiled functions, etc.:

warning('off', 'all'); clear all; clear global; clear functions; clear java; clear classes; warning('on', 'all'); hold off; close all; clc;

I turn off warnings because if you try to clear something like global variables and there aren't any, Matlab prints a warning. If you do the same, always be sure to turn them back on as I do above. You can use the same trick for places in your code where you get an ignorable warning message.
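
If you use the trick inside your own code, a slightly more careful variant is to save the current warning settings and restore exactly those afterwards, instead of switching everything back on. This is a sketch of that variant, not what my startup.m does; the message identifier 'demo:ignorable' is made up and stands in for whatever step produces the warning you want to ignore.

```matlab
% Save the warning settings currently in force.
old_state = warning;

% Silence all warnings around the noisy step.
warning('off', 'all');

% Hypothetical noisy step: this warning is raised but not displayed.
warning('demo:ignorable', 'This message is suppressed.');

% Put the settings back exactly as they were.
warning(old_state);
```

The difference from warning('on', 'all') is that any warnings you had deliberately switched off beforehand stay off.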

A diary file and the script files used in class on 17 and 19 March. The names are admittedly cryptic but I wanted ones that I could not foul up when typing while standing up. Also, I added a few comments to them to help explain what they are for.

Another script, e.m, which I wrote while helping someone from class debug things, might be helpful to others. If you want to download these, here is a single tarball of the scripts. It is named "19March2008.tar", so if your OS doesn't like file names that start with a number, rename it on download.