Text processing and analysis in Matlab


The last project involves designing and building a complete Matlab code from scratch. The underlying motivations are

The singular value decomposition (SVD)

Under different names such as principal component analysis (PCA) or multi-dimensional analysis (MDA), the SVD is widely used in statistics packages and for a surprisingly wide range of problems. It can be used for categorization, cluster analysis, or information retrieval. Applications described in class included trying to discern whether early Homo habilis had some mental concept of a "perfect" hand-axe when making stone tools, reducing the amount of data required to specify state information in plasma physics, and page-ranking in Web search engines.
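As a small illustration of the low-rank idea, the sketch below truncates the SVD of a matrix to its k largest singular values. The matrix A is made-up data chosen only for illustration, not anything from the project files, and the choice k = 2 is arbitrary.

```matlab
% Rank-k approximation of a small matrix using the SVD.
% A is made-up illustration data, not project data.
A = [2 0 1 0;
     1 1 0 0;
     0 2 0 1;
     0 0 3 1;
     1 0 0 2];

[U, S, V] = svd(A, 'econ');          % A = U*S*V'
k = 2;                               % number of singular values to keep (arbitrary)
Ak = U(:, 1:k) * S(1:k, 1:k) * V(:, 1:k)';

s = diag(S);                         % singular values, largest first
err = norm(A - Ak, 2);               % 2-norm error of the approximation

fprintf('singular values: %s\n', mat2str(s', 4));
fprintf('rank(Ak) = %d, error = %.4f, s(k+1) = %.4f\n', rank(Ak), err, s(k+1));
```

The 2-norm error of the truncated matrix equals the first discarded singular value, which is one way to judge how much is lost for a given k.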


Code design and development

The code design was done quasi-formally and interactively in class. As a review, the stages were roughly:

  1. First get an English statement of the problem to be solved, and what the code is expected to do. In a more formal software methodology this is the "requirements analysis" phase. For us, the statement was to explore using low-rank approximations of a term-document matrix for text retrieval.

  2. Develop an English language statement of what steps should be taken to accomplish the task.

  3. Break down the tasks further, and start defining what subtasks should become functions or other m-files. As an example, being able to read in a file containing text and process it further was needed in three places, so that was a natural candidate for a separate function.

  4. Define the interfaces: What should the inputs and outputs of the functions and scripts be? Only at this stage is it necessary to start considering specifics in terms of Matlab capabilities. E.g., for a list of strings, each containing the name of a document file that we need to open and read, should the list be implemented as a character array or a cell array in Matlab? [Beware: formal software methodologies would typically not consider specific languages until the next step, but for most of what you will need in your lab work, this is the phase to start thinking of what Matlab can and cannot do well.]

  5. Implementation: Make remaining decisions like cell versus array, and write each function separately with its own driver that can test the function without requiring all the rest of the system. [This is called "unit testing" by software engineers.]
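The cell-versus-array question in step 4 and the driver idea in step 5 can be sketched as follows. The file names and the function name splitwords are hypothetical, invented for this example. In the project such a function would live in its own m-file; it is written here as an anonymous function only so the sketch runs as a single script.

```matlab
% Two ways to hold a list of document file names (names are hypothetical).
namesChar = char('austen.txt', 'doyle.txt', 'twain.txt');  % 3-by-10 char array
namesCell = {'austen.txt', 'doyle.txt', 'twain.txt'};      % 1-by-3 cell array

% Rows of a char array are blank-padded to equal length, so trim before use.
second = strtrim(namesChar(2, :));
assert(strcmp(second, namesCell{2}));

% Hypothetical function under test: split text into lowercase words.
splitwords = @(txt) regexp(lower(txt), '[a-z]+', 'match');

% Driver: feed known input and compare with an answer worked out by hand.
words = splitwords('The cat, the hat!');
expected = {'the', 'cat', 'the', 'hat'};
assert(isequal(words, expected), 'splitwords failed on punctuation test');
assert(isempty(splitwords('')), 'splitwords failed on empty input');
disp('all splitwords tests passed');
```

The padding in the character array is the main practical difference: every row must have the same length, whereas each cell can hold a name of any length.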

In a perfect world, you then write a small script that carries out the process defined in the second step above, and declare victory. In practice, you will need to debug and test the system, and will find gaps or (if you are lucky) new research questions that can be answered by modifying or extending the code. This means some iterating among the steps above.
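A minimal sketch of what such a small script might look like is given below, under several assumptions: the documents are made-up one-line strings held in memory rather than the project's text files, words are split with a simple regular expression, and documents are ranked against a query by cosine similarity in the reduced space. None of these choices is prescribed by the project; they are only one plausible way to fill in the steps.

```matlab
% Made-up miniature "documents"; the project reads its documents from files.
docs = {'the whale and the sea', ...
        'the ship sailed the sea', ...
        'a detective in london', ...
        'the detective solved the case in london'};

% Tokenize each document and collect the vocabulary.
tokens = cellfun(@(d) regexp(lower(d), '[a-z]+', 'match'), docs, ...
                 'UniformOutput', false);
vocab = unique([tokens{:}]);             % sorted list of distinct terms
nterms = numel(vocab);

% Term-document matrix: A(i,j) = count of term i in document j.
A = zeros(nterms, numel(docs));
for j = 1:numel(docs)
    [~, idx] = ismember(tokens{j}, vocab);
    A(:, j) = accumarray(idx(:), 1, [nterms 1]);
end

% Rank-k factors from the SVD (k chosen arbitrarily for the sketch).
k = 2;
[U, S, V] = svd(A, 'econ');
Uk = U(:, 1:k);  Sk = S(1:k, 1:k);  Vk = V(:, 1:k);

% Build the query vector, ignoring query words not in the vocabulary.
query = 'detective london';
[~, qidx] = ismember(regexp(lower(query), '[a-z]+', 'match'), vocab);
qhit = qidx(qidx > 0);
q = accumarray(qhit(:), 1, [nterms 1]);

% Compare query and documents in the k-dimensional space (assumed: cosine).
qk = Uk' * q;                            % k-by-1 query coordinates
Dk = Sk * Vk';                           % k-by-ndocs document coordinates
scores = (qk' * Dk) ./ (norm(qk) * sqrt(sum(Dk.^2, 1)));
[~, ranking] = sort(scores, 'descend');
disp(ranking);                           % document indices, best match first
```

On a case this small the ranking can be checked by hand against what a reader would expect, which is a useful sanity test before moving to the full document set.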

Longer term, the code would evolve along with increased requirements or new capabilities in your area of expertise. You may not intend to do text analysis after this course, but as much as possible the components and process should give you a head start when dealing with any data that requires clustering/categorization/retrieval.


Project data specifics

The documents set up in text files are books from Project Gutenberg. If you want to find out how I selected, downloaded, and processed them, the procedure is documented, but you don't need to do that preparation. Overall numbers are

Something about that last number should raise suspicions! More quick facts:

The next step is to define the procedure the code should execute, followed by the interface definitions. Provided files include the required data files and a handful of implementations of functions and their test drivers.

More Stuff


History of this page:
  1. Started: 12 Apr 2009
  2. Modified: Tue 14 Apr 2009, 09:59 AM to include data file links.
  3. Modified: Tue 14 Apr 2009, 10:56 AM to include provided files links.
  4. Modified: Fri 17 Apr 2009, 01:40 PM to mention the ISO files.
  5. Modified: Fri 24 Apr 2009, 05:04 PM to add help on computing word counts.
  6. Modified: Fri 24 Apr 2009, 05:04 PM to include a link to fileprocessing.html
  7. Last Modified: Fri 24 Apr 2009, 05:06 PM