Text processing and analysis in Matlab
The last project involves designing and building a complete Matlab code
from scratch. The underlying motivations are
- to follow a quasi-formal process for how to define
and develop a code from scratch,
- to see how to use Matlab for reading and analyzing
text or mixed text and numeric files,
- to learn how to use the singular value decomposition
(SVD) for analyzing text.
The singular value decomposition (SVD)
Under different names like principal component analysis (PCA) or
multi-dimensional analysis (MDA), the SVD is used widely in statistics
packages and for a surprisingly wide range of problems. It can be
used for categorization, cluster analysis, or information retrieval.
Applications described in class included trying to discern if early
Homo habilis had some mental concept of a "perfect" hand-axe
when making stone tools, reducing the amount of data required to
specify state information in plasma physics, and page-ranking in
Web search engines.
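As a rough illustration of the low-rank idea behind these applications, the
sketch below builds a tiny term-document matrix and truncates its SVD. The
matrix entries, the variable names, and the choice k = 2 are made up for
illustration; they are not project data or a prescribed method.

```matlab
% Toy term-document matrix: rows are terms, columns are documents.
% The counts are invented for illustration only.
A = [2 0 1 0;
     1 0 2 0;
     0 3 0 1;
     0 1 0 2;
     1 1 1 1];

% Economy-size SVD, so that A = U*S*V'.
[U, S, V] = svd(A, 'econ');

% Keep only the k largest singular values (rank-k approximation).
k = 2;
Ak = U(:, 1:k) * S(1:k, 1:k) * V(:, 1:k)';

% Relative error of the approximation in the Frobenius norm.
relerr = norm(A - Ak, 'fro') / norm(A, 'fro');
fprintf('rank-%d approximation, relative error %.3f\n', k, relerr);
```

The error of the rank-k approximation is determined by the singular values
that were dropped, which is why inspecting diag(S) is a reasonable first
step when choosing k.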
Code design and development
The code design was done quasi-formally and interactively in class. As a
review, the stages were roughly:
- First get an English statement of the problem to be solved, and what
the code is expected to do. In a more formal software methodology this is
the "requirements analysis" phase. For us, the statement was to explore
using low-rank approximations of a term-document matrix for text retrieval.
- Develop an English language statement of what steps should be taken
to accomplish the task.
- Break down the tasks further, and start defining what subtasks should
become functions or other m-files. As an example, being able to read in a
file containing text and process it further was needed in three places, so
that was a natural candidate for a separate function.
- Define the interfaces: What should the input and outputs of the
functions and scripts be? Only at this stage is it necessary to start
considering specifics in terms of Matlab capabilities. E.g., for a list
of strings, each containing the name of a document file that we need to
open and read, should the list be implemented as a character array or a
cell array in Matlab? [Beware: formal software methodologies would typically
not consider specific languages until the next step, but for
most of what you will need in your lab work, this is the phase to start
thinking of what Matlab can and cannot do well.]
- Implementation: Make remaining decisions like cell versus array,
and write each function separately with its own driver that can
test the function without requiring all the rest of the system. [This is
called "unit testing" by software engineers.]
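The character array versus cell array question from the interface step can
be made concrete with a short sketch. The two file names are the ones
mentioned later on this page; the variable names are hypothetical.

```matlab
% Character array: every row must have the same length, so shorter
% names are padded with trailing blanks.
namesChar = char('135.txt', '16119.txt');
disp(size(namesChar));                 % 2 rows, 9 columns
name1 = strtrim(namesChar(1, :));      % padding must be stripped before use

% Cell array: each entry keeps its own length.
namesCell = {'135.txt', '16119.txt'};
name2 = namesCell{1};                  % no padding to remove

% Looping over the list of document names.
for i = 1:numel(namesCell)
    fprintf('document %d: %s\n', i, namesCell{i});
end
```

With file names of differing lengths the cell array avoids the padding
bookkeeping, at the cost of using curly-brace indexing to get at the
strings.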
In a perfect world, you then write a small script that carries out the process
defined in the second step above, and declare victory. In practice, you will
need to debug and test the system, and will find gaps or (if you are lucky)
new research questions that can be answered by modifying or extending the code.
This means some iterating among the steps above.
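A driver for a single function, as described in the implementation step, can
be very small. The sketch below is only an assumption about what one might
look like: text2words is a hypothetical helper name, and it is written as an
anonymous function so that the example runs as one script, whereas in the
project it would live in its own m-file.

```matlab
% Hypothetical helper: split a string into lower-case words.
% Only the letters a-z are treated as word characters, so accented
% letters (as in the ISO-8859 files) would split a word in two.
text2words = @(txt) regexp(lower(txt), '[a-z]+', 'match');

% Driver: exercise the function on small inputs with known answers,
% without needing the rest of the system.
words = text2words('The cat, the hat; THE END.');
expected = {'the', 'cat', 'the', 'hat', 'the', 'end'};
assert(isequal(words, expected), 'tokenizing a simple sentence failed');

% Edge case: input with no letters should give an empty result.
assert(isempty(text2words('123 ... !!')), 'no-letters case failed');

disp('text2words: all driver checks passed');
```

Keeping the driver next to the function makes it cheap to rerun the checks
each time the iteration described above changes the function.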
Longer term, the code would evolve along with increased requirements or
new capabilities in your area of expertise. You may not intend to do
text analysis after this course, but as much as possible the components and
process should give you a head start when dealing with any data that
requires clustering/categorization/retrieval.
Project data specifics
The documents are text files of books from Project Gutenberg.
How I selected, downloaded, and processed them is documented if you want
to see the procedure, but you don't need to do that preparation.
Overall numbers are
- 80 documents (books)
- 1.2 million lines
- 11 million non-unique words
- 2.1 million unique words
Something about that last number should raise suspicions!
More quick facts:
- The file freqs is a word-count file for
the entire collection. It's been sorted in decreasing frequency of
appearance of words, and if multiple words have the same count, they
are further sorted into alphabetical order.
- The file wc has the number of lines,
words, and characters for each file, with a total at the bottom.
- The file titles has the title lines
extracted, but beware that for multiline titles only the first
line is present in it.
- Added on 15 Apr 2009: two of the books are encoded using
ISO-8859, which provides some European accent marks.
Those are 135.txt and 16119.txt. All the others are plain
ASCII encoded. If this causes problems, feel free to exclude
those two texts. Book 135.txt is "Les Misérables", which has
the French accent aigu in the internal title. The other has
the accent marks only in its text.
[Quasi-relevant info: your browser may or may not show accent marks
and more generally, foreign language alphabets like Cyrillic, unless
you have explicitly added support for them. You can enter them in
HTML directly.]
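The ordering described for the freqs file (decreasing count, ties broken
alphabetically) can be reproduced in Matlab along the following lines. This
is a sketch on a made-up word list, not necessarily how freqs itself was
generated; all variable names are hypothetical.

```matlab
% Toy word list standing in for the tokenized collection.
words = {'the', 'cat', 'and', 'the', 'dog', 'and', 'the', 'bird'};

% unique returns the distinct words in alphabetical order; idx maps
% each original word to its position in that list.
[vocab, ~, idx] = unique(words);
counts = accumarray(idx(:), 1);

% Sort by decreasing count; the second column breaks ties using the
% alphabetical position that unique produced.
[~, order] = sortrows([-counts, (1:numel(counts))']);
counts = counts(order);
vocab = vocab(order);

% Print one "count word" pair per line.
for i = 1:numel(vocab)
    fprintf('%6d %s\n', counts(i), vocab{i});
end
```

The explicit tie-break column avoids relying on how sort treats equal
elements.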
The next step is to define the procedure the code should execute,
followed by the interface definitions.
Provided files include the required data files and a handful of
implementations of functions and their test drivers.
More Stuff
History of this page:
- Started: 12 Apr 2009
- Modified: Tue 14 Apr 2009, 09:59 AM to include data file links.
- Modified: Tue 14 Apr 2009, 10:56 AM to include provided files
links.
- Modified: Fri 17 Apr 2009, 01:40 PM to mention the ISO files.
- Modified: Fri 24 Apr 2009, 05:04 PM to add help on computing
word counts.
- Modified: Fri 24 Apr 2009, 05:04 PM to include a link to
fileprocessing.html
- Last Modified: Fri 24 Apr 2009, 05:06 PM