          its  the  end  of   glazed  donut  drought
its        -    0    0    0     0       0      0
the        7    -    4    5     0       0      0
end        4    5    -    0     0       0      0
of         3    4    5    -     0       0      0
glazed     1    7    3    4     -       0      0
donut      0    5    2    3     5       -      0
drought    0    3    1    2     4       5      -
Next, a "meaning" vector is created for each word by concatenating its row vector and its column vector, thus capturing both the preceding and the following contexts of the word. For example, in the table above, the meaning vector for the becomes [7, 0, 4, 5, 0, 0, 0, 0, 0, 5, 4, 7, 5, 3].
The meaning vectors can now be compared. For this we need a similarity measure that is insensitive to the overall magnitudes of the vectors, which reflect word frequency rather than meaning; cosine similarity is such a measure. Given our somewhat limited corpus of 8 word tokens and 7 word types, we don't get very interesting results. For example, the similarity between end and its, 0.542, is greater than the similarity between end and donut, 0.326.
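These figures can be checked with a few lines of Python. The matrix below is copied from the table above (with the diagonal dashes as zeros), and meaning and cosine follow the definitions just described; the function names are mine:

```python
import math

# Co-occurrence matrix from the table above (diagonal '-' entries as 0).
words = ['its', 'the', 'end', 'of', 'glazed', 'donut', 'drought']
matrix = [
    [0, 0, 0, 0, 0, 0, 0],  # its
    [7, 0, 4, 5, 0, 0, 0],  # the
    [4, 5, 0, 0, 0, 0, 0],  # end
    [3, 4, 5, 0, 0, 0, 0],  # of
    [1, 7, 3, 4, 0, 0, 0],  # glazed
    [0, 5, 2, 3, 5, 0, 0],  # donut
    [0, 3, 1, 2, 4, 5, 0],  # drought
]

def meaning(word):
    """Meaning vector: the word's row concatenated with its column."""
    i = words.index(word)
    return matrix[i] + [row[i] for row in matrix]

def cosine(v1, v2):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(v1, v2))
    return dot / (math.sqrt(sum(x * x for x in v1)) *
                  math.sqrt(sum(x * x for x in v2)))

print(round(cosine(meaning('end'), meaning('its')), 3))    # 0.542
print(round(cosine(meaning('end'), meaning('donut')), 3))  # 0.326
```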
To implement and test HAL, we need some data. For this, we will use the Brown corpus, which is accessible in NLTK.
To use the Brown corpus in NLTK, import it:

>>> from nltk.corpus import brown

Then you can use the corpus method words() to get the words in the corpus as a long list of 1,161,192 tokens (punctuation is treated as words). For more on using the corpora in NLTK, see Chapter 2 of the NLTK book.
The list you will have consists of word tokens; from it you need to extract the word types and then keep only the most frequent ones. I suggest working only with words that occur at least 100 times; there should be 1092 of these.
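One simple way to do this (the helper name is mine) is to count the tokens with a Counter; with the real corpus you would pass brown.words() instead of the toy list used here:

```python
from collections import Counter

def frequent_types(tokens, min_count=100):
    """Return the word types that occur at least min_count times, sorted."""
    counts = Counter(tokens)
    return sorted(w for w, c in counts.items() if c >= min_count)

# With the Brown corpus this would be: frequent_types(brown.words())
tokens = ['the', 'end', 'of', 'the', 'drought', 'the', 'end']
print(frequent_types(tokens, min_count=2))  # ['end', 'the']
```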
Next, for each of the word types, you need to create a meaning vector. In my version, I created a class for the whole matrix and a class for each word (though you don't have to do it this way). The following initializes the matrix with a row and column for each word. It also records where each word is, so that words' rows and columns can be looked up by their strings and the system will know which words are not in the matrix at all (like 'python' in this example). (This assumes that words is bound to the list of frequent word types.)
>>> memory = Memory(words)
>>> memory.get_position('make')
273
>>> memory.get_position('python')
-1
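Here is a minimal sketch of what such a Memory class might look like; only the parts needed for get_position are shown, and the internals are assumptions, not the actual implementation:

```python
class Memory:
    """Co-occurrence matrix over a fixed vocabulary of frequent word types."""

    def __init__(self, words):
        self.words = list(words)
        # Map each word to its row/column index for fast lookup.
        self.positions = {w: i for i, w in enumerate(self.words)}
        n = len(self.words)
        # n x n matrix of co-occurrence counts, initially all zero.
        self.matrix = [[0] * n for _ in range(n)]

    def get_position(self, word):
        """Index of word in the matrix, or -1 if it is not a frequent type."""
        return self.positions.get(word, -1)

memory = Memory(['end', 'of', 'the'])
print(memory.get_position('the'))     # 2
print(memory.get_position('python'))  # -1
```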
Next we need to record the co-occurrence statistics for the text, that is, the whole original list of word tokens, yielding something like what appears in the table above. For each of the frequent word types, you find each of its occurrences in the text, and then, within the window of size WINDOW before it, you record its co-occurrences with other frequent word types, weighted by their distance from the word. I used a WINDOW of 5.

Unless you implement this in a clever way, it will take a very long time to go through all the words, possibly hours, so you should start with a much smaller text while debugging your program.
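As a sketch of the recording step, the snippet below uses the weighting scheme that is consistent with the toy table above: a context word at distance d within the window contributes WINDOW - d + 1, so an adjacent word counts 5 and a word five positions away counts 1. (The standalone function and its names are mine; in a class-based version this would be a method updating self.matrix.)

```python
WINDOW = 5

def record(tokens, vocab):
    """Accumulate weighted co-occurrence counts for the words in vocab.

    For each token, look back up to WINDOW positions; a context word at
    distance d contributes a weight of WINDOW - d + 1.
    """
    pos = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    matrix = [[0] * n for _ in range(n)]
    for i, word in enumerate(tokens):
        row = pos.get(word, -1)
        if row < 0:
            continue          # not a frequent type
        for d in range(1, WINDOW + 1):
            if i - d < 0:
                break         # ran off the start of the text
            col = pos.get(tokens[i - d], -1)
            if col >= 0:
                matrix[row][col] += WINDOW - d + 1
    return matrix

vocab = ['its', 'the', 'end', 'of', 'glazed', 'donut', 'drought']
m = record('its the end of the glazed donut drought'.split(), vocab)
# Matches the 'glazed' row of the table (the diagonal is 0 here, '-' there):
print(m[vocab.index('glazed')])  # [1, 7, 3, 4, 0, 0, 0]
```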
>>> memory.record(text)
>>> memory.get_word('make')
[0, 1, 1, 1, 2, 0, 0, 12, 0, 2, 0, 20, 0, 0, 0, 0, 0, 267, 5, 33, 20, 0, 1, 0, 1, 0, 27, 8, 1, 43, 0, 2, 22, 0, 0, 12, 0, 0, 23, 0, 0, 0, 0, 0, 13, 0, 3, 0, 4, 4, 15, 4, 0, 4, 0, 0, 0, 0, 0, 1, ...]
Finally, we need to make the meaning vector for each word by concatenating its row vector and column vector in the matrix:
>>> memory.make_meanings()
>>> memory.get_word('make').meaning
[0, 1, 1, 1, 2, 0, 0, 12, 0, 2, 0, 20, 0, 0, 0, 0, 0, 267, 5, 33, 20, 0, 1, 0, 1, 0, 27, 8, 1, 43, 0, 2, 22, 0, 0, 12, 0, 0, 23, 0, 0, 0, 0, 0, 13, 0, 3, 0, 4, 4, 15, 4, 0, 4, 0, 0, 0, 0, 0, 1, ..., 0, 0, 6, 0, 0, 5, 0, 17, 3, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 10, 111, 0, 0, 0, 50, 0, 0, 0, 8, 0, 0, 6, 0, 3, 0, 0, 0, 4, 0, 0, 1, 0, 0, 7, 8, 1, 6, 0, 2]
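With the matrix stored as a list of rows, the concatenation can be sketched in one pass (the function name and standalone form are mine):

```python
def make_meanings(matrix):
    """Meaning vector for each word: its row concatenated with its column."""
    n = len(matrix)
    return [matrix[i] + [matrix[j][i] for j in range(n)]
            for i in range(n)]

# Tiny 2-word illustration: row of word 0 is [0, 2], its column is [0, 3].
matrix = [[0, 2],
          [3, 0]]
print(make_meanings(matrix))  # [[0, 2, 0, 3], [3, 0, 2, 0]]
```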
Now we can compare the meaning vectors for words to see how well the algorithm works. We would expect word pairs like 'man'/'woman' and 'a'/'the' to be relatively similar, for example. Here is a function you can use that returns the cosine similarity between two vectors.
import math

def cosine(vec1, vec2):
    '''Cosine similarity between two vectors (sequences).'''
    len1 = math.sqrt(sum(x**2 for x in vec1))
    len2 = math.sqrt(sum(x**2 for x in vec2))
    dot = sum(x*y for x, y in zip(vec1, vec2))
    return dot / (len1 * len2)

Of course, how successful the compare function is depends on the amount of data that went into the computation.
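The compare function used below can be a thin wrapper that looks up the two meaning vectors and applies cosine. This sketch stores the meanings in a plain dict; in a class-based version it would be a method consulting the word objects (cosine is repeated here only to keep the snippet self-contained):

```python
import math

def cosine(vec1, vec2):
    """Cosine similarity between two equal-length vectors."""
    len1 = math.sqrt(sum(x**2 for x in vec1))
    len2 = math.sqrt(sum(x**2 for x in vec2))
    return sum(x*y for x, y in zip(vec1, vec2)) / (len1 * len2)

def compare(meanings, word1, word2):
    """Cosine similarity between the meaning vectors of two words."""
    return cosine(meanings[word1], meanings[word2])

# Illustrative vectors, not real Brown-corpus data.
meanings = {'man': [1, 0, 2], 'woman': [2, 0, 3]}
print(round(compare(meanings, 'man', 'woman'), 3))  # 0.992
```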
>>> memory.compare('man', 'woman')
0.95833361171489606
>>> memory.compare('man', 'believe')
0.43955269305711103
>>> memory.compare('think', 'believe')
0.82563278902519588
Select 20 pairs of words to test, show how your program responds, and comment on the results.