B659 Project Recap

We crawled NNDB to retrieve 11,800+ names. An example of a page that was crawled: http://www.nndb.com/lists/493/000063304/.

After compiling the list, we cleaned each name by removing punctuation and non-English letters and converting it to lower case.
The list below displays examples of the "cleaning" process.

Name ID
a a milne 1
a bruce bielaski 2
a e housman 3
a e van vogt 4
a edward newton 5
a j foyt 6
a j liebling 7
a j mclean 8
a owsley stanley 9
a p j kalam 10
a whitney brown 11
aaliyah 12
aamir khan 13
... ...
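
The cleaning step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `clean_name` is hypothetical, and folding accented letters to their closest ASCII letter (rather than dropping them outright) is an assumption about how non-English letters were handled.

```python
import re
import unicodedata

def clean_name(raw):
    """Lower-case a name and reduce it to plain letters, digits and single spaces."""
    # Assumption: accented letters are folded to their base ASCII letter.
    decomposed = unicodedata.normalize("NFKD", raw)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    lowered = ascii_only.lower()
    # Every run of punctuation or whitespace becomes one space.
    spaced = re.sub(r"[^a-z0-9]+", " ", lowered)
    return spaced.strip()

print(clean_name("A. A. Milne"))      # a a milne
print(clean_name("Barbara O'Neil"))   # barbara o neil
```

Replacing punctuation with a space instead of deleting it keeps initials and name fragments separate, which matches entries such as "a a milne" and "barbara o neil" in the lists on this page.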

We randomly selected 100 names from the list of 11,800+ names.
The list below displays the beginning and ending names in our study set.

Name ID
abraham benrubi 41
adrian lyne 94
albert schweitzer 219
alvaro uribe 365
angela lansbury 510
anna lindh 556
ariel durant 675
audrey hepburn 794
bam margera 835
barbara o neil 864
... ...
ron wood 9832
roy innis 9908
rufus putnam 9944
sepp dietrich 10198
shelby lynne 10260
sofia vergara 10384
sylvester stallone 10695
ted koppel 10755
terrell owens 10795
tom wopat 11099
tommaso campanella 11100
toots thielemans 11159
vijay singh 11348
vince mcmahon 11362
wayne wang 11527
zane grey 11874
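
A random draw of this kind might look like the sketch below. The function name `pick_study_set`, the fixed seed, and the convention that a name's ID is its 1-based position in the cleaned list are all assumptions made for illustration.

```python
import random

def pick_study_set(names, k=100, seed=0):
    """Return k (id, name) pairs drawn uniformly at random, sorted by ID."""
    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    ids = rng.sample(range(1, len(names) + 1), k)  # sampling without replacement
    return [(i, names[i - 1]) for i in sorted(ids)]

# Stand-in list; the real list held the 11,800+ cleaned NNDB names.
placeholder_names = ["name %d" % i for i in range(1, 11801)]
study_set = pick_study_set(placeholder_names)
print(len(study_set))
```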

 

Creating the Retrieve Set

For each name we performed a query using the Google Web API like the following:
http://www.google.com/search?hl=en&q=abraham+benrubi
We retrieved the top ten URLs for each person in our study set.
These 1,000 pages served as the seed set from which we built our vicinity graph of 53,532 pages.
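
As a rough sketch, the query URL for a name can be built with the standard library, and the seed set is then the first ten results per name. The helper names `build_query_url` and `seed_set` are hypothetical, and the actual call to the Google Web API is deliberately left out because the client used in the project is not described here.

```python
from urllib.parse import urlencode

def build_query_url(name):
    """Build a search URL of the form shown above for a cleaned name."""
    return "http://www.google.com/search?" + urlencode({"hl": "en", "q": name})

def seed_set(results_by_name, top_n=10):
    """Keep the first top_n result URLs for each name.

    results_by_name maps a name to its ranked list of result URLs,
    however those results were obtained.
    """
    return {name: urls[:top_n] for name, urls in results_by_name.items()}

print(build_query_url("abraham benrubi"))
# http://www.google.com/search?hl=en&q=abraham+benrubi
```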

 

Creating the NNDB Collection

After creating our retrieve set, our next task was to build our NNDB collection.
We crawled NNDB to get the home page for each person in our study set.
Below is an example of the home page for one person in our study set:
http://www.nndb.com/people/147/000024075/
We parsed this page for all NNDB outlinks and used the Google Web API to gather all NNDB inlinks. For each person, the inlinking pages, the home page, and the outlinked pages were grouped together and used as the comparison information against our retrieve set.
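
Collecting the NNDB outlinks from a home page could be done along these lines. This is only a sketch using Python's standard HTML parser; the class and function names are hypothetical, the sample HTML and the `/people/001/000000001/` path are made up, and gathering inlinks is not shown.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href="..."> on a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            self.links.append(urljoin(self.base_url, href))

def nndb_outlinks(html, base_url):
    """Return the distinct links on the page that point into nndb.com."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    outlinks = []
    for url in collector.links:
        host = urlparse(url).netloc.lower()
        on_nndb = host == "nndb.com" or host.endswith(".nndb.com")
        if on_nndb and url != base_url and url not in outlinks:
            outlinks.append(url)
    return outlinks

# Made-up page content for illustration only.
sample = ('<a href="/people/001/000000001/">someone</a>'
          '<a href="http://example.com/x">elsewhere</a>')
print(nndb_outlinks(sample, "http://www.nndb.com/people/147/000024075/"))
```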

Retrieve Set vs. NNDB Collection Comparison

 

Extracting Information from the Crawled Pages

For both the retrieve set and the NNDB collection, we stripped the HTML code from each page and applied the same modifications that were used to "clean" the list of NNDB names. We then parsed the cleaned text to collect term and name occurrence information. For each web page we stored information like the following:

Term Occurrence
a 46
about 3
absolute 1
action 1
active 1
added 1
after 4
against 1
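
A term count like the one above can be produced with a short sketch such as this. The names `TextExtractor` and `term_counts` are hypothetical, and the tokenizer is a simplification: it keeps only runs of the letters a to z and does not reproduce every cleaning rule applied to the name list.

```python
import re
from collections import Counter
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Gather the visible text of a page, skipping <script> and <style> bodies."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)

def term_counts(html):
    """Strip the markup, lower-case the text and count each term."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()  # flush any text still buffered by the parser
    text = " ".join(extractor.chunks).lower()
    return Counter(re.findall(r"[a-z]+", text))

page = "<html><body><p>After the action, a break. A pause after.</p></body></html>"
print(term_counts(page)["after"])  # 2
```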

Similarly, we counted the number of NNDB name occurrences for each page:

Name Occurrence
dick vitale 1
john madden 1
vijay singh 1
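
Counting name occurrences can be sketched as a whole-phrase match against the cleaned page text. The function name `name_counts` is hypothetical, and matching on word boundaries is an assumption about how a name occurrence was defined. Scanning every name against every page this way is slow for a full list, so it is meant only as an illustration.

```python
import re
from collections import Counter

def name_counts(clean_text, names):
    """Count how often each cleaned name occurs as a whole phrase in the text."""
    text = " ".join(clean_text.split())  # collapse whitespace to single spaces
    counts = Counter()
    for name in names:
        hits = len(re.findall(r"\b" + re.escape(name) + r"\b", text))
        if hits:
            counts[name] = hits
    return counts

text = "vijay singh beat the field while john madden and dick vitale watched"
names = ["dick vitale", "john madden", "vijay singh", "a a milne"]
print(sorted(name_counts(text, names).items()))
# [('dick vitale', 1), ('john madden', 1), ('vijay singh', 1)]
```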

We kept track of which pages were associated with each person so that we could total the term and name occurrence information per person. From these totals we calculated TFIDF weights, using both terms and name occurrences, for the group of 100 people in our study set.
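
The exact weighting formula is not stated on this page, so the sketch below uses one common textbook variant and should not be read as the formula behind the values shown here. It assumes each person is treated as a single document, with term frequency taken as a share of that person's total count and inverse document frequency as log(N / df). The function name `tfidf` is hypothetical.

```python
import math
from collections import Counter

def tfidf(counts_by_person):
    """Weight each person's term (or name) counts by TF x IDF.

    counts_by_person maps a person to a dict of key -> total occurrences
    across all pages associated with that person.
    """
    n_people = len(counts_by_person)
    # Document frequency: how many people have each key at least once.
    df = Counter()
    for counts in counts_by_person.values():
        df.update(counts.keys())
    weights = {}
    for person, counts in counts_by_person.items():
        total = sum(counts.values())
        weights[person] = {
            key: (count / total) * math.log(n_people / df[key])
            for key, count in counts.items()
        }
    return weights

# Toy counts, not project data.
toy = {"p1": {"golf": 3, "the": 1}, "p2": {"the": 2}}
print(tfidf(toy)["p1"]["the"])  # 0.0, since "the" appears for every person
```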

Example of the TFIDF name occurrence information stored for one of the people in the study set:

Name TFIDF Value
a edward newton 0.03196037015356512
aaron burr 0.4869060267550306
aaron neville 0.01415243384770792
abner louima 0.6339197437903559
abraham lincoln 0.0018007639047847943
... ...

Similar information was captured for each person in the study set using terms:

Term TFIDF Value
a 4.2812067984224346E-5
aa 5.491060242742253E-5
aaa 6.114144551654769E-4
aac 0.011460631497106296
aacut 8.935352460303257E-5
aafaa 1.9609866681406571
aaland 4.61512051684126
... ...

Using the TFIDF information from the retrieve set, we calculated the cosine similarity between pairs of people in the study set (for both name occurrences and terms) and compared this with the information from the NNDB collection.
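
Cosine similarity between two people's weight vectors can be computed directly on sparse dictionaries, as in this minimal sketch. The function name `cosine_similarity` and the choice to return 0.0 for an empty vector are assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    # Only keys present in both vectors contribute to the dot product.
    dot = sum(weight * b[key] for key, weight in a.items() if key in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot / (norm_a * norm_b)

# Toy weights, not project data.
person_a = {"golf": 0.5, "masters": 0.2}
person_b = {"golf": 0.4, "film": 0.9}
print(cosine_similarity(person_a, person_b))
```

The same function works for name-occurrence vectors and term vectors, since both are stored as key to weight mappings.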

Additionally, we measured the similarity between the name occurrences found in the retrieve set and those found in the NNDB collection.