We crawled NNDB to retrieve 11,800+ names. An example of a crawled page: http://www.nndb.com/lists/493/000063304/.
After compiling the list, we cleaned each name by removing punctuation and
non-English letters and converting it to lower case.
The list below shows examples of names after the
"cleaning" process.
| Name | ID |
| a a milne | 1 |
| a bruce bielaski | 2 |
| a e housman | 3 |
| a e van vogt | 4 |
| a edward newton | 5 |
| a j foyt | 6 |
| a j liebling | 7 |
| a j mclean | 8 |
| a owsley stanley | 9 |
| a p j kalam | 10 |
| a whitney brown | 11 |
| aaliyah | 12 |
| aamir khan | 13 |
| ... | ... |
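The cleaning step described above could be sketched as follows. This is a minimal illustration, not the original code: the function name `clean_name` is hypothetical, and it assumes that punctuation is replaced by a space (as in "barbara o neil") and that accented letters are folded to their ASCII base letter rather than dropped.

```python
import re
import unicodedata


def clean_name(raw: str) -> str:
    """Lower-case a name and strip punctuation and non-English letters."""
    # Assumption: accented letters are folded to their base letter (e.g. "Á" -> "A").
    decomposed = unicodedata.normalize("NFKD", raw)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Replace every run of characters that is not a letter or digit with one space.
    spaced = re.sub(r"[^A-Za-z0-9]+", " ", ascii_only)
    # Lower-case and collapse any leading, trailing, or repeated whitespace.
    return " ".join(spaced.lower().split())
```

For example, `clean_name("A. A. Milne")` gives `"a a milne"`, matching the first row of the list.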
We randomly selected 100 names from the list of 11,800+ names.
The list below displays the beginning and ending names in our study set.
| Name | ID |
| abraham benrubi | 41 |
| adrian lyne | 94 |
| albert schweitzer | 219 |
| alvaro uribe | 365 |
| angela lansbury | 510 |
| anna lindh | 556 |
| ariel durant | 675 |
| audrey hepburn | 794 |
| bam margera | 835 |
| barbara o neil | 864 |
| ... | ... |
| ron wood | 9832 |
| roy innis | 9908 |
| rufus putnam | 9944 |
| sepp dietrich | 10198 |
| shelby lynne | 10260 |
| sofia vergara | 10384 |
| sylvester stallone | 10695 |
| ted koppel | 10755 |
| terrell owens | 10795 |
| tom wopat | 11099 |
| tommaso campanella | 11100 |
| toots thielemans | 11159 |
| vijay singh | 11348 |
| vince mcmahon | 11362 |
| wayne wang | 11527 |
| zane grey | 11874 |
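A random selection like the one above could be drawn as in the sketch below. The function name `select_study_set` and the `seed` parameter are hypothetical, and the sketch assumes each ID is the 1-based position of the name in the full list, which is what the tables suggest.

```python
import random


def select_study_set(names, k=100, seed=None):
    """Pick k distinct names at random and return (ID, name) pairs sorted by ID."""
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    # Assumption: the ID is the 1-based position of the name in the full list.
    indexed = list(enumerate(names, start=1))
    # rng.sample draws without replacement, so no name is chosen twice.
    return sorted(rng.sample(indexed, k))
```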
For each name, we issued a query through the Google Web API, such as the following:
http://www.google.com/search?hl=en&q=abraham+benrubi
We retrieved the top ten URLs for each person in our study set.
These 1,000 pages form the seed set from which we built
our vicinity graph of 53,532 pages.
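The example query URL above can be rebuilt from a cleaned name as sketched below. The helper names `build_query_url` and `build_seed_set` are hypothetical, and the sketch does not call any search service: it assumes the ranked result URLs for each person have already been fetched by some other means.

```python
from urllib.parse import urlencode

# Endpoint and parameters are copied from the example query in the text.
SEARCH_ENDPOINT = "http://www.google.com/search"


def build_query_url(name, lang="en"):
    """Return the search URL for one cleaned name."""
    # urlencode escapes spaces as '+', as in the example query.
    return SEARCH_ENDPOINT + "?" + urlencode({"hl": lang, "q": name})


def build_seed_set(results_by_name, top_n=10):
    """Keep only the first top_n result URLs for each person."""
    return {name: urls[:top_n] for name, urls in results_by_name.items()}
```

With 100 people and ten URLs each, `build_seed_set` would yield at most the 1,000 seed pages mentioned above.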
After creating our retrieve set, the next task was to
build our NNDB Collection.
We crawled NNDB to get the home page for each person in our study
set.
Below is an example of the home page for one person in our study set:
http://www.nndb.com/people/147/000024075/.
We parsed this page for all NNDB outlinks and used the Google Web API to gather
all NNDB inlinks. The inlinks, home page, and outlinks for a particular
person were grouped together to serve as the comparison information against
our retrieve set.
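Extracting the NNDB outlinks from a home page could look like the sketch below, which uses only the Python standard library. The names `LinkCollector` and `nndb_outlinks` are hypothetical, and treating any host equal to or ending in `nndb.com` as an NNDB link is an assumption. Gathering inlinks requires a search service and is not shown.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the absolute target of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for key, value in attrs:
            if key == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def nndb_outlinks(html, base_url):
    """Return the distinct NNDB links on a page, in order of appearance."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    outlinks = []
    for link in parser.links:
        host = urlparse(link).netloc.lower()
        # Assumption: an NNDB link is any link whose host is nndb.com or a subdomain.
        is_nndb = host == "nndb.com" or host.endswith(".nndb.com")
        if is_nndb and link != base_url and link not in outlinks:
            outlinks.append(link)
    return outlinks
```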
[Figure: Retrieve Set vs. NNDB Collection Comparison]
For both the retrieve set and the NNDB collection, we stripped the HTML code from each page and applied the same modifications that were used to "clean" the list of NNDB names. We then parsed the pages to collect term and name occurrence information. For each web page, we stored term counts like the following:
| Term | Occurrence |
| a | 46 |
| about | 3 |
| absolute | 1 |
| action | 1 |
| active | 1 |
| added | 1 |
| after | 4 |
| against | 1 |
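A term count like the one in this table could be produced as sketched below. The names `TextExtractor` and `term_counts` are hypothetical. The sketch assumes that script and style contents are discarded along with the tags, and that a term is any run of the letters a to z after lower-casing, so digits and non-English letters act as separators.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of a page, skipping script and style blocks."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def term_counts(html):
    """Strip the markup from a page and count how often each term occurs."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()  # flush any text still buffered by the parser
    text = " ".join(extractor.chunks).lower()
    # Assumption: a term is a maximal run of the letters a-z.
    return Counter(re.findall(r"[a-z]+", text))
```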
Similarly, we counted the number of NNDB name occurrences for each page:
| Name | Occurrence |
| dick vitale | 1 |
| john madden | 1 |
| vijay singh | 1 |
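Counting name occurrences could be sketched as below. The function name `name_counts` is hypothetical. The sketch assumes the page text has already been cleaned the same way as the names, and that a name only counts when it appears as whole words, so "john" inside "johnson" is not a match.

```python
import re
from collections import Counter


def name_counts(clean_text, names):
    """Count whole-word occurrences of each cleaned name in cleaned page text."""
    counts = Counter()
    for name in names:
        # Lookarounds require a non-alphanumeric character (or the text edge)
        # on both sides, so adjacent repeats of a name are each counted.
        pattern = r"(?<![a-z0-9])" + re.escape(name) + r"(?![a-z0-9])"
        found = len(re.findall(pattern, clean_text))
        if found:
            counts[name] = found
    return counts
```

Scanning every page once per name is slow for 11,800+ names; this sketch favours clarity over speed.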
We kept track of each page that was associated with a person so that we could total the term and name occurrence information for that person. From these totals, we calculated TFIDF values using both terms and name occurrences for the group of 100 people in our study set.
Example of the TFIDF name occurrence information stored for one of the people in the study set:
| Name | TFIDF Value |
| a edward newton | 0.03196037015356512 |
| aaron burr | 0.4869060267550306 |
| aaron neville | 0.01415243384770792 |
| abner louima | 0.6339197437903559 |
| abraham lincoln | 0.0018007639047847943 |
| ... | ... |
Similar information was captured for each person in the study set using terms:
| Term | TFIDF Value |
| a | 4.2812067984224346E-5 |
| aa | 5.491060242742253E-5 |
| aaa | 6.114144551654769E-4 |
| aac | 0.011460631497106296 |
| aacut | 8.935352460303257E-5 |
| aafaa | 1.9609866681406571 |
| aaland | 4.61512051684126 |
| ... | ... |
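The text does not state which TFIDF weighting was used, so the sketch below is only one common variant and is not expected to reproduce the values in the tables. It assumes the term frequency is a count divided by the person's total count, and the inverse document frequency is the natural logarithm of the number of people divided by the number of people whose pages contain the key. The function name `tfidf` is hypothetical, and the same function would apply to either term counts or name counts.

```python
import math
from collections import Counter


def tfidf(counts_by_person):
    """Map each person to a {key: weight} dict, where keys are terms or names."""
    n_people = len(counts_by_person)
    # Document frequency: in how many people's totals does each key occur?
    doc_freq = Counter()
    for counts in counts_by_person.values():
        doc_freq.update(key for key, count in counts.items() if count > 0)
    weights = {}
    for person, counts in counts_by_person.items():
        total = sum(counts.values())
        # Assumed weighting: (count / total) * ln(n_people / doc_freq).
        weights[person] = {
            key: (count / total) * math.log(n_people / doc_freq[key])
            for key, count in counts.items()
            if count > 0
        }
    return weights
```

Under this variant, a key that occurs for every person receives a weight of zero, since the logarithm of 1 is 0.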
Using the information from the retrieve set, we calculated the cosine similarity between two people (for both name occurrences and terms) in the study set and compared this with the information from the NNDB collection.
Additionally, we used the name occurrences to determine the similarity between name occurrences in the retrieve set and those found in the NNDB collection.
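The cosine similarity between two people can be computed from their TFIDF values as in this minimal sketch. The function name `cosine_similarity` is hypothetical, and each person is assumed to be represented as a dictionary mapping a term or name to its weight, with missing keys treated as zero.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse weight vectors stored as dicts."""
    # Only keys present in both vectors contribute to the dot product.
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty or all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)
```

Computing this once with the retrieve set vectors and once with the NNDB collection vectors gives two scores per pair of people that can then be compared.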