B659 - Project Blog


May 2, 2005

The project recap page details our methodology for the project.

The retrieve set consisted of 53,532 web pages that are associated with the 100 randomly selected people (these 100 people were labeled the study set) from NNDB. The NNDB set consisted of the 1,148 NNDB pages associated with the same 100 randomly selected people.

After talking with Fil about what measurements would be appropriate, we decided to calculate the f-measure using three different similarity methods. The graph below shows the results of calculating the f-measure using co-occurrence of names, cosine similarity of name occurrences, and cosine similarity of term occurrences as the three methods. The graph displays the effect of filtering the graph weights to determine how this changes the f-measure. For example, a question might be made as how will removing edges that have a small weight (i.e. two nodes are weakly connected) affect the f-measure.

We analyzed the intersection of the retrieve and study sets to make sure that there was not a significant overlap. As you can see from the image below there were 331 pages in common between the two sets.

The following two dendrograms were produced using the Jaccard coefficient based on name occurrences:

Dendrogram based on documents in the retrieve set.

Dendrogram based on documents in the NNDB set.

The following two dendrograms were produced using the cosine similarity based on name occurrences:

Dendrogram based on documents in the retrieve set.

Dendrogram based on documents in the NNDB set.

The following two dendrograms were produced using the cosine similarity based on term occurrences:

Dendrogram based on documents in the retrieve set.

Dendrogram based on documents in the NNDB set.

From the previous six graphs we can see how the different methods for determining similarity have clustered our study set differently. Unfortunately, none of the methods produce a measure of similarity that is considered extremely "close" as seen by the scale on Y-axis of the graph. Therefore, we decided to investigate further to see how NNDB clustered the study set.

The image below shows how the study set has been grouped based on the categories designated on NNDB. There are a total of 84 categories provided on NNDB and our study set represents 34 of them. The image below also displays the fact that the "Actor" and "Musical" groups are the most represented groups with 18 people each. Additionally, there are 26 groups that have either one or two people associated with that category. We believe that because our study set represents 41% (34 / 83) of the total number of categories with less than 1% (100 / 11,800+) of the total people, the task of clustering these people may be more difficult than if we reduced the number of categories.

Below are two graphs showing the NNDB groups along with the groups produced when using the Jaccard coefficient dendrograms to create clusters. We chose to see what clusters existed when we sliced the dendrogram with 34 clusters (the same as the number of NNDB categories) to see how well the Jaccard coefficient clustered based on the NNDB categories.

Comparison of NNDB groups vs. clusters created using Jaccard coefficient on retrieve set with name occurrences.

Comparison of NNDB groups vs. clusters created using cosine similarity on retrieve set with name occurrences.

Then, we compared how the group distribution of our study set compares with the distribution on NNDB. We see from the following two pie charts that we have very comparable distribution to NNDB:
All NNDB Groups (84 groups) Study set NNDB Groups (34 groups)


March 30, 2005

We are currently producing graphs that look like the following:

This graph is an example of the names that appeared from crawling the top ten pages from Google for one of our seed names. The graph is showing which names appeared in which pages. The "page-XXX" label indicates the page number, unique to our database, of the page.

We are also working on totaling the terms for each person to measure some text similarities.


March 29, 2005
We have currently crawled, parsed and indexed almost 54,000 web pages. From this information we have created retrieved all of the names that occur on each of these pages. Additionally, we have created our first graphs that display the co-occurrence of names on the same page. However, there are several pages that contain lots of names that even displaying the relationships from our seed set of 1,000 pages from Google is too clustered. Therefore, we are in discussions for determining what other relationships may be possible for this measurement. Our next step is to put together the graphs for other measurements.

March 3, 2005
Ruj and I are currently working on the crawling process of the pages. We have extracted over 11,500 names from NNDB. The extraction of these names included crawling their website and then filtering non-names from the list. After looking at the data, we are pretty pleased with the results. There is definitely a wide range of names!

Currently, we are using the BerkeleyDB Java Edition for data storage. This decision was made because it provided us with some flexibility that we may not be able to get from a relational databse. For example, some of the data that we are storing may not be represented efficiently using a relational database.

We have also successfully implemented the Google API for retrieving the top pages on a random list of people. Our goal is to parse the text of each of these pages to retrieve the URLs and store the relevant words associated with each person. We also have our code retrieving the top inlinks for each of the retrieved URLs from Google.

Jon is currently working on retrieveing all HTML links and text from the web pages. Our goal is to use this data to find associated names as well as calculate TFIDF values to look for similarities between the people we have searched.

January 25, 2005
Ruj and I met with Fil today to mention our project idea. Here are some items we talked more about:
  • Using NNDB for our list of names. We can parse each of the twenty six alphabetic pages for the people that NNDB stores.
  • Start with a large seed of pages and then look at all neighbors that are of link distance 1 from these pages. This includes both in and out link distance 1.
  • We will therefore build a very simple crawler because we don't need a full implementation.
  • Next we discussed possible measurements (metrics) to consider for associations. We discussed using a combination of both link and text based measurements and possibly combining these. Here were some of our thoughts.
    • TF / IDF
    • Collapsing vs. keeping each of the pages distinct. We could do different measures based on considering all of the documents as one vs. keeping them separate. We discussed using a clustering coefficent to see how clustered all the pages are together.
    • Link analysis - co-sited, co-referenced.

January 24, 2005
Ruj and I met for today to continue brainstorming. Here are some items we talked more about:
  • Picking specific websites to search. For example, wikipedia and source watch.
  • Starting with N names and determining relationships between these people.
  • Additional Items
    • Should we do a general crawl and not restrict to specific domains?
    • Should our system be able to determine a name? For example, store first names and try to find names.
    • What measurements should our system keep? What are important metrics?
      • Names found
      • Associations found - how do measure what the actual value should be?
      • Do we need to keep an "actual" measure?
  • Associations
    • Still need to determine what establishes a relation.
    • Do we have a bound where if there are more than N occurrences of two names then a relation exists.
  • First attempt at our solution:

    We have started to split our solution out into separate modules based on the various tasks needed to implement our solution. Currently, we are planning on the following modules:
    • Crawler - We will use an existing crawler to retrieve the data needed from the various sites. Our first thought is to use InfoSpider.
    • Parser - Take the data that the crawler has retrieved and parse the data for names and other links.
    • Data Storage - Take the data that was retrieved from the parser and store the associations and URL information (frontier information as well as URLs visited).
    • Relation Data Structures - Need a conversion process that reads the data from the database and converts the data into a view that can be used by the data view module. Our goal is to have the ability to store the data in different manners and not to have to worry about changing the data view portion. Using a framework like Model-View-Controller (MVC) to handle the separation of data and view.
    • Data View - Actually display the relationships from the data that we have found.
  • Initial ideas for storing the data (if we use a relational database:


January 21, 2005
Ruj and I met for the first time today on our project. We came up with some ideas that involve finding people on the web and determining relationships between them. However, there are some items that we need to dig deeper into to help determine what exactly we are going to do. Here are some items that we came up with during our initial brainstorming session:
  • Determing a person in a web page or using an existing repository.
    • How do we determine a person?
    • Parse text - store a list of the N most common first names and try to find names. For example, have a list {George, John, Bill, Frank, ...} and then parse the text looking for names like George Bush, George Roberts, John Smith.
    • Have a list of M people we are looking for and see what relationships exist between them? In other words, have a hard-coded list and not worry about parsing for new names.
    • Use a directory page and search for relationships among those names. For example, use a page like http://dir.yahoo.com/Entertainment/Movies_and_Film/Genres/Classic_Hollywood/Actors_and_Actresses/ which list all of the classic Hollywood actors and actresses. We could use lists to find such people as government officials, sports stars, movie stars, etc.
  • Determine relationship between people. Once we have decided on how we are going to use names or collect names the next step is to determine a relationship between those people. Some thought will be needed to determine our criteria for a relationship. We have come up with the following initial ideas:
    • Distance from search name? If a name appears within K words from our search name then does this constitute a relationship?
    • Do we need to keep track of the number of times two names appear together in our search? Does a relationship exist if two names appear over a certain threshold?
    • Can a relationship be directional? For example, if I have posted on my webpage that Michael Jordan is my favorite basketball player of all-time then does this mean that I have a relationship with Michael Jordan and, if so, Michael Jordan does not have a relationship with me because he doesn't even know me.
  • Storing information. We will need to look into the following data storage items:
    • What we will use to store the information that we crawl.
    • How do we store relationships? Need to store information in a manner that will allow us to display the relationships easily.
  • User Interface. We will need to look into the following UI items:
    • How do we present relationships? Do we use an existing technology and tie in with their API or do we create own own user interface.
    • Do we have a user interface for the searching process? Does the crawler need a user interface?
  • Searching. We will need to look into the following searching items:
    • How do we find names? What if a person is referred to using a subject? For example, George Bush addressed the country today with…. He also spoke with Condoleezza Rice about ... We would need to determine that the "He" in the previous example refers to George Bush. Does our system need to have language parsing included?
    • How do we take care of names in various formats or used with titles? For example, President Bush; Mr. Bush; George Bush; George W. Bush; Bush, George; etc.
    • If we only store common first names then we would need search rules for determining names. For example, we have "Michael" in our list of first names. Then we need to determine that "Michael Jordan" is a name and see what relationships exist with the name "Michael Jordan".
  • Additional Items to think about:
    • Focused Crawler - Is our system a version of a focused crawler? If we let the system try to figure out what a name is then we can give feedback to the system on what is a good name and what is not. We can have another module of the system find relationships on names.
    • Need to determine the focus of our project. Finding relationships, finding names, etc.? What is too broad and what is too narrow?