May 2, 2005
The project recap page details our methodology for the project. The retrieve set consisted of the 53,532 web pages associated with 100 randomly selected people from NNDB (we call these 100 people the study set). The NNDB set consisted of the 1,148 NNDB pages associated with the same 100 people. After talking with Fil about which measurements would be appropriate, we decided to calculate the f-measure using three similarity methods: co-occurrence of names, cosine similarity of name occurrences, and cosine similarity of term occurrences. The graph below shows the resulting f-measure for each method. It also shows the effect of filtering the graph by edge weight, which answers questions such as: how does removing edges that have a small weight (i.e., edges between two weakly connected nodes) affect the f-measure?
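To make the edge-filtering idea concrete, here is a minimal sketch of computing an f-measure after dropping low-weight edges. It assumes the balanced F-measure (F1) over sets of person pairs; the log does not say exactly how the f-measure was set up, so the pair-based formulation, the function names `f_measure` and `filter_edges`, and all of the sample data are hypothetical.

```python
def f_measure(predicted, relevant):
    """Balanced F-measure (F1) of a predicted set against a reference set."""
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(relevant)
    return 2 * precision * recall / (precision + recall)


def filter_edges(weighted_edges, min_weight):
    """Keep only the edges whose weight is at least min_weight."""
    return {edge for edge, weight in weighted_edges.items() if weight >= min_weight}


# Hypothetical similarity graph: (person, person) -> edge weight.
weighted_edges = {
    ("A", "B"): 0.9,
    ("A", "C"): 0.2,
    ("B", "C"): 0.6,
    ("C", "D"): 0.1,
}
# Hypothetical reference pairs (for example, people who share a category).
reference_pairs = {("A", "B"), ("B", "C")}

# Sweep the weight threshold and watch how the f-measure responds.
for threshold in (0.0, 0.5, 0.8):
    kept = filter_edges(weighted_edges, threshold)
    print(threshold, round(f_measure(kept, reference_pairs), 3))
```

In this toy data a low threshold keeps weak, incorrect edges (hurting precision), while a high threshold drops correct ones (hurting recall), which is the trade-off the filtering graph is meant to expose.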
We analyzed the intersection of the retrieve and NNDB sets to make sure that there was no significant overlap. As the image below shows, the two sets had 331 pages in common.
The following two dendrograms were produced using the Jaccard coefficient based on name occurrences:
Dendrogram based on documents in the retrieve set.
Dendrogram based on documents in the NNDB set.
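For reference, the Jaccard coefficient between two people can be sketched as the size of the intersection of their name sets divided by the size of the union. The snippet below is an illustration only: the `names_by_person` data is made up, and treating each person as the set of names found in their documents is an assumption about how "name occurrences" were represented.

```python
def jaccard(a, b):
    """Jaccard coefficient: |A intersect B| / |A union B| (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical data: the names that occur in the documents for each person.
names_by_person = {
    "person_1": {"Alice Example", "Bob Example", "Carol Example"},
    "person_2": {"Bob Example", "Carol Example", "Dan Example"},
    "person_3": {"Erin Example"},
}

similarity = jaccard(names_by_person["person_1"], names_by_person["person_2"])
distance = 1.0 - similarity  # a dendrogram is typically built from distances
print(similarity, distance)  # 0.5 0.5
```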
The following two dendrograms were produced using the cosine similarity based on name occurrences:
Dendrogram based on documents in the retrieve set.
Dendrogram based on documents in the NNDB set.
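Cosine similarity differs from the Jaccard coefficient in that it uses how often each name (or term) occurs, not just whether it occurs. The following is a minimal sketch over count vectors stored as `Counter` objects; the sample names and counts are hypothetical, and the same function would apply unchanged to term counts.

```python
import math
from collections import Counter


def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two sparse count vectors."""
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[key] * counts_b[key] for key in shared)
    norm_a = math.sqrt(sum(value * value for value in counts_a.values()))
    norm_b = math.sqrt(sum(value * value for value in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_a * norm_b)


# Hypothetical name-occurrence counts for two people.
person_1 = Counter({"Bob Example": 3, "Carol Example": 4})
person_2 = Counter({"Bob Example": 4, "Dan Example": 3})

print(round(cosine_similarity(person_1, person_2), 2))  # 0.48
```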
The following two dendrograms were produced using the cosine similarity based on term occurrences:
Dendrogram based on documents in the retrieve set.
Dendrogram based on documents in the NNDB set.
From the previous six graphs we can see that the different similarity methods cluster our study set differently. Unfortunately, none of the methods produces similarity values that could be considered extremely "close", as the scale on the y-axis of each graph shows. Therefore, we decided to investigate how NNDB itself groups the study set. The image below shows the study set grouped by the categories designated on NNDB. There are a total of 84 categories provided on NNDB and our study set represents 34 of them. The image also shows that the "Actor" and "Musical" groups are the most represented, with 18 people each. Additionally, 26 groups have only one or two people associated with them. Because our study set covers 41% (34 / 83) of the total number of categories with less than 1% (100 / 11,800+) of the total people, we believe that clustering these people may be more difficult than it would be with a smaller number of categories.
Below are two graphs showing the NNDB groups alongside the clusters produced from the Jaccard coefficient dendrograms. We sliced the dendrogram at 34 clusters (the same as the number of NNDB categories represented in our study set) to see how well the Jaccard coefficient clusters match the NNDB categories.
Comparison of NNDB groups vs. clusters created using Jaccard coefficient on retrieve set with name occurrences.
Comparison of NNDB groups vs. clusters created using cosine similarity on retrieve set with name occurrences.
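Slicing a dendrogram at a fixed number of clusters is equivalent to running agglomerative clustering and stopping once that many clusters remain. The sketch below illustrates this with a Jaccard distance; note that average linkage is an assumption (the log does not state which linkage was used), and the function names and the five-person data set are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - Jaccard coefficient of two sets (1.0 if both are empty)."""
    union = a | b
    return 1.0 - (len(a & b) / len(union) if union else 0.0)


def cut_into_clusters(items, distance, k):
    """Merge the two closest clusters (average linkage) until only k remain."""
    clusters = [[item] for item in items]

    def average_distance(c1, c2):
        return sum(distance(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > max(k, 1):
        # Pick the pair of cluster indices (i < j) with the smallest distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical name sets for five people.
names = {
    "p1": {"a", "b", "c"},
    "p2": {"a", "b", "d"},
    "p3": {"x", "y"},
    "p4": {"x", "y", "z"},
    "p5": {"q"},
}

clusters = cut_into_clusters(
    list(names), lambda p, q: jaccard_distance(names[p], names[q]), k=3
)
print(sorted(sorted(cluster) for cluster in clusters))
# [['p1', 'p2'], ['p3', 'p4'], ['p5']]
```

With the real study set, `k` would be 34, and each resulting cluster could then be compared against the NNDB category of its members.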
Then, we compared the group distribution of our study set with the distribution on NNDB. The following two pie charts show that our distribution is very comparable to NNDB's:

March 30, 2005
We are currently producing graphs that look like the following:
This graph is an example of the names found by crawling the top ten Google pages for one of our seed names. It shows which names appeared on which pages. The "page-XXX" label is the page number, unique to our database, of each page. We are also working on totaling the terms for each person to measure some text similarities.

March 29, 2005
We have now crawled, parsed, and indexed almost 54,000 web pages. From this information we have retrieved all of the names that occur on each of these pages. Additionally, we have created our first graphs that display the co-occurrence of names on the same page. However, several pages contain so many names that even the graph of relationships from our seed set of 1,000 Google pages is too cluttered to read. Therefore, we are discussing what other relationships could be used for this measurement. Our next step is to put together the graphs for other measurements.
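As an illustration of the co-occurrence graph described above, the sketch below counts how many pages each pair of names shares. The optional `max_names_per_page` cap is purely a hypothetical way to tame name-heavy pages, not something we have settled on, and the page data is made up.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(names_by_page, max_names_per_page=None):
    """Count, for each pair of names, the number of pages on which both appear."""
    counts = Counter()
    for names in names_by_page.values():
        unique = sorted(set(names))  # sort so each pair has one canonical order
        if max_names_per_page is not None and len(unique) > max_names_per_page:
            continue  # hypothetical filter: skip pages that list too many names
        for pair in combinations(unique, 2):
            counts[pair] += 1
    return counts


# Hypothetical crawl output: page label -> names found on that page.
names_by_page = {
    "page-1": ["A", "B", "C"],
    "page-2": ["B", "A"],
    "page-3": ["A", "B", "C", "D", "E"],
}

print(cooccurrence_counts(names_by_page)[("A", "B")])  # 3
print(cooccurrence_counts(names_by_page, max_names_per_page=3)[("A", "B")])  # 2
```

The resulting counts can serve directly as edge weights in the name graph.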

March 3, 2005
Ruj and I are currently working on the crawling process of the pages.
We have extracted over 11,500 names from NNDB.
The extraction of these names involved crawling the NNDB website and then filtering
non-names from the list. After looking at the data, we are pretty pleased
with the results. There is definitely a wide range of names! Currently, we are using the BerkeleyDB Java Edition for data storage. We made this decision because it gives us some flexibility that we may not be able to get from a relational database. For example, some of the data that we are storing may not be represented efficiently in a relational database. We have also successfully used the Google API to retrieve the top pages for a random list of people. Our goal is to parse the text of each of these pages to retrieve the URLs and store the relevant words associated with each person. Our code also retrieves the top inlinks for each of the retrieved URLs from Google. Jon is currently working on retrieving all HTML links and text from the web pages. Our goal is to use this data to find associated names as well as to calculate TFIDF values to look for similarities between the people we have searched.
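The TFIDF calculation mentioned above could look roughly like the following. This is a sketch under assumptions: it uses one common variant (relative term frequency times the natural log of the inverse document frequency), treats all the terms collected for a person as one "document", and uses made-up terms; the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(terms_by_person):
    """Map each person to {term: tf-idf weight} over the given term lists."""
    n_docs = len(terms_by_person)
    # Document frequency: in how many people's term lists does each term occur?
    doc_freq = Counter()
    for terms in terms_by_person.values():
        doc_freq.update(set(terms))

    weights = {}
    for person, terms in terms_by_person.items():
        term_counts = Counter(terms)
        total = len(terms)
        weights[person] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        }
    return weights


# Hypothetical terms gathered from the pages retrieved for each person.
terms_by_person = {
    "person_1": ["actor", "film", "film"],
    "person_2": ["actor", "music"],
    "person_3": ["actor", "film"],
}

weights = tf_idf(terms_by_person)
print(weights["person_1"]["actor"])  # 0.0, because "actor" occurs for everyone
```

Terms shared by every person get a weight of zero, so the weights emphasize the words that distinguish one person from another.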

January 25, 2005
Ruj and I met with Fil today to discuss our project idea. Here are some
items we talked more about:

January 24, 2005
Ruj and I met today to continue brainstorming. Here are some items
we talked more about:

January 21, 2005
Ruj and I met for the first time today on our project. We came up with
some ideas that involve finding people on the web and determining relationships
between them. However, there are some items that we need to dig deeper into
to help determine what exactly we are going to do. Here are some items that
we came up with during our initial brainstorming session: