A list of my publications is here.
Below is a description of research assistantships/projects in which I have been involved.


AFD Mining

Approximate Functional Dependencies (AFDs) are a probabilistic extension of the traditional concept of Functional Dependencies. AFDs are the formalization of rules which are some measured "closeness" to being a Functional Dependency. The search space for AFD mining is powerset lattices where the error threshold characterizes a boundary in the lattice of acceptable rules.
Mined AFDs are used in a wide variety of applications. They have been shown useful in bootstrapping Bayesian classifiers, query optimizers, query answering, and other applications.

Engineering a Framework
The Modular Lattice Search (MoLS) framework is based on a conceptual modularization of algorithms used to search powerset lattices where the framework's template handles generic aspects of the algorithm while plug-ins complete the template and allow the mining process to be customized. The use of plug-ins makes MoLS a mining tool which can be adapted to developers' needs instead of having to add pre/post mining modules.

A part of MoLS's development was designing the Rule Store, a portion of the framework which manages information learned during the search process. This includes the ability to infer the status of rules yet to be considered based on the known status of rules already explored. Inferencing allows rules to be considered but create savings by avoiding the calculation of the approximation value.

Algorithm Development
The Lozenge Search algorithm is an iterate-by-attribute template search algorithm. It provides a skeleton for searching and then allows traditional search algorithms, such as breadth or depth first search, to be plugged in for searching a specific lattice.

The Rule Store providing inferencing functionality motivated a new consideration of what algorithm is best suited for AFD mining. From this came the development depth first search (DFS) algorithms which considers a larger search space than the traditionally used breadth first search (BFS), but with inferencing DFS algorithms' provide savings by calculating fewer approximation values. The final step in this process was the development of AKD a heuristic guided DFS algorithm which consistently improves performance by orders of magnitude.

Performance
We have developed a number of depth first and breadth first algorithm variants. In testing we have found that DFS algorithms taking advantage of the inferencing abilities of MoLS improved performance by up to 90% in some tests. DFS algorithms consistently improved performance by 30 and 40%.



Conceptual Mapping approach to Data Integration

I worked with Cognitive Psychologist, Rob Goldstone from January 2008-Summer 2009 to extend ABSURDIST the concept mapping system he developed. ABSURDIST represents a conceptual system as a graph with nodes as concepts and relationships between concepts as edges in the graph. All relationships are based on a generic model which is adapted to the specifics of relationships within a given domain. This allows ABSURDIST to potentially be applicable whenever information can be represented as graphs of concepts. ABSURDIST uses an iterative global error optimization algorithm to perform graph matching between the two conceptual systems. We adapted ABSURDIST to two different data integration problems.

Schema Matching
The AbsMatcher system integrates two data sets by treating attributes as concepts and then trying to find correspondences between attributes in each data set. In order to do this AbsMatcher first automatically mines a graph for each data set with attributes as concepts. AbsMatcher's contribution is the relationships which are automatically mined to create a graph for each data set. Entropy relationships use the data to look for statistical patterns which judge too unique to be random occurrences. Semantic relationships use attribute names to query Yahoo!. Based the number of results for queries the semantic relatedness of two attributes is measured and used to create edges in the graph. These forms of relationships are interesting because they are broadly applicable and not dependent on data being in a specific format These graphs representing the relatedness of attributes are then combined with direct comparison of attributes in two systems during the graph matching process.

GIS Data Integration
Most recently we have been using ABSUDIST to test how robust it is for matching graphs based on GIS data. We mine graphs using Yahoo! Local Search of varying qualities and then test how well ABSURDIST can matches in the face of information degradation.



Machine Learning

Multiple Instance Learning in Cheminformatics

Over the summer of 2007, I did an RAship with Rajarshi Guha. I used the R programming language to implement the EM algorithm, an iterative global optimization algorithm which uses linear regression.We implemented two new features as improvements. The first was seeding the algorithm with a model based on aggregate data values in order to provide more consistent results compared to seeding with a model based on random values. The second improvement was to use noise to help avoid local minimums.