Since finishing my PhD at the end of the Summer (2010), I have been working as a
post-doctoral researcher at the Data to Insight Center directed by
Dr. Beth Plale. The projects at D2I in which I am
involved focus on an interdisciplinary approach to preservation of scientific data through metadata and provenance.
Check out the exciting projects we are working on!
In August of 2010 I defended my dissertation, "An Adaptable Repository for Complex Scientific Metadata," which shows
how the XML metadata schemata used to describe scientific data differ from general XML and how these differences can be
exploited to capture, discover, and manage the metadata that describes scientific data.
Since shortly after starting my Ph.D., I have been working as a research assistant in Dr. Plale's lab. The Projects
section below contains details about some of the research projects I have been working on. The Publication and
Presentation sections provide information on peer-reviewed publications and also other presentations
I have done regarding my research.
My research focuses on metadata capture and management, reducing the misalignment of incentives for metadata
capture, applying data across domains, data provenance, provenance of network measurements, data grids,
services and SOA, XML, XML-Relational storage, and RDF. My dissertation work focused on identifying the
characteristics of XML-based metadata and differences from general XML storage that can be exploited to provide
faster query response in searching for e-Science data while using a flexible, scalable, and adaptable generic relational data
structure that can be applied to varied scientific domains using different metadata schemas and data hierarchies.
XML Metadata Concept Catalog (XMC Cat)
My research has focused on identifying characteristics of scientific metadata schemas and how
those characteristics can be exploited in cataloging metadata to provide end-users the ability to
easily compose and execute complex queries over domain metadata (without needing to learn SQL, XPath
or XQuery). This would increase the ability of scientists to discover and reuse data when contrasted with
the keyword search capabilities currently used in many scientific portals. However, a
second conflicting goal is to have a loose coupling between the domain-specific XML metadata schema
and the database schema used in the metadata catalog. This loose coupling is needed for a metadata
catalog framework to be deployable in a diversity of scientific domains through configuration instead
of code customization.
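The tension between rich queries and loose coupling can be illustrated with a minimal sketch. It is not the XMC Cat schema or API: the table `metadata_element`, its columns, and the functions `create_catalog`, `shred`, and `find_objects` are hypothetical names chosen for illustration. The idea shown is that a list of concept names (configuration) decides which parts of a domain XML document are stored in one generic relational table, so a different domain schema needs a different concept list rather than different code.

```python
import sqlite3
import xml.etree.ElementTree as ET


def create_catalog():
    """Create an in-memory catalog with one generic, domain-neutral table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE metadata_element ("
        " object_id TEXT, concept TEXT, element TEXT, value TEXT)"
    )
    return conn


def shred(conn, object_id, xml_text, concepts):
    """Store each leaf element found under a configured concept as one row."""
    root = ET.fromstring(xml_text)
    for concept in concepts:
        for node in root.iter(concept):
            for child in node:
                # Only leaf elements with text become queryable values.
                if len(child) == 0 and child.text:
                    conn.execute(
                        "INSERT INTO metadata_element VALUES (?, ?, ?, ?)",
                        (object_id, concept, child.tag, child.text.strip()),
                    )


def find_objects(conn, criteria):
    """Return ids of objects matching every (concept, element, value) triple."""
    result = None
    for concept, element, value in criteria:
        rows = conn.execute(
            "SELECT DISTINCT object_id FROM metadata_element"
            " WHERE concept = ? AND element = ? AND value = ?",
            (concept, element, value),
        )
        ids = {row[0] for row in rows}
        result = ids if result is None else result & ids
    return sorted(result or [])
```

A caller would shred each document with the concept list for its domain, for example `shred(conn, "file-1", xml_text, ["grid", "model"])`, and then combine criteria such as `[("model", "name", "WRF"), ("grid", "dx", "1000")]` without writing SQL, XPath, or XQuery.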
Additionally, in XMC Cat we are looking to reduce the incentive misalignment between those who can generate
metadata and those who will benefit from it by capturing it during the scientific process, which both increases
the value of the metadata to the researcher generating the data and reduces the cost of capturing
metadata through automation.
As a first step towards this goal of a configurable metadata catalog based on domain metadata schemas,
in the Spring of 2008 I rewrote the myLEAD metadata catalog using
Axis2, which allows
it to be a lighter-weight service than our previous software stack and allows greater
flexibility in configuring the web service used for the metadata catalog.
This on-going effort is the XML Metadata Concept Catalog.
As the volume of scientific data increases, a number of researchers have noted the need to capture metadata
automatically. This automated metadata capture needs to be done based on the metadata schema of the
domain in which a metadata catalog is deployed. In XMC Cat this is addressed by allowing plugins to be
registered which will do additional domain-specific harvesting of metadata from files being added to
the metadata catalog. This additional harvesting can be done asynchronously to prevent a performance cost
in adding files to the metadata catalog.
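The plugin idea described above can be sketched as follows. This is an illustration under stated assumptions, not the XMC Cat plugin interface: the class `HarvestRegistry`, its methods, the keying of plugins by file extension, and the example `line_count_plugin` are all hypothetical. The sketch shows the two properties the text names: plugins are registered rather than hard-coded, and harvesting runs in the background so that adding a file returns without waiting for it.

```python
from concurrent.futures import ThreadPoolExecutor


class HarvestRegistry:
    """Hypothetical registry of domain-specific metadata harvesting plugins."""

    def __init__(self, max_workers=2):
        self._plugins = {}  # file extension -> list of plugin callables
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def register(self, extension, plugin):
        """Register a callable that maps file content to a metadata dict."""
        self._plugins.setdefault(extension, []).append(plugin)

    def add_file(self, catalog, name, content):
        """Record basic metadata now; schedule plugin harvesting for later."""
        catalog[name] = {"size": len(content)}
        extension = name.rsplit(".", 1)[-1] if "." in name else ""
        futures = []
        for plugin in self._plugins.get(extension, []):
            futures.append(
                self._pool.submit(self._harvest, catalog, name, plugin, content)
            )
        return futures  # callers may ignore these or wait on them

    @staticmethod
    def _harvest(catalog, name, plugin, content):
        # Runs on a worker thread and merges harvested fields into the entry.
        catalog[name].update(plugin(content))


def line_count_plugin(content):
    """Example plugin (an assumption): harvest the number of lines."""
    return {"line_count": len(content.splitlines())}
```

In use, a deployment would call `register` once per plugin at configuration time; `add_file` then returns as soon as the basic entry is stored, and the harvested fields appear in the catalog entry once the returned futures complete.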
Link to prior version of the XMC Cat web page
Linked Environments for Atmospheric Discovery (LEAD)
LEAD is a multi-institution Large ITR research project that brings together computer scientists,
meteorological researchers, and meteorology educators in a collaborative effort. Through the LEAD
portal, researchers can search for data, compose complex forecasting workflows, and review their experiments.
My research in the LEAD project has focused on the myLEAD metadata catalog that allows meteorological researchers
to store metadata regarding data, ongoing experiments and research results and easily create complex
queries over their workspace. A hybrid XML-Relational approach is used to store the metadata that is communicated
using the LEAD Metadata Schema, which is a profile of the FGDC schema for spatial data.
The first Alpha
release of myLEAD was in May of 2005, followed by version 1.2 in the Spring of 2006 and
version 1.3 in August of 2007.
Relational Grid Resources (RGR)
In this project we developed a synthetic workload based on the GLUE schema for measuring the performance
of different server platforms (relational, XML, and LDAP) for storing metadata about resources in a grid
environment.