Information Dependencies in Databases
and Data Mining
0082407
Principal Investigator
Edward
L
Robertson
Computer Science Department ,
Indiana University
Lindley Hall 215
Bloomington
Indiana
47405
812-855-4954, FAX:
812-855-4829
edrbtsn@indiana.edu,
www.cs.indiana.edu/~edrbtsn
Co-PIs
Dirk
Van Gucht
Computer Science Department ,
Indiana University
Lindley Hall 215
Bloomington
Indiana
47405
812-855-6429, FAX:
812-855-4829
vgucht@indiana.edu,
www.cs.indiana.edu/~vgucht
Memhet (Memo)
Dalkilic
School of Informatics ,
Indiana University
Bloomington
Indiana
47405
812-855-4954, FAX:
812-856-3010
dalkilic@indiana.edu,
www.cs.indiana.edu/~dalkilic
Collaborators
Richard
Martin
Tinwisle Corp.
205 N. College Ave.
Bloomington
Indiana
47408
812-332-5453
richardm@tinwisle.com
Chris
M
Giannella
Computer Science Department ,
Indiana Unviersity
812-855-4318
, FAX: 812-855-4829
cgiannel@indiana.edu,
www.cs.indiana.edu/~cgiannel
Bassem
Sayrafi
Computer Science Department,
Indiana Unviersity
812-855-4318
, FAX: 812-855-4829
bsayrafi@indiana.edu,
www.cs.indiana.edu/~bsayrafi
Sun
Kim
School of Informatics,
Indiana University
(812)-856-5754
, FAX: (812) 856-0999
sunkim@bio.informatics.indiana.edu,
bio.informatics.indiana.edu/sunkim
Keywords
information theory ,
information dependency measure,
functional and multivalued dependency,
entropy,
information structure,
datamining
Project Summary
This project focuses on Information Dependency (InD) measures and the
application of these measures to databases and datamining. InD
measures use classical (Shannon) information theory to evaluate the
information structure of database relations. This work extends
results by the investigators of this project which show how InD
measures generalize concepts important in database design, namely
functional and multivalued dependencies.
Research in this project is taking place across the spectrum from
theory to practice. On the theoretical side, deeper details of InD's
are investigated with an eye toward mechanisms for manipulating and
applying InD measures. On the theoretic side, properties of InD's are
investigated with an eye toward manipulating and applying InD
measures, as well as toward implications of InD's on modeling. In the
center, techniques for computing the measures are being investigated.
Because the ultimate goal of datamining is to inform the user,
investigations also include the interaction of InD and visualization.
On the applied side, the major focus is the application of InD
measures on data mining. Recognizing that research into applications
requires real rather than "toy" targets, this project seeks
collaborations involving data mining: the first such collaboration
being with researchers in Biology.
All of the activities of this project ultimately lead toward the
development of prototype toolkit components based on InD measures.
Publications and Products
-
Chris M. Giannella and Edward L. Robertson,
"On Approximation Measures for Functional Dependencies",
Information Systems, to appear.
-
Chris M. Giannella and Edward L. Robertson, "A Note on Approximation
Measures for Multi-valued Dependencies in Relational Databases",
Information Processsing Letters, p. 153, vol. 85, (2003).
-
Chris M. Giannella and Edward L. Robertson, "What's Been Lost in a
Lossy Join Decomposition?", submitted.
-
Chris M. Giannella, "An Axiomatic Approach to Defining Approximation
Measures for Functional Dependencies", Lecture Notes in Computer
Science, p. 37, vol. 2435, (2002).
-
Edward Robertson and Catharine Wyss, "Optimal Tuple Merge is
NP-complete", submitted.
-
Catharine Wyss and Edward Robertson, "Relational Interoperability",
submitted.
-
Chris Gianella, Memhet Dalkilc, Dennis Groth, and Edward Robertson,
"Using Horizontal-Vertical Decompositions to Improve Query
Evaluation", Lecture Notes in Computer Science, p. 26, vol. 2405,
(2002).
-
I. Gunduz, S. Zhao, M. Dalkilic and S. Kim, "Motif Discovery from
Large Number of Sequences: A Case Study with Disease Resistance Genes
in Arabidopsis thaliana", 2003 International Multiconference in
Computer Science and Computer Engineering.
-
Bassem Sayrafi and Chris Giannella, "An Information Theoretic
Histogram for Single Dimentional Selectivity Estimation", submitted.
-
Paul Purdom, Dirk Van Gucht and Dennis Groth, "Average Case
Performance of the Apriori Algorithm", SIAM Journal of Computing,
submitted.
-
Hari Berkovici and Dirk Van Gucht, "An inequality for mixed
Lp-norms", Journal of Mathematical Inequalities, submitted.
-
Bassem Sayrafi, Dirk Van Gucht and Marc Gyssens, "Measures in
Databases and Data Mining", Information Sciences, submitted.
-
Richard Martin and Edward Robertson,
"A Comparison of Frameworks for Enterprise Architecture Modeling",
ER2003 - International Conference on Conceptual Modeling, to appear.
-
Edward Robertson,
"An Algebra for Triadic Relations",
submitted.
Project Impact
In general we have found that, as expected, information dependency measures
are valuable for characterizing the general "landscape" of
a relation instance.
They are potentially valuable in drill-down situations as well,
where a specific investigation makes the computational cost worthwhile.
More interesting are the indirect insights, such as new approaches
to table decomposition; insights which do not directly depend upon
the measures but are suggested by considering aspects of the measures.
As a result of this project, two PhD students finished last summer
and a third will finish during the current year.
One MS student has switched to the PhD program.
Several MS students have performed experiments relating this topic.
Lecture materials for undergraduate classes have been developed, covering
background and
current research results
Goals, Objectives and Targeted Activities
- relate information dependency measures to other structural concepts,
such as Kivenen & Mannila's approximate functional dependencies
- study how new relational decomposition patterns,
motivated by the above, impact such areas as query processing
- investigate the interaction between information dependency measures
(and other variations with an information-theoretic flavor)
and the operations of relational algebra
- study whether information dependency measures can help characterize
standard benchmarks
- investigate whether an information-theoretic
perspective can facilitate database visualization,
characterization of "interestingness", etc.
Area Background
Just as traditional information theory investigates the consequences
of knowing the "surprisingness" of receiving a particular message,
this work looks at measures based on the "surprisingness" of finding
a particular value in a database table.
These measures are especially significant when considered across
multiple columns of a table, allowing us to phrase question such
as "If we know the gender of an employee,
does that provide any information about
job_classification?"
Questions such as this are generalized to information dependency
[InD] measures.
Requiring that an InD measure satisfy some particular arithmetic
constraint may have interesting consequences on the structure of
the underlying relation instance.
For example, requiring that a particular measure equal 0 forces
a functional dependency in the relation -
knowing that a functional dependency exists has important consequences
concerning data quality and redundancy.
Furthermore, arithmetic inequalities that must always hold between
various InD measures in a relation generalize correspondences between
various functional dependencies and the like.
Area References
T. Cover and J. Thomas, Elements of Information Theory,
John Wiley & Sons, New York, 1991.
M. Dalkilic & E. Robertson, "Information dependencies",
2000 PODS, ACM, May 2000.
J. Kivinen & H. Mannila,
"Approximate inference of functional dependencies from relations",
Theoretical Computer Science, 1995.