Information Dependencies in Databases and Data Mining

0082407

Principal Investigator

Edward L Robertson
Computer Science Department
, Indiana University
Lindley Hall 215

Bloomington
Indiana 47405
812-855-4954, FAX: 812-855-4829
edrbtsn@indiana.edu, www.cs.indiana.edu/~edrbtsn

Co-PIs

Dirk Van Gucht
Computer Science Department
, Indiana University
Lindley Hall 215

Bloomington
Indiana 47405
812-855-6429, FAX: 812-855-4829
vgucht@indiana.edu, www.cs.indiana.edu/~vgucht

Memhet (Memo) Dalkilic
School of Informatics
, Indiana University
Bloomington
Indiana 47405
812-855-4954, FAX: 812-856-3010
dalkilic@indiana.edu, www.cs.indiana.edu/~dalkilic

Collaborators

Hari Bercovici
Mathematics Department, Indiana University

Nele Dexters
Department Wiskunde-Informatica, Universiteit Antwerpen, Antwerp, Belgium

Chris M Giannella
Computer Science and Electrical Engineering
, University of Maryland, Baltimore County
(410) 455-8938

cgiannel@cs.umbc.edu, www.cs.umbc.edu/~cgiannel

Mark Gyssens
Departement Wiskunde-Informatica,
Universiteit Antwerpen, Antwerp, Belgium

Sun Kim
School of Informatics
, Indiana University
(812)-856-5754
, FAX: (812) 856-0999
sunkim@bio.informatics.indiana.edu
, bio.informatics.indiana.edu/sunkim

Richard Martin
Tinwisle Corp.
205 N. College Ave.
Bloomington Indiana 47408
812-332-5453 richardm@tinwisle.com

Paul W Purdom
Computer Science Department
, Indiana University
(812)-855-1507
, FAX: (812) 855-4829
pwp@indiana.edu
, www.cs.indiana.edu/pwp

Bassem Sayrafi
Computer Science Department
, Indiana University
812-855-4318
, FAX: 812-855-4829
bsayrafi@indiana.edu, www.cs.indiana.edu/~bsayrafi

John A. Springer
Computer Science Department
, Indiana University
812-855-4318
, FAX: 812-855-4829
jospring@indiana.edu, www.cs.indiana.edu/~jospring

Catharine M. Wyss
School of Informatics
, Indiana University
(812)-856-1099
, FAX: (812) 856-0999
cmw@indiana.edu
, www.cs.informatics.indiana.edu/~cmw

Keywords

information theory , information dependency measure, functional and multivalued dependency, entropy, information structure, datamining

Project Summary

This project focuses on Information Dependency (InD) measures and the application of these measures to databases and datamining. InD measures use classical (Shannon) information theory to evaluate the information structure of database relations. This work extends results by the investigators of this project which show how InD measures generalize concepts important in database design, namely functional and multivalued dependencies. Research in this project is taking place across the spectrum from theory to practice. On the theoretical side, deeper details of InD's are investigated with an eye toward mechanisms for manipulating and applying InD measures. On the theoretic side, properties of InD's are investigated with an eye toward manipulating and applying InD measures, as well as toward implications of InD's on modeling. In the center, techniques for computing the measures are being investigated. Because the ultimate goal of datamining is to inform the user, investigations also include the interaction of InD and visualization. On the applied side, the major focus is the application of InD measures on data mining. Recognizing that research into applications requires real rather than "toy" targets, this project seeks collaborations involving data mining: the first such collaboration being with researchers in Biology. All of the activities of this project ultimately lead toward the development of prototype toolkit components based on InD measures.

Publications and Products

  • Chris M. Giannella and Edward L. Robertson, "On Approximation Measures for Functional Dependencies", Information Systems, 29, 6, pp. 483-507.
  • Catharine Wyss and Edward Robertson, "Relational Languages for Metadata Information", ACM Transactions on Database Systems, to appear.
  • Richard Martin and Edward Robertson, ``A Comparison of Frameworks for Enterprise Architecture Modeling'', ER2003 - International Conference on Conceptual Modeling.
  • Edward Robertson, "Triadic Relations: an Algebra for the Semantic Web", Semantic Web and Databases '04 (forthcoming in Lecture Notes in Computer Science
  • Richard Martin, Edward Robertson, and John Springer. "Architectural Principles for Enterprise Frameworks", Knowledge and Model Driven Information Systems Engineering for Networked Organizations (EMMSAD '04), J. Grundspenkis and M. Kirikova (eds), 151-162.
  • Chris Giannella, Jiawei Han, Edward Robertson, and Chao Liu, "Mining Frequent Itemsets Over Arbitrary Time Intervals in Data Streams", Computer Science Department Technical Report 587, Indiana University, Nov 2003. An older version: C. Giannella, J. Han, J. Pei, X. Yan and P.S. Yu. "Mining Frequent Patterns in Data Streams at Multiple Time Granularities", Data Mining: Next Generation Challenges and Future Directions, AAAI/MIT Press, H. Kargupta, A. Joshi, K. Sivakumar, and Y. Yesha (eds.), 2003.
  • Mehmet Dalkilic and Arijit Sengupta, "A Logic-theoretic classifier called Circle", The Eighth International Conference on Control, Automation, Robotics, and Vision.
  • Mehmet Dalkilic and James Costello, "BioKnOT: Biological Knowledge through Ontologies and TFIDF", SIGIR Bio Sheffield'04.
  • Mehmet Dalkilic and Arijit Sengupta, "Semantic Thumbnails: A Novel Method for Summarizing Document Collections", SIGDOC '04.
  • Mehmet Dalkilic and Arijit Sengupta, "CATPA: Curation and Alignment Tool for Protein Analysis", submitted.
  • Bassem Sayrafi and Chris Giannella, "An Information Theoretic Histogram for Single Dimensional Selectivity Estimation", submitted.
  • Paul Purdom, Dirk Van Gucht and Dennis Groth, "Average Case Performance of the Apriori Algorithm", SIAM Journal of Computing, to appear.
  • Bassem Sayrafi and Dirk Van Gucht, ``Inference Systems from Additive Measures'', Proc. of the Workshop on Causality and Causal Reasoning (in Conjunction with the 17th Canadian Conference on Artificial Intelligence).
  • Hari Berkovici and Dirk Van Gucht, "An inequality for mixed Lp-norms", Journal of Mathematical Inequalities, to appear.
  • Bassem Sayrafi, Dirk Van Gucht and Marc Gyssens, "Measures in Databases and Data Mining", submitted.
  • Paul Purdom, Dirk Van Gucht, and Nele Dexters, ``Analysis of Candidacy-Based Frequent Itemset Algorithms'', submitted.
  • Paul Purdom, Bassem Sayrafi, Dirk Van Gucht, ``On the Effectiveness of Frequency Approximation Rules for the Frequent Itemsets Problem'', submitted.

Project Impact

In general we have found that, as expected, information dependency measures are valuable for characterizing the general "landscape" of a relation instance. They are potentially valuable in drill-down situations as well, where a specific investigation makes the computational cost worthwhile. More interesting are the indirect insights, such as new approaches to table decomposition; insights which do not directly depend upon the measures but are suggested by considering aspects of the measures. We have also discovered that information dependency measures are but a special case in a framework for measures in information systems. This framework allows for the general study of the properties of measures and how they may be applied to specific problems in databases and data mining. Of particular focus is the applicability of the frequent itemsets problem in data mining.

As a result of this project, three PhD students have finished, a fourth will finish during the current year, and others are in the pipeline. Several MS students have performed experiments relating this topic. Lecture materials for undergraduate classes have been developed, covering background and current research results

Goals, Objectives and Targeted Activities

  • relate information dependency measures to other structural concepts, such as Kivenen & Mannila's approximate functional dependencies
  • study how new relational decomposition patterns, motivated by the above, impact such areas as query processing
  • investigate the interaction between information dependency measures (and other variations with an information-theoretic flavor) and the operations of relational algebra
  • study whether information dependency measures can help characterize standard benchmarks
  • investigate whether an information-theoretic perspective can facilitate database visualization, characterization of "interestingness", etc.

Area Background

Just as traditional information theory investigates the consequences of knowing the "surprisingness" of receiving a particular message, this work looks at measures based on the "surprisingness" of finding a particular value in a database table. These measures are especially significant when considered across multiple columns of a table, allowing us to phrase question such as "If we know the gender of an employee, does that provide any information about job_classification?" Questions such as this are generalized to information dependency [InD] measures. Requiring that an InD measure satisfy some particular arithmetic constraint may have interesting consequences on the structure of the underlying relation instance. For example, requiring that a particular measure equal 0 forces a functional dependency in the relation - knowing that a functional dependency exists has important consequences concerning data quality and redundancy. Furthermore, arithmetic inequalities that must always hold between various InD measures in a relation generalize correspondences between various functional dependencies and the like.

Area References

  • T. Cover and J. Thomas, Elements of Information Theory, John Wiley & Sons, New York, 1991.
  • M. Dalkilic & E. Robertson, "Information dependencies", 2000 PODS, ACM, May 2000.
  • J. Kivinen & H. Mannila, "Approximate inference of functional dependencies from relations", Theoretical Computer Science, 1995.