INFORMATION DEPENDENCIES

Edward Robertson, PI
Memhet Dalkilic, co-PI
Dirk Van Gucht, co-PI
Dennis Groth, research assistant
Chris M. Giannella, research assistant

Computer Science Department & School of Informatics
Indiana University, Bloomington

Contact Information

Edward Robertson
Computer Science Department
Indiana University
215 Lindley Hall
Bloomington IN 57405
Phone: (812) 855-4954   Fax : (812) 855-4829
Email: edrbtsn@cs.indiana.edu   URL

WWW PAGE

Project URL

Project Award Information

Keywords

information theory, information dependency measure, functional and multivalued dependency, entropy, information structure, datamining

Project Summary

This project focuses on Information Dependency (InD) measures and the application of these measures to databases and datamining. InD measures use classical (Shannon) information theory to evaluate the information structure of database relations. This work extends results by the investigators of this project which show how InD measures generalize concepts important in database design, namely functional and multivalued dependencies. Research in this project is taking place across the spectrum from theory to practice. On the theoretical side, deeper details of InD's are investigated with an eye toward mechanisms for manipulating and applying InD measures. On the theoretic side, properties of InD's are investigated with an eye toward manipulating and applying InD measures, as well as toward implications of InD's on modeling. In the center, techniques for computing the measures are being investigated. Because the ultimate goal of datamining is to inform the user, investigations also include the interaction of InD and visualization. On the applied side, the major focus is the application of InD measures on data mining. Recognizing that research into applications requires real rather than "toy" targets, this project seeks collaborations involving data mining: the first such collaboration being with researchers in Biology. All of the activities of this project ultimately lead toward the development of prototype toolkit components based on InD measures.

Goals, Objectives, and Targeted Activities

Publications

  • Catharine M. Wyss, Chris M. Giannella, and Edward Robertson: "FastFDs: A Heuristic-Driven Depth-First Algorithm for Mining Functional Dependencies from Relation Instances", Proceedings of the 3rd International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2001), Munich, Germany, September 2001. Published in Lecture Notes in Computer Science 2112.
  • Chris Giannella and Edward Robertson "On an Information Theoretic Approximation Measure for Functional Dependencies" Technical Report 555, Indiana University Computer Science Department, Aug 2001.
  • Dennis P. Groth and Edward L. Robertson: "Discovering Frequent Itemsets in the Presence of Highly Frequent Items", Web-Knowledge Management and Decision Support - Selected Papers from the 14th International Conference on Applications of Prolog, Edited by Bartenstein, Geske, Hannebauer and Yoshie, Springer Verlag, Lecture Notes (To Appear)
  • Dennis P. Groth and Edward L. Robertson: "An Integrated System for Database Visualization", The Sixth International Conference on Information Visualization, London, England, July, 2002 (To appear).
  • Chris M. Giannella, Mehmet M. Dalkilic, Dennis P. Groth, Edward L. Robertson: "Improving Query Evaluation with Approximate Functional Dependency Based Decompositions", Proceedings of the 19th British National Conference on Databases (BNCOD2002), July, 2002 (To appear).
  • Dennis P. Groth and Edward L. Robertson: "An Entropy-Based Approach to Visualizing Database Structure", Sixth IFIP Working Conference on Visual Database Systems, May, 2002 (To appear).
  • Dennis P. Groth and Edward L. Robertson: "An Integrated Approach to Database Visualization", Advanced Visual Interfaces 2002, May, 2002 (To appear).
  • Chris M. Giannella, Mehmet M. Dalkilic, Dennis P. Groth, Edward L. Robertson: "Using Horizontal-Vertical Decompositions to Improve Query Evaluation", Indiana University Computer Science, Technical Report 558, February, 2002.
  • Dennis P. Groth and Edward L. Robertson: "Discovering Frequent Itemsets in the Presence of Highly Frequent Items", Rule Based Data Mining 2001, Tokyo, Japan.
  • Edward L. Robertson, Lawrence Saxton, and Dirk Van Gucht: "Polynomial-Time Query Languages for Untyped Lists", submitted.
  • Chris M. Giannella and Edward L. Robertson: "A Comparison of Three Approximation Measures for Functional Dependencies in Relational Databases", submitted.
  • Chris M. Giannella and Edward L. Robertson: "A Note on Approximation Measures for Multi-valued Dependencies in Relational Databases", submitted.
  • Chris M. Giannella and Edward L. Robertson: "What's Been Lost in a Lossy Join Decomposition?", submitted.
  • Chris M. Giannella: "An Axiomatic Approach to Defining Approximation Measures for Functional Dependencies", submitted.
  • Project Impact

    In general we have found that, as expected, information dependency measures are valuable for characterizing the general "landscape" of a relation instance. They are potentially valuable in drill-down situations as well, where a specific investigation makes the computational cost worthwhile. More interesting are the indirect insights, such as new approaches to table decomposition; insights which do not directly depend upon the measures but are suggested by considering aspects of the measures.

    As a result of this project, two PhD students will finish this summer and a third next year. One MS student has switched to the PhD program. Several MS students have performed experiments relating this topic. Lecture materials for undergraduate classes have been developed, covering background and current research results

    Area References

  • T. Cover and J. Thomas, Elements of Information Theory, John Wiley & Sons, New York, 1991.
  • M. Dalkilic & E. Robertson, "Information dependencies", 2000 PODS, ACM, May 2000.
  • J. Kivinen & H. Mannila, "Approximate inference of functional dependencies from relations", Theoretical Computer Science, 1995.
  • Area Background

    Just as traditional information theory investigates the consequences of knowing the "surprisingness" of receiving a particular message, this work looks at measures based on the "surprisingness" of finding a particular value in a database table. These measures are especially significant when considered across multiple columns of a table, allowing us to phrase question such as "If we know the gender of an employee, does that provide any information about job_classification?" Questions such as this are generalized to information dependency [InD] measures. Requiring that an InD measure satisfy some particular arithmetic constraint may have interesting consequences on the structure of the underlying relation instance. For example, requiring that a particular measure equal 0 forces a functional dependency in the relation - knowing that a functional dependency exists has important consequences concerning data quality and redundancy. Furthermore, arithmetic inequalities that must always hold between various InD measures in a relation generalize correspondences between various functional dependencies and the like.