Information Dependencies

Information Dependencies

prepared March 5, 2009

Slides by Topic

  • Introduction
  • Database Context
  • Notations & Conventions
  • InD Measure
  • Probability & Entropy from Counts
  • InD Definition
  • Example
  • InD Inequalities
  • Inequality Facts
  • InD Constraints
  • Constraints and Observations
  • InDs and Traditional Dependencies
  • Functional Dependencies
  • Multivalued Dependencies
  • Example Decomposition
  • Relating FDs and MVDs
  • Explaining «Lossless» Joins
  • Applications
  • Computing H
  • Approximate FDs and MVDs
  • Comparing Functional Dependency Approximations
  • Approximate Functional Dependency Example
  • Data Warehousing
  • Visualizing Entropy Properties
  • Query Optimization
  • Mining for AFD's
  • Other Future Work

  • Database Context

    We are still dealing with the surprise factor of information

    * not the surprise of receiving a message

    * but the surprise of finding a value in a relation

    Assumptions:

    * based on relation instance(s)

    * instance is a «multiset» (or «bag»)
    where one tuple may occur multiple times


    Notations & Conventions

    1. r is a relation instance with attributes R

    2. X subset R, with values x1, . . ., x(l)

    similarly for Y, Z

    3. XY means X UNION Y, etc.

    4. Assume all tuples equally likely and
    define pX=xi (also written pxi) as

      ( SELECT COUNT(*)
        FROM r
        WHERE X = xi )
       _________________________
      ( SELECT COUNT(*)
        FROM r ) 

    5. Because limp->0(p×log(1/p)) = 0,

    assume 0×log(1/0) = 0


    Probability & Entropy from Counts

    Define the entropy of X in r as

    HX(r) = SUMli=1 pxi × log(1/pxi)

    (omit r when only one instance is being discussed)

    (This is the same formula as H(P).)

    H increases with X. That is, HX <= HXY

    If r is a set instance (no duplicates) with k tuples, then HR = log(k)


    InD Definition

    The Information Dependency measure answers:

    «what information do we need to determine Y provided we already know X

    Abbreviate by «InD measure» and denote by HX->Y

    Define by evaluating HY in each partition on xi, that is in

        SELECT *
        FROM r
        WHERE X = xi 
    and weight by pX=xi

    Equivalently: HX->Y = HXY - HX


    Example

    Instance and entropies:

    
                 r
              +-----------+------------+
              |X | Y |              HX    =   7/4
              +-----------+------------+
    
    
              |           |            |
              |    a      |     e      |
              +-----------+------------+
              |    a      |     f      |              HY    =   3/2
              +-----------+------------+
              |    a      |     e      |
              +-----------+------------+
              |    a      |     f      |    HXY    =   9/4
              +-----------+------------+
              |    b      |     g      |
              +-----------+------------+
              |    b      |     g      |  HX->Y    =   1/2
              +-----------+------------+
              |    c      |     g      |
              +-----------+------------+
              |    d      |     g      |  HY->X    =   3/4
              +-----------+------------+
    
              

    Encodings:

    
    
              X   XY   Y
    
                           00                     00      f
              a    0
                           01                     01      e
    
              b    10      10
    
              c    110     110                    1       g
    
              d    111     111
    
    
              


    Inequality Facts

    The following hold in all instances r:

    1. HXY->X = 0

    2. HX->Y + HX->Z >= HX->YZ

    3. HXZ->YZ = HXZ->Y

    4. HXY->Z <= HX->Z

    5. HXZ->YZ <= HX->Y

    6. HX->Y + HY->Z >= HX->Z


    Constraints and Observations

    InD inequalities hold for any r, but we can also stipulate other numeric relationships.

    => InD constraint

    e.g. HX->Y <= 0.5

    Similar to «foreign key constraint»:

    * an InD constraint only holds because
    we have chosen to enforce it

    Also, for some particular instance r, we might
    observe (that is, calculate this H in r)

    HX->Y(r) = 0.463 <= 0.5


    Functional Dependencies

    Theorem:
    X->Y IFF HX->Y = 0

    Moreover, Armstrong's Axioms are just special cases of InD inequalities:

    * reflexivity: X->Y if YsubsetX (ineq. 1)

    * monotonicity: X->Y => XZ->YZ (ineq. 5)

    * transitivity: X->Y & Y->Z => X->Z (ineq. 6)


    Multivalued Dependencies

    A new characterization of MVDs:

    o X, Y, and Z partition R

    Definition: X ->-> Y|Z in r IFF there exists r' and r'', over XY and XZ respectively, such that

    r = r' |X| r''

    When r is a set (no duplicates), this is a standard MVD characterization:

    * r' = piXY(r)

    * r'' = piXZ(r)

    Theorem:

    X ->-> Y|Z in r IFF HX->Y + HX->Z = HX->YZ

    * Alternative definition of MVDs
    works for both sets and multisets


    Example Decomposition

    Numbers to the right are counts

    
    
                              r
                    +-----------+------------+------------+
                    |           |            |            |
                    |X | Y | Z |
                    +-----------+------------+------------+
                    |           |            |            |
                    |    a      |     b      |     d      |  8
                    +-----------+------------+------------+
                    |           |            |            |
                    |    a      |     b      |     e      |  4
                    +-----------+------------+------------+
                    |           |            |            |
                    |    a      |     c      |     d      |  4
                    +-----------+------------+------------+
                    |           |            |            |
                    |    a      |     c      |     e      |  2
                    +-----------+------------+------------+
    
    
    

    HX = 0

    HX->Y = HY = HZ = HX->Z = 0.918

    HX->YZ = HYZ = 1.837

    
    
          r'                   r''
       +-----------+------------+            +------------+------------+
       |           |            |            |            |            |
       |X | Y |            | X | Z |
       +-----------+------------+            +------------+------------+
       |           |            |            |            |            |
       |    a      |     b      |  4         |     a      |     d      |  2
       +-----------+------------+            +------------+------------+
       |           |            |            |            |            |
       |    a      |     c      |  2         |     a      |     e      |  1
       +-----------+------------+            +------------+------------+
    
    
    


    Relating FDs and MVDs

    Consider H over

    expand this figure

    Bounds on d :

    * d >= 0

    * d <= e


    Explaining «Lossless» Joins

    Explaining why X ->-> Y|Z fails, decompose r into

    r' = PIXY(r) and r'' = PIXZ(r)

    This decomposition is lossless if r' |X| r'' = r and lossy otherwise

    Example of lossy decomposition:

    
    
              r                      r'              r''
    +-----------+------------+------------+       +------------+------------+       +------------+------------+
    |           |            |            |       |            |            |       |            |            |
    |X | Y | Z |       | X | Y |       | X | Z |
    +-----------+------------+------------+       +------------+------------+       +------------+------------+
    |           |            |            |       |            |            |       |            |            |
    |    a      |     b      |     d      |       |     a      |     b      |       |     a      |     d      |
    +-----------+------------+------------+       +------------+------------+       +------------+------------+
    |           |            |            |       |            |            |       |            |            |
    |    a      |     b      |     e      |       |     a      |     c      |       |     a      |     e      |
    +-----------+------------+------------+       +------------+------------+       +------------+------------+
    |           |            |            |
    |    a      |     c      |     d      |
    +-----------+------------+------------+
    
    
    

    Q: r' |X| r'' superset r, so what is lost?

    A: HX->Z - HXZ->Y ( e - d in previous diagram )

    * known as «joint information between Y and Z given X»


    Computing H

    Datacube technique computes the counts needed to compute H

    * standard datamining operation

    * compute counts on all subsets of R

    * huge output

    => costly (unnecessarily so?)

    => heuristic disregards small cases

    * optimization + heuristics

    * currently used heuristics may not apply heuristics with small datacube errors may have big impact on entropy

    * new ones might


    Approximate FDs and MVDs

    Given r, what are all X,Y such that

    HX->Y <= 0.10?

    or any other constant instead of 0.10

    Given r, for what partitions XYZ of R is

    HX->Y + HX->Z - HX->YZ <= .10?


    Comparing Functional Dependency Approximations

    Kivinen and Mannila define three measures for approximate functional dependencies:

    g1 fraction of violating pairs

    g2 fraction of violating tuples

    g3 minimum fraction of tuples removed to eliminate all violations

    where s and t are violating if
    s.X = t.X & s.Y <> t.Y

    InD's can make distinctions that the g measures cannot


    Approximate Functional Dependency Example

    
                                        +------------+------------+
                                        |            |            |
                                        | X | Y |
                                        +------------+------------+
                                        |            |            |
                                        |     d      |     1      |                        |
                                        |            |            |
                                        |     d      |     1      |                        |
                                        |            |            |
                                        |     a      |     1      |                        |
                                        |            |            |
                                        |     a      |     2      |                        |
                                        |            |            |
                                        |     a      |     3      |                        |
                                        |            |            |                                 t
                                        |     a      |     4      | |                      |
                                        |            |            |
                                      | |     a      |     5      | |                 |
                                        |            |            |
                                 | |     a      |     6      | | s   |
                                        |            |            |
                   r | |     b      |     1      | |                 |
                                        |            |            |
                                 | |     c      |     1      | |                      |
                                        |            |            |
                                      | |     c      |     2      |
                                        +------------+------------+
    
    
           

    
    
                              |                                                                                                                                        |
                              |                                                                                                                                        |
                              | HX   HX->Y   g1   g2   g3 |
           -------------------+----------------------------------------------------------------------------------------------------------------------------------------+
                              |                                                                                                                                        |
           r |          1.52                            .80                           .16                       .8                       .4           |
                              |                                                                                                                                        |
           s |          1.37                            .95                           .36                       .8                       .4           |
                              |                                                                                                                                        |
           t |          1.57                           1.55                           .36                       .8                       .4           |
           -------------------+----------------------------------------------------------------------------------------------------------------------------------------+
    
    
    


    Data Warehousing

    Data Warehouse:

    Archived transaction data organized to support subsequent «decision support» processing

    Use of InD measures:

    indicate what data and relationships may be significant

    e.g. «partial» functional dependencies may indicate space savings


    Visualizing Entropy Properties

    Information maps of two database benchmarks

    each point corresponds to one attribute

    Which benchmark is natural and which is synthetic?

    expand this figure


    Query Optimization

    * Given r over X, Y, and Z, where HX->Y < e

    * (so X->Y holds approximately)

    * Decompose r by:

    * subtract «correction» rc from r so FD holds in remainder

    * do FD decomposition of remainder into r' (over XZ) and r'' (over XY)

    Thus r = r' |X| r'' U rc

    * Improves performance of certain queries over a wide range of e

    provided that query optimizer doesn't interfere

    * More difficult with MVDs


    Mining for AFD's

    Dissertation of Jeremy Engle

    Search space similar to many other datamining problems
    set of subsets of {A1, . . . An}

    New global algorithmic approach: attribute at a time
    each iteration adds one more attribute to exploration

    order of picking attributes important

    Framework for experimentation with local search tactics

    Opportunities to incorporate machine learning, visualization, . . .


    Other Future Work

    Extend to multiple relations:

    adding probabilities facilitates set-based relational algebra

    Consider InD constraint systems as a logical formalism

    linear algebra does a lot

    More work with information theory and statistics

    e.g. the best hash function maximizes entropy

    © Copyright Edward Robertson, 2109-3-5