We are still dealing with the surprise factor of information
* not the surprise of receiving a message
* but the surprise of finding a value in a relation
Assumptions:
* based on relation instance(s)
* instance is a «multiset» (or «bag»)
where one
tuple may occur multiple times
1. r is a relation instance with attributes R
2. X subset R, with values x1, . . ., x(l)
similarly for Y, Z
3. XY means X UNION Y, etc.
4. Assume all tuples equally likely and
define
pX=xi (also written
pxi) as
( SELECT COUNT(*)
FROM r
WHERE X = xi )
_________________________
( SELECT COUNT(*)
FROM r )
5. Because limp->0(p×log(1/p)) = 0,
assume 0×log(1/0) = 0
Define the entropy of X in r as
(omit r when only one instance is being discussed)
(This is the same formula as H(P).)
H increases with X. That is, HX <= HXY
If r is a set instance (no duplicates) with k tuples, then HR = log(k)
The Information Dependency measure answers:
«what information do we need to determine Y provided we already know X?»
Abbreviate by «InD measure» and denote by HX->Y
Define by evaluating HY in each partition on xi, that is in
SELECT *
FROM r
WHERE X = xi and weight by
pX=xi
Equivalently: HX->Y = HXY - HX
Instance and entropies:
r
+-----------+------------+
|X | Y | HX = 7/4
+-----------+------------+
| | |
| a | e |
+-----------+------------+
| a | f | HY = 3/2
+-----------+------------+
| a | e |
+-----------+------------+
| a | f | HXY = 9/4
+-----------+------------+
| b | g |
+-----------+------------+
| b | g | HX->Y = 1/2
+-----------+------------+
| c | g |
+-----------+------------+
| d | g | HY->X = 3/4
+-----------+------------+
Encodings:
X XY Y
00 00 f
a 0
01 01 e
b 10 10
c 110 110 1 g
d 111 111
The following hold in all instances r:
1. HXY->X = 0
2. HX->Y + HX->Z >= HX->YZ
3. HXZ->YZ = HXZ->Y
4. HXY->Z <= HX->Z
5. HXZ->YZ <= HX->Y
6. HX->Y + HY->Z >= HX->Z
InD inequalities hold for any r, but we can also stipulate other numeric relationships.
=> InD constraint
e.g. HX->Y <= 0.5
Similar to «foreign key constraint»:
* an InD constraint only holds because
we have chosen to enforce it
Also, for some particular instance r, we might
observe (that is, calculate this H in r)
Theorem:
X->Y
IFF
HX->Y = 0
Moreover, Armstrong's Axioms are just special cases of InD inequalities:
* reflexivity: X->Y if YsubsetX (ineq. 1)
* monotonicity: X->Y => XZ->YZ (ineq. 5)
* transitivity: X->Y & Y->Z => X->Z (ineq. 6)
A new characterization of MVDs:
o X, Y, and Z partition R
Definition: X ->-> Y|Z in r IFF there exists r' and r'', over XY and XZ respectively, such that
When r is a set (no duplicates), this is a standard MVD characterization:
* r' = piXY(r)
* r'' = piXZ(r)
Theorem:
X ->-> Y|Z in r IFF HX->Y + HX->Z = HX->YZ
* Alternative definition of MVDs
works for both sets and
multisets
Numbers to the right are counts
r
+-----------+------------+------------+
| | | |
|X | Y | Z |
+-----------+------------+------------+
| | | |
| a | b | d | 8
+-----------+------------+------------+
| | | |
| a | b | e | 4
+-----------+------------+------------+
| | | |
| a | c | d | 4
+-----------+------------+------------+
| | | |
| a | c | e | 2
+-----------+------------+------------+
HX = 0
HX->Y = HY = HZ = HX->Z = 0.918
HX->YZ = HYZ = 1.837
r' r''
+-----------+------------+ +------------+------------+
| | | | | |
|X | Y | | X | Z |
+-----------+------------+ +------------+------------+
| | | | | |
| a | b | 4 | a | d | 2
+-----------+------------+ +------------+------------+
| | | | | |
| a | c | 2 | a | e | 1
+-----------+------------+ +------------+------------+
Consider H over
Bounds on d :
* d >= 0
* d <= e
Explaining why X ->-> Y|Z fails, decompose r into
r' = PIXY(r) and r'' = PIXZ(r)
This decomposition is lossless if r' |X| r'' = r and lossy otherwise
Example of lossy decomposition:
r r' r''
+-----------+------------+------------+ +------------+------------+ +------------+------------+
| | | | | | | | | |
|X | Y | Z | | X | Y | | X | Z |
+-----------+------------+------------+ +------------+------------+ +------------+------------+
| | | | | | | | | |
| a | b | d | | a | b | | a | d |
+-----------+------------+------------+ +------------+------------+ +------------+------------+
| | | | | | | | | |
| a | b | e | | a | c | | a | e |
+-----------+------------+------------+ +------------+------------+ +------------+------------+
| | | |
| a | c | d |
+-----------+------------+------------+
Q: r' |X| r'' superset r, so what is lost?
A: HX->Z - HXZ->Y ( e - d in previous diagram )
* known as «joint information between Y and Z given X»
Datacube technique computes the counts needed to compute H
* standard datamining operation
* compute counts on all subsets of R
* huge output
=> costly (unnecessarily so?)
=> heuristic disregards small cases
* optimization + heuristics
* currently used heuristics may not apply heuristics with small datacube errors may have big impact on entropy
* new ones might
Given r, what are all X,Y such that
or any other constant instead of 0.10
Given r, for what partitions XYZ of R is
HX->Y + HX->Z - HX->YZ <= .10?
Kivinen and Mannila define three measures for approximate functional dependencies:
g1 fraction of violating pairs
g2 fraction of violating tuples
g3 minimum fraction of tuples removed to eliminate all violations
where s and t are violating if
s.X = t.X & s.Y <>
t.Y
InD's can make distinctions that the g measures cannot
+------------+------------+
| | |
| X | Y |
+------------+------------+
| | |
| d | 1 | |
| | |
| d | 1 | |
| | |
| a | 1 | |
| | |
| a | 2 | |
| | |
| a | 3 | |
| | | t
| a | 4 | | |
| | |
| | a | 5 | | |
| | |
| | a | 6 | | s |
| | |
r | | b | 1 | | |
| | |
| | c | 1 | | |
| | |
| | c | 2 |
+------------+------------+
| |
| |
| HX HX->Y g1 g2 g3 |
-------------------+----------------------------------------------------------------------------------------------------------------------------------------+
| |
r | 1.52 .80 .16 .8 .4 |
| |
s | 1.37 .95 .36 .8 .4 |
| |
t | 1.57 1.55 .36 .8 .4 |
-------------------+----------------------------------------------------------------------------------------------------------------------------------------+
Data Warehouse:
Archived transaction data organized to support subsequent «decision support» processing
Use of InD measures:
indicate what data and relationships may be significant
e.g. «partial» functional dependencies may indicate space savings
Information maps of two database benchmarks
each point
corresponds to one attribute
Which benchmark is natural and which is synthetic?
* Given r over X, Y, and Z, where HX->Y < e
* (so X->Y holds approximately)
* Decompose r by:
* subtract «correction» rc from r so FD holds in remainder
* do FD decomposition of remainder into r' (over XZ) and r'' (over XY)
Thus r = r' |X| r'' U rc
* Improves performance of certain queries over a wide range of e
* More difficult with MVDs
Dissertation of Jeremy Engle
Search space similar to many other datamining problems
set of
subsets of {A1, . . . An}
New global algorithmic approach: attribute at a time
each
iteration adds one more attribute to exploration
Framework for experimentation with local search tactics
Opportunities to incorporate machine learning, visualization, . . .
Extend to multiple relations:
adding probabilities facilitates set-based relational algebra
Consider InD constraint systems as a logical formalism
linear algebra does a lot
More work with information theory and statistics
e.g. the best hash function maximizes entropy
© Copyright Edward Robertson, 2109-3-5