Project Reporting FINAL REPORT FOR AWARD # 0082407

Edward L Robertson ; Indiana University
Information Dependencies in Databases and Data Mining

Participant Individuals:
CoPrincipal Investigator(s) : Dirk Van Gucht; Mehmet M Dalkilic
Graduate student(s) : Chris M Giannella; Bassem Sayrafi; John A Springer

Partner Organizations:
Center for Genomics & Bioinformatics - IU: Collaborative Research
Source for bioinformatics problems.

Internat'l Org for Standardization: Collaborative Research
ISO Technical Comm 184/SubComm 5/Working Group 1 provides both
motivation and a sounding board for research in high-level modeling.
Richard Martin (see collaborators) is convenor of TC 184/SC 5/WG 1.

Other collaborators:

Chris Giannella and Edward Robertson have worked with
Jiawei Han of the University of Illinois at Urbana-Champaign.

Edward Robertson and John Springer have worked with Richard
A Martin, a private consultant in Bloomington, IN, on enterprise
architecture frameworks.

Dirk Van Gucht and his students have worked with Jan Paredaens, Marc
Gyssens, and Nele Dexters of the University of Antwerp, Belgium,
and Jan Van den Bussche and Stijn Vansummeren of Univ. Hasselt.
Van Gucht has also collaborated with Hari Bercovici of the IU
Mathematics Department.

Edward Robertson is working with Cathy Wyss of Indiana University
and her student George Fletcher on metadata formalisms.

Mehmet Dalkilic is working with Sun Kim of Indiana University
and students L. Do Hoon and J.H. Choi on bioinformatics.
He is also working with Arijit Sengupta, now of Wright State
University, and students H. Kim and M. Fox on ontologies.

Activities and findings:

Research and Education Activities: 
1. conceptual and theoretical analysis of the basic principles of information and measurement in a database context
2. simulation experiments relating to the application of Information Dependencies, and information theory in general, to query optimization, histogram construction, etc.
3. weekly seminar series

Findings:
1. a better understanding of the strengths and weaknesses of the InD approach
2. placing the InD approach in a theoretical context and generalizing the InD measure to other measures, approximation results, and measure constraint inferencing
3. promising applications in query optimization and histogram construction
4. understanding that issues involving metadata were inherent in the Information Dependency approach, with the additional understanding that metadata concerns had a much wider range, from implementation through querying to modeling. Thus additional work in these areas has been undertaken, producing an elegant algebra for relational metadata querying and initiating work on enterprise-level modeling.
5. a very broad approach to measures in databases, relating to such things as level of support as well as the information theoretic measures which initiated this work.

Training and Development:
1. incorporation of information theory in information systems courses (surprisingly, there has been essentially no connection between these areas up to this point)

Outreach Activities:
working with ISO on various metamodeling standards

Journal Publications:
Catharine M. Wyss, Chris M. Giannella, and Edward Robertson, "FastFDs: A Heuristic-Driven Depth-First Algorithm for Mining Functional Dependencies from Relation Instances", Proceedings of the 3rd International Conference on Data Warehousing and Knowledge Discovery, vol. , (2002), p. . Accepted
Chris Giannella and Edward Robertson, "On an Information Theoretic Approximation Measure for Functional Dependencies", Information Processing Letters, vol. 85, (2003), p. 153. Published
Dennis P. Groth and Edward L. Robertson, "Discovering Frequent Itemsets in the Presence of Highly Frequent Items", 14th International Conference on Applications of Prolog, vol. , (2002), p. . Accepted
Dennis P. Groth and Edward L. Robertson, "An Integrated System for Database Visualization", The Sixth International Conference on Information Visualization, vol. , (2002), p. 462. Published
Chris M. Giannella, Mehmet M. Dalkilic, Dennis P. Groth, Edward L. Robertson, "Improving Query Evaluation with Approximate Functional Dependency Based Decompositions", Proceedings of the 19th British National Conference on Databases, vol. , (2002), p. 26. Published
Dennis P. Groth and Edward L. Robertson, "An Entropy-Based Approach to Visualizing Database Structure", Sixth IFIP Working Conference on Visual Database Systems, vol. , (2002), p. 157. Published
Dennis P. Groth and Edward L. Robertson, "An Integrated Approach to Database Visualization", Advanced Visual Interfaces 2002, vol. , (2002), p. 365. Published
Dennis P. Groth and Edward L. Robertson, "Discovering Frequent Itemsets in the Presence of Highly Frequent Items", Rule Based Data Mining 2001, vol. , (2002), p. 237. Published
Edward L. Robertson, Lawrence Saxton, and Dirk Van Gucht, "Polynomial-Time Query Languages for Untyped Lists", International Conference on Database Theory, vol. , (2002), p. . Submitted
Chris M. Giannella and Edward L. Robertson, "A Comparison of Three Approximation Measures for Functional Dependencies in Relational Databases", IPL, vol. , (2002), p. . Submitted
Chris M. Giannella and Edward L. Robertson, "A Note on Approximation Measures for Multi-valued Dependencies in Relational Databases", Information Processing Letters, vol. 85, (2003), p. 153. Published
Chris M. Giannella and Edward L. Robertson, "What's Been Lost in a Lossy Join Decomposition?", SIGMOD Record, vol. , (2002), p. . Submitted
Chris M. Giannella, "An Axiomatic Approach to Defining Approximation Measures for Functional Dependencies", Lecture Notes in Computer Science, vol. 2435, (2002), p. 37. Published
Edward Robertson and Catharine Wyss, "Optimal Tuple Merge is NP-complete", Information Processing Letters, vol. , (), p. . Submitted
Catharine Wyss and Edward Robertson, "Relational Interoperability", TODS, vol. , (), p. . Submitted
Chris Giannella, Mehmet Dalkilic, Dennis Groth, and Edward Robertson, "Using Horizontal-Vertical Decompositions to Improve Query Evaluation", Lecture Notes in Computer Science, vol. 2405, (2002), p. 26. Published
I. Gunduz, S. Zhao, M. Dalkilic and S. Kim, "Motif Discovery from Large Number of Sequences: A Case Study with Disease Resistance Genes in Arabidopsis thaliana", The 2003 International Multiconference in Computer Science and Computer Engineering, vol. , (2003), p. 0. Published
Bassem Sayrafi and Chris Giannella, "An Information Theoretic Histogram for Single Dimensional Selectivity Estimation", ICDE, vol. , (2003), p. . Submitted
Paul Purdom, Dirk Van Gucht and Dennis Groth, "Average Case Performance of the Apriori Algorithm", SIAM Journal on Computing, vol. , (2003), p. . Accepted
Bassem Sayrafi, Dirk Van Gucht and Marc Gyssens, "Measures in Databases and Data Mining", Information Sciences, vol. , (), p. . Submitted
Bassem Sayrafi and Dirk Van Gucht, "Applications of Tsallis entropies in databases", Information Sciences, vol. , (2003), p. . Submitted
Chris Giannella and Edward Robertson, "On an Information Theoretic Approximation Measure for Functional Dependencies", Information Systems, vol. , (), p. . Accepted
Richard Martin, Edward Robertson, and John Springer, "Architectural Principles for Enterprise Frameworks", EMMSAD '04, vol. , (), p. . Accepted
Zhiping Wang, Mehmet Dalkilic, and Sun Kim, "Guiding Motif Discovery by Iterative Pattern Refinement", 2004 ACM Symposium on Applied Computing Bioinformatics Track (Nicosia, Cyprus) , vol. , (2004), p. 162. Published
Irfan Gunduz, Sihui Zhao, Mehmet Dalkilic and Sun Kim, "Motif Discovery from Large Number of Sequences: A Case Study with Disease Resistance Genes in Arabidopsis thaliana", Medicine and Biological Sciences (METMBS'03: Las Vegas, Nevada), vol. , (2003), p. 29. Published
Catharine M. Wyss and Edward L. Robertson, "Relational Languages for Metadata Integration", ACM Trans. on Database Systems, vol. 29, (2005), p. 624. Published
George Fletcher, Catharine Wyss, Edward Robertson, and Dirk Van Gucht, "A Calculus for Data Mapping", Proc. Int'l Conf. on Database Interoperability, Elsevier Notes in Theoretical Computer Science, vol. , (), p. . Accepted
Catharine Wyss and Edward Robertson, "A Formal Characterization of PIVOT/UNPIVOT", Proc. ACM Conf. on Information and Knowledge Management, vol. , (), p. . Accepted
Edward Robertson and Richard Martin, "Views in the Enterprise Domain", Views, Aspects, and Roles 05, workshop at European Conf. on Object Oriented Programming, vol. n.a., (2005), p. electroni. Published
Edward L. Robertson, "Triadic Relations: an Algebra for the Semantic Web", Semantic Web and Databases, C. Bussler, V. Tannen, and I. Fundulaki (eds), Lecture Notes in Computer Science, vol. 3372, (2004), p. 91. Published
Dexters, N., Purdom, P.W. and Van Gucht, D, "A Probability Analysis of Candidate-Based Frequent Item Set Mining Algorithms", Proc. of the 21st ACM Symposium on Applied Computing Special Track on Data Mining, vol. , (), p. . Accepted
Sayrafi, B., Van Gucht, D., and Purdom, P.W., "On the Effectiveness and Efficiency of Computing Bounds on the Support of Item Sets in the Frequent Item Set Mining Problem", Proc. First International Workshop on Open Source Data Mining, vol. n.a., (2005), p. 46. Published
Sayrafi, B. and Van Gucht, D., "Differential Constraints", Proc. of the 24th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, vol. 24, (2005), p. 348. Published
Bercovici, H. and Van Gucht, D., "An Inequality for Mixed L^p-Norms", Mathematical Inequalities & Applications, vol. 8, (2005), p. 1223. Published
Van den Bussche, J., Van Gucht, D., and Vansummeren, S., "Well-definedness and semantic type-checking in the nested relational calculus and XQuery", Proc. of the 2005 International Conference on Database Theory (ICDT '05), vol. n.a., (2005), p. 99. Published
Dexters, N., Purdom, P.W., Van Gucht, D., "A Probability Analysis of Candidate-Based Frequent Item Set Mining Algorithms", SIAM J. on Computing, vol. , (), p. . Submitted
Edward Robertson, "Explicitly Modeling Metadata", CACM, vol. , (), p. . Submitted
L. Do Hoon, J.H. Choi, M.M. Dalkilic, S. Kim, "COMPAM : Visualization of Combining Pairwise Alignment", Bioinformatics, vol. , (), p. . Accepted
H. Kim, A. Sengupta, M. Fox, M.M. Dalkilic, "A Measurement Ontology Generalizable for Emerging Domain Applications on the Semantic Web", Journal of Database Management, vol. , (), p. . Accepted

Book(s) or other one-time publication(s):
Mehmet Dalkilic, "PFA: The Protein Family Annotator" , bibl. Boston, MA, (2003). workshop Published
of Collection: , "Executive IT Life Sciences Forum"
Richard Martin and Edward Robertson, "A Comparison of Frameworks for Enterprise Architecture Modeling" , bibl. ISBN: 3-540-20299-4, (2003). Book Published
of Collection: Song, I.-Y.; Liddle, S.W.; Ling, T.W.; Scheuermann, P., "Conceptual Modeling -- ER 2003"

Other Specific Products:


Internet Dissemination:

http://www.cs.indiana.edu/database/InD/index.html

Contributions to Other Disciplines:

Mehmet Dalkilic's work with the Center for Genomics and
Bioinformatics has resulted in the development of the
Protein Family Annotator, as well as other bioinformatics results.


Categories for which nothing is reported:
Products: Other Specific Product
Contributions Within Discipline
Contributions to Education and Human Resources
Contributions to Resources for Research and Education
Contributions Beyond Science and Engineering

