Toward the Union of Databases and Document Management: The Design of DocBase

Arijit Sengupta
Department of Computer Science
Lindley Hall 215
Bloomington, IN 47405
+1 (812) 855-3703 (voice)
+1 (812) 855-4829 (fax)
asengupt@cs.indiana.edu (email)

Abstract

With the advent of the World Wide Web (WWW) and the increased use of electronic documents in almost all aspects of computing, the problems of managing electronic documents and systematically retrieving information from them have become highly pertinent. Information retrieval (IR) techniques allow us to retrieve documents based on keywords, but these searches are often not powerful enough to extract the most relevant information accurately. Most IR systems are designed to broaden the scope of a search. However, it is often desirable to extract only the documents most closely related to the search, so that the result set stays small. We propose a method for achieving very powerful searches on tagged documents by using the structural information in the tags as meta-data in queries. We adopt SGML as our tagging format. Since HTML is an application of SGML, and XML is designed as a subset of the SGML standard, our work is immediately applicable to the current and future incarnations of the WWW.

In this paper, we give an overview of the methodologies (such as query languages, query interfaces, and query processing techniques) used in the design of DocBase, our prototype proof-of-concept document database system. DocBase is a modular system capable of performing SQL-like queries on native SGML documents using pluggable indexing and storage-management applications. Because of the generalized nature of SGML, and the ever-increasing use of structured documents in the corporate world, we argue that systems like this will be indispensable in the forthcoming century.
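
To make the contrast between keyword search and structure-aware search concrete, the following is a minimal sketch and not DocBase's implementation or query language. It assumes XML input (a subset of SGML) parsed with Python's standard `xml.etree.ElementTree`. The sample documents and the function names `keyword_search` and `structural_search` are hypothetical, introduced only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents. XML is used here because it is a subset
# of SGML and can be parsed with the Python standard library.
DOCS = {
    "doc1": (
        "<article><title>Query languages for documents</title>"
        "<author>Smith</author>"
        "<body>A survey by Jones is discussed.</body></article>"
    ),
    "doc2": (
        "<article><title>Indexing text</title>"
        "<author>Jones</author>"
        "<body>Index structures for text.</body></article>"
    ),
}


def keyword_search(docs, word):
    """Return ids of documents that contain `word` anywhere in their text."""
    hits = []
    for doc_id, source in docs.items():
        text = " ".join(ET.fromstring(source).itertext())
        if word.lower() in text.lower():
            hits.append(doc_id)
    return hits


def structural_search(docs, tag, word):
    """Return ids of documents where `word` occurs inside an element named `tag`."""
    hits = []
    for doc_id, source in docs.items():
        root = ET.fromstring(source)
        for element in root.iter(tag):
            if word.lower() in "".join(element.itertext()).lower():
                hits.append(doc_id)
                break  # one matching element is enough for this document
    return hits


if __name__ == "__main__":
    # Keyword search matches every document that mentions the word.
    print(keyword_search(DOCS, "Jones"))
    # Structural search keeps only documents whose <author> is Jones.
    print(structural_search(DOCS, "author", "Jones"))
```

In this toy collection the keyword search returns both documents, because one merely mentions Jones in its body, while the structural search returns only the document authored by Jones. This is the kind of narrowing that using tags as meta-data is meant to provide.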