DocBase - A Database Environment for Structured Documents

Arijit Sengupta

Abstract

Standard Generalized Markup Language (SGML) has been widely accepted as a standard for universal document representation. The strength of SGML lies in the fact that it embeds logical structural information in documents while preserving a human-readable form. Although the original purpose of this embedded information was to facilitate interchange and layout, the structural information in SGML documents opens the door to processing these documents using database techniques. SGML facilitates this goal in two ways. First, it provides a conceptual modeling tool for collections of documents in the form of the document type definition (DTD). The DTD serves as a schema to which document instances must conform. Second, the logical structure of the documents allows query processing beyond the classic keyword-based searches of traditional information retrieval (IR) systems. For instance, instead of searching for specific keywords in the whole document, one can search for keywords in the footnotes of specific sections.

This dissertation uses these observations about SGML as the design principles for developing and implementing a structured document database system. The key difference between our approach and similar approaches is that the design and implementation remain entirely within the SGML framework. We achieve this by using SGML as the modeling tool for the database instances, generating SGML documents as the outputs of queries, and using SGML to express the queries themselves. DocBase is a prototype research system that implements most of the concepts presented here. DocBase uses a three-level design (external, conceptual, physical) similar to that of standard database systems, although the current implementation emphasizes the conceptual and physical components of the system.
At the conceptual level, we use SGML itself as the model for structured document databases, with the database schema represented by a DTD and SGML documents as instances of this schema. We also use an extended form of relational calculus to handle complex structure and path expressions, together with equivalent SQL-like and visual query languages for posing queries. At the physical level, the data is stored in secondary storage controlled by a storage manager, with index structures built on top of the documents. We use access methods over these data structures to implement the queries efficiently.

Recognizing the importance of users in the design of document retrieval systems, we propose a visual query formulation method that uses the principle of familiarity to make the querying process easier and more satisfying for users. We show that even at the simplest level, this method is no less efficient or accurate than traditional form-based query formulation, but is significantly more satisfying. In addition, we show that this method is equivalent to the other query languages proposed in this dissertation.
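
To make the idea of structure-aware querying concrete, the following is a minimal sketch of the example mentioned above: finding a keyword in the footnotes of sections, using an index built over the documents and keyed by element path. It is an illustration only, not DocBase's implementation. It assumes XML as a stand-in for SGML (Python's standard library has no SGML parser), and the element names (`article`, `section`, `para`, `footnote`) and the functions `build_path_index` and `sections_with_word` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical document. XML stands in for SGML here, and the element
# names are illustrative assumptions, not taken from DocBase.
DOC = """
<article>
  <section id="s1">
    <title>Storage</title>
    <para>Documents live on disk.<footnote>See the storage manager.</footnote></para>
  </section>
  <section id="s2">
    <title>Queries</title>
    <para>The storage layer is hidden from users.<footnote>Queries use path expressions.</footnote></para>
  </section>
</article>
"""


def build_path_index(root):
    """Map (element path, lowercase word) -> set of enclosing section ids."""
    index = defaultdict(set)

    def visit(elem, path, section_id):
        path = path + "/" + elem.tag
        if elem.tag == "section":
            section_id = elem.get("id")
        # Index only the text directly inside this element, so a word in a
        # footnote is not also credited to its parent paragraph. Text that
        # follows a child element (elem.tail) is ignored for brevity.
        for word in re.findall(r"\w+", (elem.text or "").lower()):
            index[(path, word)].add(section_id)
        for child in elem:
            visit(child, path, section_id)

    visit(root, "", None)
    return index


def sections_with_word(index, path, word):
    """Return the sorted ids of sections containing `word` at `path`."""
    return sorted(index.get((path, word.lower()), set()))


index = build_path_index(ET.fromstring(DOC))

# Structural query: "storage" inside footnotes only.
print(sections_with_word(index, "/article/section/para/footnote", "storage"))
# The same keyword in paragraph text selects a different section.
print(sections_with_word(index, "/article/section/para", "storage"))
```

A plain keyword search for "storage" would match both sections. Restricting the lookup to the footnote path returns only `s1`, while the paragraph path returns only `s2`. This is the kind of distinction that the logical structure of the documents makes possible.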