DocBase is the successor of SGMLQuery. It contains all the features of SGMLQuery and adds SQL support.

DocBase - A Document Database System

DocBase started as a research project in the Department of Computer Science at Indiana University, Bloomington. This research was part of my dissertation, under the guidance of Prof. Dirk Van Gucht, Prof. Edward Robertson, Prof. Andrew Dillon and Prof. David Leake.

In its current implementation, DocBase acts primarily as a query processing system for structured documents. DocBase currently supports SGML (Standard Generalized Markup Language, ISO 8879) and XML with DTDs. Support for DTD-less XML is planned for a future release.

Development team

DocBase never really had a big development team. After I started the initial implementation in January 1996, I had a few students work on parts of the code and make very strong contributions. I am very thankful to these students for their help with the project.

Development history

  1. Fall 1995: First conception of the system, as a means of giving the Bloomington community access to the Chadwyck-Healey English Poetry Database.
  2. December 1995: Implementation of the form interface for access to the poetry database, and of the backend processing system.
  3. January 1996: Start of implementation of the Java query interface as an alternative to form-based querying.
  4. May 1996: First version of SGMLQuery finished, and usability analysis performed on the system.
  5. December 1996: Formalization of QBT based on the SGMLQuery idea, and generalization of the interface into a visual query language.
  6. December 1996: Formalization of the query languages and processing ideas for DocBase, based on the processing engine for SGMLQuery.
  7. August 1997: Initial implementation of the DocBase engine completed; testing and performance measurements done.
  8. December 1997: Dissertation completed, and the thesis on DocBase published.
  9. March 1998 - now: Ongoing work on improving the implementation and developing a public release with support for a free storage manager and indexing system.

Publications related to DocBase

Arijit Sengupta. "The compleat closure: toward a unified view of structured document database objects." Accepted for publication at the Fifth International Conference on Information Systems Analysis and Synthesis (ISAS '99), Orlando, Florida, July 31-August 4, 1999.

Arijit Sengupta. "Toward the union of databases and document management: The design of DocBase." Accepted for publication in Proceedings: Conference on Management of Data (COMAD'98), Hyderabad, India, December 17-19 1998. Available in postscript [548K]. (Text Abstract)

Arijit Sengupta. "DocBase - A Database Environment for Structured Documents." Ph.D. Thesis. Indiana University, Bloomington. December 1997. Available as gzipped postscript [600K]. (Text Abstract)

Arijit Sengupta and Andrew Dillon. "Extending SGML to Accommodate Database Functions: A Methodological Overview." Journal of the American Society for Information Science (JASIS), special issue on structured information/standards for document architectures, pages 629-637, July 1997. Available in postscript [744K]. (Text Abstract)

Arijit Sengupta and Andrew Dillon. "Query By Templates: A Generalized Approach for Visual Query Formulation for Text Dominated Databases." In Proceedings: Conference on Advanced Digital Libraries (ADL'97), Library of Congress, Washington, D.C., pages 36-47, May 7-9 1997. Available in postscript [776K]. (Text Abstract)

Arijit Sengupta. "Standardizing the Querying Process with SGML: The SQL DTD." In Tommie Usdin and Debbie Lapeyre, editors, Proceedings of the SGML'96 Conference. Graphic Communications Association, pages 323-337, November, 1996. (Presented at conference.) Available in SGML, also available in postscript [800K] (Text Abstract)

Arijit Sengupta. "Demand More from Your SGML Database! Bringing SQL Under the SGML Limelight." <TAG>, 9(4), pages 1-7, April 1996. Available in postscript [352K]. (Text Abstract)


Currently, there is one completed system demonstration available with DocBase, for the Chadwyck-Healey English Poetry Database. Because of copyright restrictions, the actual poems are not available for access outside Indiana University. To access the demonstration, please refer to the QBT page.

More details

The source code of DocBase and QBT is not currently publicly available. However, after the initial release is completed (projected date: August 1999), we will make part or all of the system available for download under the GNU General Public License. If you would like to be notified when a release is available, please contact me at

Copyright (C) 1998 by Arijit Sengupta

All rights reserved

Last modified: Mon May 24 22:48:04 EST 1999