A Datatype Macro for Scheme

Jonathan Sobel and erik hilsdale
Computer Science Department
Indiana University
Bloomington, Indiana 47405
http://www.cs.indiana.edu/~jsobel/
http://www.cs.indiana.edu/~ehilsdal/

Description

[*]

This report describes a new form for Scheme that resembles ML's datatype form. This section explains the rationale behind the form's design and demonstrates its use in small examples. The second section explains how to implement the datatype form as a Scheme macro.

The Status Quo

[*]

Scheme does not have a standard, built-in way to define new types of objects. In fact, the notion of type is only loosely defined in the Revised^5 Report [cite r5rs]. It says:

No object satisfies more than one of the following predicates:
        boolean?          pair?
        symbol?           number?
        char?             string?
        vector?           port?
        procedure?
These predicates define the types boolean, pair, symbol, number, char (or character), string, vector, port, and procedure. The empty list is a special object of its own type; it satisfies none of the above predicates.
In Scheme, then, a type is simply the description of those objects that satisfy some particular type predicate (and no other type predicates).

The set of predicates/types listed above is closed in pure Scheme, in the sense that no standard Scheme procedures (or special forms) are specified to produce objects whose types are outside this set. In many Scheme implementations, however, it is possible to define new types of a limited form, usually record types. For example, some implementations allow definitions similar to:

  (define-record tree-node (datum left right))
Such definitions typically supply a new type-predicate (tree-node?), as well as a constructor and a set of accessors (and sometimes mutators) for the new record type. This minimal approach to type definitions is, in some ways, appropriate to the ``Ockham's Razor'' design of Scheme. On the other hand, it does not provide for one of the most common idioms in type definitions: the sum-of-products form. For example, we were able to describe the type of tree nodes in the preceding example, but there is no way to descibe the type of trees, other than by placing the definition of an empty-tree type adjacent to the definition of tree-node and adding a comment in the code, claiming that the two types ``go together'' somehow.

The datatype Form

[*]

To overcome the deficiencies of the status quo, we turn to ML for inspiration [cite ML-manual]. The datatype form is the most commonly used means of defining new types in ML; we adapt it here for use in Scheme. The define-datatype form does allow us to define a Tree type: [In many languages, it is standard---or even necessary---to name types with capitalized identifiers. Although Scheme is not case-sensitive, we stick to tradition here and use Tree, instead of tree.]

<Tree datatype (preliminary)>=
(define-datatype Tree
  (empty-tree)
  (node datum left right))

Now it is clear in the code that the Tree type consists of two variants: the empty-tree and the node. The define-datatype form defines a predicate Tree?, in keeping with the Scheme notion of types. It also defines the constructors make-empty-tree and make-node, used in expressions like (make-node 5 t1 t2).

In order to distinguish between the variants and extract the values of their fields (just the fields of node, in this example), we use the type-case form.

Suppose we have a tree of numbers, like this:

<tree of numbers>=
(make-node 2
          (make-node 1 (make-empty-tree) (make-empty-tree))
          (make-node 3 (make-empty-tree) (make-empty-tree)))

If we want to sum all the numbers in the tree, we can write a procedure Tree-sum, which uses type-case:

<sum the contents of a tree>=
(define Tree-sum
  (lambda (t)
    (type-case Tree t
      ((empty-tree) 0)
      ((node d l r) (+ d 
                       (Tree-sum l)
                       (Tree-sum r))))))

There is no loss of generality in having only the define-datatype form and eliminating the define-record form, because a datatype with only one variant is just a record type:

<student dataype>=
(define-datatype Student (student name major GPA))

More Type Information

[*]

The define-datatype form, as described in the preceding section, does not adequately capture all of the type information that we might want to include with the Tree datatype. For instance, we might wish to assert that it is nonsensical to construct a node with a boolean or a port for its left subtree; subtrees should always be Trees. To enable this constraint to be expressed in the define-datatype form, we turn again to Scheme's correlation of predicates and types:

<tree datatype>=
(define-datatype Tree
  (empty-tree)
  (node datum (left Tree?) (right Tree?)))

In this new definition of the Tree type, we have extended the description of the left and right fields to include predicates, which constrain the values stored in those fields. If we wanted to make sure that Trees can only contain numbers, we could replace datum with (datum number?). Fields without predicates get the default, always-true predicate: (lambda (x) #t). Now, if someone attempts to construct a node whose subtrees are not Trees, an error will be signaled.

Recycling

[*]

At times, it can be useful to recognize that some record is no longer being used and to ``recycle'' it as another record. This is precisely the technique used in a paper that describes stack-less data-structure construction [cite sobel-friedman:98]. To support this application, we add recycling constructors, which reuse existing records, instead of allocating new ones. Recycling constructors perform a kind of ``coercion-with-update'' on records.

As an example, suppose we define a record-type that contains a Tree, it's height, and a flag saying whether the tree is balanced or not:

<tree information datatype>=
(define-datatype Tree-Info
  (tree-info (height integer?) (tree Tree?) (balanced boolean?)))

If we have constructed a tree-info with

<make a tree-info>=
(define ti (make-tree-info 3 some-tree #t))

and we are sure that the tree-info record will never be needed again, [We can be sure that a record is no longer useful in several garbage-collection-style ways [cite morrisett-etc:95], via linear typing, are simply by design [cite sobel-friedman:98].] then we can use its memory for a node record:

<recycle a tree-info as a node>=
(recycle-as-node ti [right (make-empty-tree)])

This constructs a node with 3 as its datum, some-tree as its left subtree, and an empty-tree as its right subtree.

Implementation

[*]

The following code uses Chez Scheme's syntax-case macro expansion system. At macro-expansion time, Chez Scheme treats code not as lists, but as syntax objects. The syntax-case form is used to match and dissect syntax objects, and the syntax form is used to construct new syntax objects. Defining a macro really means defining a transformer of syntax-objects.

The define-datatype form needs to expand into a sequence of definitions for four different kinds of operators: variant constructors, ``recyclers,'' a predicate, and something that performs the ``case'' action. Accessors for individual fields in a variant do not need to be externally visible, nor do predicates for variants, since the type-case form can do the same work. Thus, we would like the Tree datatype definition to expand into something like this:

<expansion of Tree datatype>=
(begin
  (begin
    (define make-empty-tree
      (lambda ()
        <construct an empty-tree>))
    (define-syntax recycle-as-empty-tree
      <syntax transformer for recycle-as-empty-tree>))
  (begin
    (define make-node
      (lambda (datum left right)
        (if (and (Tree? left) (Tree? right))
            <construct a node>
            (error 'node "invalid argument types"))))
    (define-syntax recycle-as-node
      <syntax transformer for recycle-as-node>))
  (define Tree?
    (lambda (x)
      <test x for Tree-ness>))
  (define-syntax Tree
    <the type-case handler for Trees>))

And we would like the type-case form, when working with a Tree, to make Tree's handler (which we call Tree) do the work. That is,

<an example of type-case for Tree>=
(type-case Tree t
  ((empty-tree) 0)
  ((node d l r) (+ d 
                   (Tree-sum l)
                   (Tree-sum r))))

should expand into

<expansion of type-case for Tree>=
(Tree t
  ((empty-tree) 0)
  ((node d l r) (+ d 
                   (Tree-sum l)
                   (Tree-sum r))))

Thus the type-case macro will be

<type-case macro>=
(define-syntax type-case
  (lambda (x)
    (syntax-case x ()
      ((_ ?type-name ?exp ?clause ...)
       (syntax (?type-name ?exp ?clause ...))))))

and the basic form of the define-datatype macro will be as follows:

<define-datatype macro>=
(define-syntax define-datatype
  (letrec (<helper procedures for macro>)
    (lambda (x)
      (syntax-case x ()
        [(_ ?name (?variant ?field ...) ...)
         (and (identifier? (syntax ?name))
              (andmap identifier? (syntax (?variant ...))))
         (with-syntax ([(?constructor ...)
                        <construct names for variant constructors>]
                       [(?recycler ...)
                        <construct names for recyclers>]
                       [?datatype-predicate
                        <construct name for datatype predicate>]
                       [(?field-count ...)
                        <compute number of fields in each variant>]
                       [(((?field-name ?predicate) ...) ...)
                        <normalize field descriptions>])
           (syntax
            (begin
              <definitions for each variant> ...
              <definition of datatype predicate>
              <definition of type-case handler>)))]))))

It will be useful in more than one place to have syntax objects representing the number of fields in each variant, so we will compute and bind these in advance. The expression that immediately follows the pattern in syntax-case acts as a ``fender.'' It provides additional contraints, which must be met in order for that case to count as a match. For our purposes, we want to claim that the define-datatype form is only valid if the name of the type and the names of the variants are all legal identifiers.

Earlier, we quoted from the passage in the Revised^5 Report that required the predicates for the various Scheme types to be disjoint. Unless we have the power to change the underlying representation of Scheme objects (and the garbage collector, and so on), we will not be able to satisfy this requirement with the new types generated by a macro. The best we can hope for is that the new types be disjoint from all but one built-in type and that all new types be disjoint from each other. The implementation described here satisfies these constraints.

We use Scheme vectors to represent records. The first value in the vector identifies the datatype, the second value identifies the variant, and the rest of the values are the field values. For instance, continuing the Tree example above, we would finish the constructors with

<construct an empty-tree>=
(vector 'Tree 'empty-tree)

and

<construct a node>=
(vector 'Tree 'node datum left right)

Creating identifiers to act as names for the predicate and assorted constructors is a repetitive task, so some helper procedures are useful. The challenge is that the names will be constructed from pieces of different types: parts of the names will come in as syntax objects; we will provide other parts as strings. Coercing all the parts to strings simplifies matters. Thus, one binding pair in the letrec is

<helper procedures for macro>=
[as-string (lambda (x)
             (cond
              [(identifier? x) (as-string (syntax-object->datum x))]
              [(symbol? x) (symbol->string x)]
              [(string? x) x]
              [else (error 'define-datatype
                           "cannot construct an identifier using ~s"
                           x)]))]

If the argument is an identifier, syntax-object->datum gives us the name of the identifier as a Scheme symbol.

Constructing an identifier involves converting all the given parts to strings, concatenating the strings, and turning the result into a syntax object that represents an identifier. The datum->syntax-object procedure produces a syntax object for a Scheme datum. Since syntax objects encapsulate lexical information, an additional argument is required to specify the context from which the object should pretend to have come. This argument is typically some identifier from which datum->syntax-object can borrow a context. Here, we will use the name of the datatype for this purpose.

<helper procedures for macro>+=
[construct-id (lambda (context . parts)
                (datum->syntax-object context
                                      (string->symbol
                                       (apply string-append
                                              (map as-string parts)))))]

Now we have the tools at our disposal to start building names. The simplest is the name of the predicate for the whole datatype, as it is based on the name of the datatype:

<construct name for datatype predicate>=
(construct-id (syntax ?name) (syntax ?name) "?")

The names for the variant constructors and recyclers will take a little more work, because a whole set of names must be produced, two names for each variant.

<construct names for variant constructors>=
(map (lambda (variant)
       (construct-id (syntax ?name) "make-" variant))
     (syntax (?variant ...)))

<construct names for recyclers>=
(map (lambda (variant)
       (construct-id (syntax ?name) "recycle-as-" variant))
     (syntax (?variant ...)))

Normalizing the field descriptions requires us to make sure that every field has a predicate. Any fields that do not have predicates get the always-true predicate by default:

<normalize field descriptions>=
(map (lambda (fields)
       (map (lambda (field)
              (syntax-case field ()
               [(?field-name ?predicate)
                (identifier? (syntax ?field-name))
                field]
               [?field-name
                (identifier? (syntax ?field-name))
                (syntax (?field-name (lambda (x) #t)))]))
            fields))
     (syntax ((?field ...) ...)))

The preamble to the macro is now complete, except for computing the field counts for each variant. These counts need to be converted to syntax objects in order to be used in the output of the macro.

<compute number of fields in each variant>=
(map (lambda (fields)
       (datum->syntax-object (syntax ?name) (length fields)))
     (syntax ((?field ...) ...)))

The entire preamble to the macro is complete now. With so much already taken care of, the output definitions are fairly straightforward. For each variant, one constructor and one recycler are defined, using the names we just produced for them.

<definitions for each variant>=
(begin
  (define ?constructor
    (lambda (?field-name ...)
      (if (and (?predicate ?field-name) ...)
          (vector '?name '?variant ?field-name ...)
          (error '?constructor "invalid argument types"))))
  <definition of recycler form>)

The use of ``...'' in the output of the main macro body causes these constructor definitions to be replicated for each variant, using the appropriate substitutions for the pattern variables. We have put off the definitions of the recyclers until later, since they require the macro to define a new macro.

The predicate for the whole datatype is also simple: an object is an instance of the datatype if it is a vector whose first value is the name of the datatype. For example, we can complete the definition of Tree? with:

<test for Tree>=
(and (vector? x)
     (>= (vector-length x) 2)
     (eqv? (vector-ref x 0) 'Tree))

Generalizing from this example, we get:

<definition of datatype predicate>=
(define ?datatype-predicate
  (lambda (x)
    (and (vector? x)
         (>= (vector-length x) 2)
         (eqv? (vector-ref x 0) '?name))))

The only outstanding definitions are those of the recyclers and the case handler. Unlike the other definitions, these provide new syntax, not just new procedures. Thus, the define-datatype macro produces other macro definitions in its output. It is easy to get confused trying to follow the code in this kind of situation. To clarify the code, we use double question marks and double underscores for pattern variables that appear in the output. Pattern variables with single question marks are filled in during the expansion of define-datatype and become part of the definitions of the output macros. The macro system itself requires the use of ``(... ...)'' to produce a single ``...'' in the output.

During the expansion of a use of a recycler, the field names in the update pairs need to be recognized and converted to vector indices. This requires the field names to be available as the macro is expanded. Thus, we bind a list of the field names and close over it with the syntax transformer. For instance, the transformer procedure for recycle-as-node is:

<syntax transformer for recycle-as-node>=
(let ([fields (list 'datum 'left 'right)])
  (letrec (<helper procedures for recycler>)
    (lambda (x)
      (syntax-case x ()
       [(__ ??old [??delta-field ??delta-value] (... ...))
        (with-syntax ([(??delta-index (... ...))
                       <compute delta indices>])
          (syntax
           (let ([new ??old])
             (if (and (vector? new)
                      (= (vector-length new) 5))
                 (begin
                   (vector-set! new 0 'Tree)
                   (vector-set! new 1 'node)
                   (vector-set! new ??delta-index ??delta-value)
                   (... ...)
                   new)
                 (error 'recycle-as-node
                        "~s cannot be recycled as a ~a"
                        new
                        'node)))))]))))

There are still a couple of parts missing from the preceding code, which we will fill in later. For now, let us consider the desired expansion of the following:

<use of recycle-as-node>=
(recycle-as-node (f x) [right (empty-tree)])

The expression (f x) should evaluate to a record that is the same size as a node record. The entire recycle-as-node expression should expand to:

<expansion of recycle-as-node>=
(let ([new (f x)])
  (if (and (vector? new)
           (= (vector-length new) 5))
      (begin
        (vector-set! new 0 'Tree)
        (vector-set! new 1 'node)
        (vector-set! new 4 (empty-tree))
        new)
      (error 'recycle-as-node
             "~s cannot be recycled as a ~a"
             new
             'node)))

More generally, a recycler macro definition should be like this:

<definition of recycler form>=
(define-syntax ?recycler
  (let ([fields (list '?field-name ...)])
    (letrec (<helper procedures for recycler>)
      (lambda (x)
        (syntax-case x ()
         [(__ ??old [??delta-field ??delta-value] (... ...))
          (with-syntax ([(??delta-index (... ...))
                         <compute delta indices>])
            (syntax
             (let ([new ??old])
               (if (and (vector? new)
                        (= (vector-length new) (+ ?field-count 2)))
                   (begin
                     (vector-set! new 0 '?name)
                     (vector-set! new 1 '?variant)
                     (vector-set! new ??delta-index ??delta-value)
                     (... ...)
                     new)
                   (error '?recycler
                          "~s cannot be recycled as a ~a"
                          new
                          '?variant)))))])))))

The recycler ``re-tags'' the record with the appropriate type and variant information; then, it replaces the values in any fields for which update values have been specified. The indices of those fields can be found by looking up where they appear in fields. After computing each index, it must be converted into a syntax object, so that it can be dropped into the output code.

<compute delta indices>=
(map (lambda (field)
       (datum->syntax-object (syntax __)
         (lookup-index (syntax-object->datum field) fields 2)))
     (syntax (??delta-field (... ...))))

The lookup-index procedure can simply count up to the location where it finds the requested field, starting at the value passed in (i.e., 2).

<helper procedures for recycler>=
[lookup-index (lambda (target source start)
                (cond
                 [(null? source)
                  (error '?recycler
                         "unrecognized field name: ~a"
                         source)]
                 [(eqv? target (car source))
                  start]
                 [else
                  (lookup-index target (cdr source) (+ start 1))]))]

Now we move on to the definition of the case handler. During the expansion of a use of the case handler, the variant names in each clause need to be checked against those defined in the datatype. If any do not match, an error is signaled. Also, if the number of ``pattern'' parameters provided in the variant clause does not match the number for fields in the variant, an error is signaled. Thus, the case handler needs access to the variant names and record lengths at time of expansion of a use of the case handler. Getting these into the macro requires the creation of a table, bound locally around the case handler macro transformer.

For each variant clause, if the input value is of that variant type, the ``pattern'' parameters provided for the variant will be bound to the corresponding values in the record. Of course, the expression passed into the case macro should only be evaluated once, not once for each variant, so the macro binds the value of the expression to a local variable.

<definition of type-case handler>=
(define-syntax ?name
  (letrec (<case macro helpers>)
    (let ([variant-table (list (cons '?variant '?field-count) ...)])
      (lambda (x)
        (syntax-case x ()
         [(__ ??exp ??clause0 ??clause1 (... ...))
          (with-syntax
              ([??body <case macro body>])
            (syntax
             (let ([case-val ??exp])
               (if (?datatype-predicate case-val)
                   (let ([variant-tag (vector-ref case-val 1)])
                     ??body)
                   (error 'type-case
                          "~s is not a ~a"
                          case-val
                          '?name)))))])))))

The body of the case handler macro will expand each variant clause into an if that tests whether case-val is of that variant type. If it is, the parameters are bound and the rest of the clause is executed. If not, evaluation proceeds to the next clause. The else clause can only appear last. Continuing our Tree example, consider the following use of type-case:

<use of Type-case>=
(type-case Tree (f x)
  [(node d l r) node-result]
  [else other-result])

This should expand into

<use of the Tree case handler>=
(Tree (f x)
  [(node d l r) node-result]
  [else other-result])

which should expand into something like this:

<expansion of Tree case handler>=
(let ([case-val (f x)])
  (if (Tree? case-val)
      (let ([variant-tag (vector-ref case-val 1)])
        (if (eqv? variant-tag 'node)
            (let ([d (vector-ref case-val 2)]
                  [l (vector-ref case-val 3)]
                  [r (vector-ref case-val 4)])
              node-result)
            other-result))
      (error 'type-case "~s is not a ~a" case-val 'Tree)))

Generating the sequence of tests is very similar to what must be done in the standard Scheme case macro. The macro is somewhat long (due mostly to handling the else clause and requiring it to appear last), but it is straightforward.

<case macro body>=
(let caseloop ([clause0 (syntax ??clause0)]
               [others (syntax (??clause1 (... ...)))])
  (cond
   [(null? others)
    (syntax-case clause0 (else)
     [(else ??body (... ...))
      (syntax (begin ??body (... ...)))]
     [((??variant ??var (... ...)) ??body (... ...))
      (with-syntax ([(??index (... ...))
                     <construct indices for fields>])
        (syntax
         (if (eqv? variant-tag '??variant)
             (let ([??var (vector-ref case-val ??index)] (... ...))
               ??body (... ...))
             (error 'type-case "no matching clause for ~s" case-val))))])]
   [else
    (syntax-case clause0 (else)
     [((??variant ??var (... ...)) ??body (... ...))
      (with-syntax ([??rest
                     (caseloop (car others) (cdr others))]
                    [(??index (... ...))
                     <construct indices for fields>])
        (syntax
         (if (eqv? (vector-ref case-val 1) '??variant)
             (let ([??var (vector-ref case-val ??index)] (... ...))
               ??body (... ...))
             ??rest)))])]))

The missing piece in this long macro is the construction of the indices at which to find the field values in the record. These are simply (the syntax objects representing) the numbers 2, 3, ..., where the number of indices equals the number of fields.

<construct indices for fields>=
(build-indices (syntax ?name)
               (syntax-object->datum (syntax ??variant))
               (syntax (??var (... ...)))
               variant-table)

Of course, build-indices is another simple recursive procedure. Like the procedure above that produced record lengths, this one must convert the Scheme numbers into syntax objects:

<case macro helpers>=
[build-indices (lambda (context variant-name vars table)
                 (let ([entry (assoc variant-name table)])
                   (if entry
                       (let ([field-count (cdr entry)])
                         (if (= field-count (length vars))
                             (make-range context 2 (+ field-count 2))
                             (error 'type-case
                                    "incorrect number of fields in ~a"
                                    variant-name)))
                       (error 'type-case
                              "~a is not a variant of ~a"
                              variant-name
                              (syntax-object->datum (syntax ?name))))))]
[make-range (lambda (context start end)
              (cond
               [(>= start end) '()]
               [else (cons (datum->syntax-object context start)
                           (make-range context (+ start 1) end))]))]

This is the only helper procedure we need for the case handler, and it completes the definition of the define-datatype macro.

References

[1] Revised^5 report on the algorithmic language Scheme. In progress.

[2] Robin Milner, Mads Tofte, and Robert Harper. The Definition of Standard ML. MIT Press, 1990.

[3] Greg Morrisett, Matthias Felleisen, and Robert Harper. Abstract models of memory management. In FPCA '95: 7^th International Conference on Functional Programming Languages and Computer Architecture, pages 66--77, La Jolla, June 1995. ACM Press.

[4] Jonathan Sobel and Daniel P. Friedman. Recycling continuations. In International Conference on Functional Programming '98, Baltimore, September 1998. To appear.

<datatype.ss>=
;;;----------------------------------------------------------------------
;;; A datatype macro for Scheme:

<define-datatype macro>
<type-case macro>