This report describes a new form for Scheme that resembles ML's
datatype form. This section explains the rationale behind the
form's design and demonstrates its use in small examples. The second
section explains how to implement the datatype form as a Scheme
macro.
Scheme does not have a standard, built-in way to define new types of objects. In fact, the notion of type is only loosely defined in the Revised^5 Report [cite r5rs]. It says:
No object satisfies more than one of the following predicates:
boolean? pair?
symbol? number?
char? string?
vector? port?
procedure?
These predicates define the types boolean, pair,
symbol, number, char (or character),
string, vector, port, and procedure. The
empty list is a special object of its own type; it satisfies none of
the above predicates.
In Scheme, then, a type is simply the description of those objects
that satisfy some particular type predicate (and no other type
predicates).
The set of predicates/types listed above is closed in pure Scheme, in the sense that no standard Scheme procedures (or special forms) are specified to produce objects whose types are outside this set. In many Scheme implementations, however, it is possible to define new types of a limited form, usually record types. For example, some implementations allow definitions similar to:
(define-record tree-node (datum left right))Such definitions typically supply a new type-predicate (
tree-node?), as well as a constructor and a set of accessors (and
sometimes mutators) for the new record type. This minimal approach to
type definitions is, in some ways, appropriate to the ``Ockham's
Razor'' design of Scheme. On the other hand, it does not provide for
one of the most common idioms in type definitions: the sum-of-products
form. For example, we were able to describe the type of tree nodes in
the preceding example, but there is no way to descibe the type of
trees, other than by placing the definition of an empty-tree type
adjacent to the definition of tree-node and adding a comment in
the code, claiming that the two types ``go together'' somehow.
datatype Form
To overcome the deficiencies of the status quo, we turn to ML
for inspiration [cite ML-manual]. The datatype form is the most
commonly used means of defining new types in ML; we adapt it here for
use in Scheme. The define-datatype form does allow us to define a
Tree type: [In many languages, it is standard---or even
necessary---to name types with capitalized identifiers. Although
Scheme is not case-sensitive, we stick to tradition here and use
Tree, instead of tree.]
<Tree datatype (preliminary)>= (define-datatype Tree (empty-tree) (node datum left right))
Now it is clear in the code that the Tree type consists of two
variants: the empty-tree and the node. The
define-datatype form defines a predicate Tree?, in keeping
with the Scheme notion of types. It also defines the constructors
make-empty-tree and make-node, used in expressions like
(make-node 5 t1 t2).
In order to distinguish between the variants and extract the
values of their fields (just the fields of node, in this example),
we use the type-case form.
Suppose we have a tree of numbers, like this:
<tree of numbers>=
(make-node 2
(make-node 1 (make-empty-tree) (make-empty-tree))
(make-node 3 (make-empty-tree) (make-empty-tree)))
If we want to sum all the numbers in the tree, we can write a
procedure Tree-sum, which uses type-case:
<sum the contents of a tree>=
(define Tree-sum
(lambda (t)
(type-case Tree t
((empty-tree) 0)
((node d l r) (+ d
(Tree-sum l)
(Tree-sum r))))))
There is no loss of generality in having only the define-datatype
form and eliminating the define-record form, because a datatype
with only one variant is just a record type:
<student dataype>= (define-datatype Student (student name major GPA))
The define-datatype form, as described in the preceding section, does not
adequately capture all of the type information that we might want to
include with the Tree datatype. For instance, we might wish to
assert that it is nonsensical to construct a node with a boolean or a
port for its left subtree; subtrees should always be Trees. To
enable this constraint to be expressed in the define-datatype form, we
turn again to Scheme's correlation of predicates and types:
<tree datatype>= (define-datatype Tree (empty-tree) (node datum (left Tree?) (right Tree?)))
In this new definition of the Tree type, we have extended the
description of the left and right fields to include predicates,
which constrain the values stored in those fields. If we wanted to
make sure that Trees can only contain numbers, we could replace
datum with (datum number?). Fields without predicates get the
default, always-true predicate: (lambda (x) #t). Now, if someone
attempts to construct a node whose subtrees are not Trees, an
error will be signaled.
At times, it can be useful to recognize that some record is no longer being used and to ``recycle'' it as another record. This is precisely the technique used in a paper that describes stack-less data-structure construction [cite sobel-friedman:98]. To support this application, we add recycling constructors, which reuse existing records, instead of allocating new ones. Recycling constructors perform a kind of ``coercion-with-update'' on records.
As an example, suppose we define a record-type that contains a
Tree, it's height, and a flag saying whether the tree is balanced
or not:
<tree information datatype>= (define-datatype Tree-Info (tree-info (height integer?) (tree Tree?) (balanced boolean?)))
If we have constructed a tree-info with
<make a tree-info>=
(define ti (make-tree-info 3 some-tree #t))
and we are sure that the tree-info record will never be needed
again, [We can be sure that a record is no longer useful in
several garbage-collection-style ways [cite morrisett-etc:95], via
linear typing, are simply by design [cite sobel-friedman:98].]
then we can use its memory for a node record:
<recycle atree-infoas anode>= (recycle-as-node ti [right (make-empty-tree)])
This constructs a node with 3 as its datum,
some-tree as its left subtree, and an empty-tree as its right
subtree.
The following code uses Chez Scheme's syntax-case macro expansion
system. At macro-expansion time, Chez Scheme treats code not as
lists, but as syntax objects. The syntax-case form is used
to match and dissect syntax objects, and the syntax form is used
to construct new syntax objects. Defining a macro really means
defining a transformer of
syntax-objects.
The define-datatype form needs to expand into a sequence of
definitions for four different kinds of operators: variant
constructors, ``recyclers,'' a predicate, and something that performs
the ``case'' action. Accessors for individual fields in a variant do
not need to be externally visible, nor do predicates for variants,
since the type-case form can do the same work. Thus, we would
like the Tree datatype definition to expand into something like
this:
<expansion ofTreedatatype>= (begin (begin (define make-empty-tree (lambda () <construct anempty-tree>)) (define-syntax recycle-as-empty-tree <syntax transformer forrecycle-as-empty-tree>)) (begin (define make-node (lambda (datum left right) (if (and (Tree? left) (Tree? right)) <construct anode> (error 'node "invalid argument types")))) (define-syntax recycle-as-node <syntax transformer forrecycle-as-node>)) (define Tree? (lambda (x) <testxforTree-ness>)) (define-syntax Tree <thetype-casehandler forTrees>))
And we would like the type-case form, when working with a
Tree, to make Tree's handler (which we call Tree) do the
work. That is,
<an example oftype-caseforTree>= (type-case Tree t ((empty-tree) 0) ((node d l r) (+ d (Tree-sum l) (Tree-sum r))))
should expand into
<expansion oftype-caseforTree>= (Tree t ((empty-tree) 0) ((node d l r) (+ d (Tree-sum l) (Tree-sum r))))
Thus the type-case macro will be
<type-case macro>=
(define-syntax type-case
(lambda (x)
(syntax-case x ()
((_ ?type-name ?exp ?clause ...)
(syntax (?type-name ?exp ?clause ...))))))
and the basic form of the define-datatype macro will be as follows:
<define-datatype macro>=
(define-syntax define-datatype
(letrec (<helper procedures for macro>)
(lambda (x)
(syntax-case x ()
[(_ ?name (?variant ?field ...) ...)
(and (identifier? (syntax ?name))
(andmap identifier? (syntax (?variant ...))))
(with-syntax ([(?constructor ...)
<construct names for variant constructors>]
[(?recycler ...)
<construct names for recyclers>]
[?datatype-predicate
<construct name for datatype predicate>]
[(?field-count ...)
<compute number of fields in each variant>]
[(((?field-name ?predicate) ...) ...)
<normalize field descriptions>])
(syntax
(begin
<definitions for each variant> ...
<definition of datatype predicate>
<definition of type-case handler>)))]))))
It will be useful in more than one place to have syntax objects
representing the number of fields in each variant, so we will compute
and bind these in advance. The expression that immediately follows
the pattern in syntax-case acts as a ``fender.'' It provides
additional contraints, which must be met in order for that case to
count as a match. For our purposes, we want to claim that the
define-datatype form is only valid if the name of the type and the
names of the variants are all legal identifiers.
Earlier, we quoted from the passage in the Revised^5 Report that required the predicates for the various Scheme types to be disjoint. Unless we have the power to change the underlying representation of Scheme objects (and the garbage collector, and so on), we will not be able to satisfy this requirement with the new types generated by a macro. The best we can hope for is that the new types be disjoint from all but one built-in type and that all new types be disjoint from each other. The implementation described here satisfies these constraints.
We use Scheme vectors to represent records. The first value in the
vector identifies the datatype, the second value identifies the
variant, and the rest of the values are the field values. For
instance, continuing the Tree example above, we would finish the
constructors with
<construct an empty-tree>=
(vector 'Tree 'empty-tree)
and
<construct a node>=
(vector 'Tree 'node datum left right)
Creating identifiers to act as names for the predicate and assorted
constructors is a repetitive task, so some helper procedures are
useful. The challenge is that the names will be constructed from
pieces of different types: parts of the names will come in as syntax
objects; we will provide other parts as strings. Coercing all the
parts to strings simplifies matters. Thus, one binding pair in the
letrec is
<helper procedures for macro>=
[as-string (lambda (x)
(cond
[(identifier? x) (as-string (syntax-object->datum x))]
[(symbol? x) (symbol->string x)]
[(string? x) x]
[else (error 'define-datatype
"cannot construct an identifier using ~s"
x)]))]
If the argument is an identifier, syntax-object->datum gives us
the name of the identifier as a Scheme symbol.
Constructing an identifier involves converting all the given
parts to strings, concatenating the strings, and turning the result
into a syntax object that represents an identifier. The
datum->syntax-object procedure produces a syntax object for a
Scheme datum. Since syntax objects encapsulate lexical information,
an additional argument is required to specify the context from which
the object should pretend to have come. This argument is typically
some identifier from which datum->syntax-object can borrow a
context. Here, we will use the name of the datatype for this
purpose.
<helper procedures for macro>+=
[construct-id (lambda (context . parts)
(datum->syntax-object context
(string->symbol
(apply string-append
(map as-string parts)))))]
Now we have the tools at our disposal to start building names. The simplest is the name of the predicate for the whole datatype, as it is based on the name of the datatype:
<construct name for datatype predicate>= (construct-id (syntax ?name) (syntax ?name) "?")
The names for the variant constructors and recyclers will take a little more work, because a whole set of names must be produced, two names for each variant.
<construct names for variant constructors>=
(map (lambda (variant)
(construct-id (syntax ?name) "make-" variant))
(syntax (?variant ...)))
<construct names for recyclers>=
(map (lambda (variant)
(construct-id (syntax ?name) "recycle-as-" variant))
(syntax (?variant ...)))
Normalizing the field descriptions requires us to make sure that every field has a predicate. Any fields that do not have predicates get the always-true predicate by default:
<normalize field descriptions>=
(map (lambda (fields)
(map (lambda (field)
(syntax-case field ()
[(?field-name ?predicate)
(identifier? (syntax ?field-name))
field]
[?field-name
(identifier? (syntax ?field-name))
(syntax (?field-name (lambda (x) #t)))]))
fields))
(syntax ((?field ...) ...)))
The preamble to the macro is now complete, except for computing the field counts for each variant. These counts need to be converted to syntax objects in order to be used in the output of the macro.
<compute number of fields in each variant>=
(map (lambda (fields)
(datum->syntax-object (syntax ?name) (length fields)))
(syntax ((?field ...) ...)))
The entire preamble to the macro is complete now. With so much already taken care of, the output definitions are fairly straightforward. For each variant, one constructor and one recycler are defined, using the names we just produced for them.
<definitions for each variant>=
(begin
(define ?constructor
(lambda (?field-name ...)
(if (and (?predicate ?field-name) ...)
(vector '?name '?variant ?field-name ...)
(error '?constructor "invalid argument types"))))
<definition of recycler form>)
The use of ``...'' in the output of the main macro body causes
these constructor definitions to be replicated for each variant, using
the appropriate substitutions for the pattern variables. We have put
off the definitions of the recyclers until later, since they require
the macro to define a new macro.
The predicate for the whole datatype is also simple: an object is an
instance of the datatype if it is a vector whose first value is the
name of the datatype. For example, we can complete the definition of
Tree? with:
<test for Tree>=
(and (vector? x)
(>= (vector-length x) 2)
(eqv? (vector-ref x 0) 'Tree))
Generalizing from this example, we get:
<definition of datatype predicate>=
(define ?datatype-predicate
(lambda (x)
(and (vector? x)
(>= (vector-length x) 2)
(eqv? (vector-ref x 0) '?name))))
The only outstanding definitions are those of the recyclers and the
case handler. Unlike the other definitions, these provide new syntax,
not just new procedures. Thus, the define-datatype macro produces
other macro definitions in its output. It is easy to get confused
trying to follow the code in this kind of situation. To clarify the
code, we use double question marks and double underscores for pattern
variables that appear in the output. Pattern variables with single
question marks are filled in during the expansion of
define-datatype and become part of the definitions of the output
macros. The macro system itself requires the use of ``(... ...)''
to produce a single ``...'' in the output.
During the expansion of a use of a recycler, the field names in
the update pairs need to be recognized and converted to vector
indices. This requires the field names to be available as the macro
is expanded. Thus, we bind a list of the field names and close over
it with the syntax transformer. For instance, the transformer
procedure for recycle-as-node is:
<syntax transformer for recycle-as-node>=
(let ([fields (list 'datum 'left 'right)])
(letrec (<helper procedures for recycler>)
(lambda (x)
(syntax-case x ()
[(__ ??old [??delta-field ??delta-value] (... ...))
(with-syntax ([(??delta-index (... ...))
<compute delta indices>])
(syntax
(let ([new ??old])
(if (and (vector? new)
(= (vector-length new) 5))
(begin
(vector-set! new 0 'Tree)
(vector-set! new 1 'node)
(vector-set! new ??delta-index ??delta-value)
(... ...)
new)
(error 'recycle-as-node
"~s cannot be recycled as a ~a"
new
'node)))))]))))
There are still a couple of parts missing from the preceding code, which we will fill in later. For now, let us consider the desired expansion of the following:
<use of recycle-as-node>=
(recycle-as-node (f x) [right (empty-tree)])
The expression (f x) should evaluate to a record that is the
same size as a node record. The entire recycle-as-node
expression should expand to:
<expansion of recycle-as-node>=
(let ([new (f x)])
(if (and (vector? new)
(= (vector-length new) 5))
(begin
(vector-set! new 0 'Tree)
(vector-set! new 1 'node)
(vector-set! new 4 (empty-tree))
new)
(error 'recycle-as-node
"~s cannot be recycled as a ~a"
new
'node)))
More generally, a recycler macro definition should be like this:
<definition of recycler form>=
(define-syntax ?recycler
(let ([fields (list '?field-name ...)])
(letrec (<helper procedures for recycler>)
(lambda (x)
(syntax-case x ()
[(__ ??old [??delta-field ??delta-value] (... ...))
(with-syntax ([(??delta-index (... ...))
<compute delta indices>])
(syntax
(let ([new ??old])
(if (and (vector? new)
(= (vector-length new) (+ ?field-count 2)))
(begin
(vector-set! new 0 '?name)
(vector-set! new 1 '?variant)
(vector-set! new ??delta-index ??delta-value)
(... ...)
new)
(error '?recycler
"~s cannot be recycled as a ~a"
new
'?variant)))))])))))
The recycler ``re-tags'' the record with the appropriate type and
variant information; then, it replaces the values in any fields for
which update values have been specified. The indices of those fields
can be found by looking up where they appear in fields. After
computing each index, it must be converted into a syntax object, so
that it can be dropped into the output code.
<compute delta indices>=
(map (lambda (field)
(datum->syntax-object (syntax __)
(lookup-index (syntax-object->datum field) fields 2)))
(syntax (??delta-field (... ...))))
The lookup-index procedure can simply count up to the location
where it finds the requested field, starting at the value passed in
(i.e., 2).
<helper procedures for recycler>=
[lookup-index (lambda (target source start)
(cond
[(null? source)
(error '?recycler
"unrecognized field name: ~a"
source)]
[(eqv? target (car source))
start]
[else
(lookup-index target (cdr source) (+ start 1))]))]
Now we move on to the definition of the case handler. During the expansion of a use of the case handler, the variant names in each clause need to be checked against those defined in the datatype. If any do not match, an error is signaled. Also, if the number of ``pattern'' parameters provided in the variant clause does not match the number for fields in the variant, an error is signaled. Thus, the case handler needs access to the variant names and record lengths at time of expansion of a use of the case handler. Getting these into the macro requires the creation of a table, bound locally around the case handler macro transformer.
For each variant clause, if the input value is of that variant type, the ``pattern'' parameters provided for the variant will be bound to the corresponding values in the record. Of course, the expression passed into the case macro should only be evaluated once, not once for each variant, so the macro binds the value of the expression to a local variable.
<definition of type-case handler>=
(define-syntax ?name
(letrec (<case macro helpers>)
(let ([variant-table (list (cons '?variant '?field-count) ...)])
(lambda (x)
(syntax-case x ()
[(__ ??exp ??clause0 ??clause1 (... ...))
(with-syntax
([??body <case macro body>])
(syntax
(let ([case-val ??exp])
(if (?datatype-predicate case-val)
(let ([variant-tag (vector-ref case-val 1)])
??body)
(error 'type-case
"~s is not a ~a"
case-val
'?name)))))])))))
The body of the case handler macro will expand each variant clause into an
if that tests whether case-val is of that variant type. If it
is, the parameters are bound and the rest of the clause is executed.
If not, evaluation proceeds to the next clause. The else clause
can only appear last. Continuing our Tree example, consider the
following use of type-case:
<use of Type-case>=
(type-case Tree (f x)
[(node d l r) node-result]
[else other-result])
This should expand into
<use of the Tree case handler>=
(Tree (f x)
[(node d l r) node-result]
[else other-result])
which should expand into something like this:
<expansion of Tree case handler>=
(let ([case-val (f x)])
(if (Tree? case-val)
(let ([variant-tag (vector-ref case-val 1)])
(if (eqv? variant-tag 'node)
(let ([d (vector-ref case-val 2)]
[l (vector-ref case-val 3)]
[r (vector-ref case-val 4)])
node-result)
other-result))
(error 'type-case "~s is not a ~a" case-val 'Tree)))
Generating the sequence of tests is very similar to what must be
done in the standard Scheme case macro. The macro is somewhat
long (due mostly to handling the else clause and requiring it to
appear last), but it is straightforward.
<case macro body>=
(let caseloop ([clause0 (syntax ??clause0)]
[others (syntax (??clause1 (... ...)))])
(cond
[(null? others)
(syntax-case clause0 (else)
[(else ??body (... ...))
(syntax (begin ??body (... ...)))]
[((??variant ??var (... ...)) ??body (... ...))
(with-syntax ([(??index (... ...))
<construct indices for fields>])
(syntax
(if (eqv? variant-tag '??variant)
(let ([??var (vector-ref case-val ??index)] (... ...))
??body (... ...))
(error 'type-case "no matching clause for ~s" case-val))))])]
[else
(syntax-case clause0 (else)
[((??variant ??var (... ...)) ??body (... ...))
(with-syntax ([??rest
(caseloop (car others) (cdr others))]
[(??index (... ...))
<construct indices for fields>])
(syntax
(if (eqv? (vector-ref case-val 1) '??variant)
(let ([??var (vector-ref case-val ??index)] (... ...))
??body (... ...))
??rest)))])]))
The missing piece in this long macro is the construction of the indices at which to find the field values in the record. These are simply (the syntax objects representing) the numbers 2, 3, ..., where the number of indices equals the number of fields.
<construct indices for fields>=
(build-indices (syntax ?name)
(syntax-object->datum (syntax ??variant))
(syntax (??var (... ...)))
variant-table)
Of course, build-indices is another simple recursive procedure.
Like the procedure above that produced record lengths, this one must
convert the Scheme numbers into syntax objects:
<case macro helpers>=
[build-indices (lambda (context variant-name vars table)
(let ([entry (assoc variant-name table)])
(if entry
(let ([field-count (cdr entry)])
(if (= field-count (length vars))
(make-range context 2 (+ field-count 2))
(error 'type-case
"incorrect number of fields in ~a"
variant-name)))
(error 'type-case
"~a is not a variant of ~a"
variant-name
(syntax-object->datum (syntax ?name))))))]
[make-range (lambda (context start end)
(cond
[(>= start end) '()]
[else (cons (datum->syntax-object context start)
(make-range context (+ start 1) end))]))]
This is the only helper procedure we need for the case handler, and it
completes the definition of the define-datatype macro.
[1] Revised^5 report on the algorithmic language Scheme. In progress.
[2] Robin Milner, Mads Tofte, and Robert Harper. The Definition of Standard ML. MIT Press, 1990.
[3] Greg Morrisett, Matthias Felleisen, and Robert Harper. Abstract models of memory management. In FPCA '95: 7^th International Conference on Functional Programming Languages and Computer Architecture, pages 66--77, La Jolla, June 1995. ACM Press.
[4] Jonathan Sobel and Daniel P. Friedman. Recycling continuations. In International Conference on Functional Programming '98, Baltimore, September 1998. To appear.
<datatype.ss>= ;;;---------------------------------------------------------------------- ;;; A datatype macro for Scheme: <define-datatype macro> <type-case macro>