Syntax extensions without Camlp4

In this post, I'd like to propose a new approach to extending the syntax of OCaml that does not depend on Camlp4. This proposal is the result of discussions with Xavier Clerc, Fabrice Le Fessant and Maxence Guesdon, and was presented at the latest meeting of the Caml Consortium. It was also briefly mentioned by Xavier Leroy during his talk at the latest OCaml Users Meeting.

Camlp4 makes it possible to alter the concrete grammar recognized by the OCaml parser. New constructions can be added to the language provided that they can be mapped to the standard language in a purely syntactic way. Camlp4 is a complex beast. It implements its own unique parsing technology, it defines its own interpretation of the OCaml Abstract Syntax Tree (AST), and it allows code generators to use the concrete syntax of OCaml. These features might be needed or wanted in some cases, but I believe that a lot of existing syntax extensions, currently implemented with Camlp4, don't really need any of these complex things.

The proposal can be explained in two points:

  • Extend the OCaml syntax once and for all with attributes.
  • Syntax extensions are simply functions mapping an AST to another AST.

Attributes

What is an attribute? Just some new syntax to allow generic data to be attached to various kinds of nodes in the AST: expressions, type declarations, type expressions, methods, ... Somewhat like Java annotations (http://download.oracle.com/javase/tutorial/java/javaOO/annotations.html) or C# attributes (http://msdn.microsoft.com/en-us/library/aa288059.aspx, http://msdn.microsoft.com/en-us/library/z0w1kczw.aspx). One should decide what kind of data an attribute carries and what the concrete syntax for attributes is. I don't have a concrete proposal here, but for the sake of the discussion, let's assume that attributes share the same syntax as OCaml expressions and that they can be attached to a node in the AST with a @@ syntax. For instance, a type declaration with some attributes could be written:

  type my_type @@ [with_xml] =
     | A of int
     | B of string @@ {xml=base64}
     | C @@ {xml=ignore}

Note that [with_xml], {xml=base64} and {xml=ignore} are three syntactically valid OCaml expressions. They are here used as attributes, attached respectively to the type declaration, the argument of the B constructor and the C constructor itself.

Imagine that the OCaml parser is extended to accept such attributes. I mean, the real OCaml parser, the one implemented in parsing/parser.mly that produces the AST described in parsing/parsetree.mli. This is quite easy to do. Just add some constructors or record fields to the definition of AST types and a few new rules in the parser itself. Once and for all. For now, one can assume that the OCaml type-checker simply ignores attributes (although one could also argue that it should propagate them to the typed tree, but this is another story).
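
To make this concrete, here is a minimal sketch of what "adding record fields for attributes" could look like. It uses a toy AST, not the real definitions from parsing/parsetree.mli, and every type and field name in it is hypothetical. The value my_type encodes the example declaration given earlier, and strip_attributes shows how a tool that does not care about attributes could simply erase them.

```ocaml
(* Toy model only: these types are NOT the real Parsetree definitions. *)

(* An attribute: a name with an optional value, kept as text. *)
type attribute = { attr_name : string; attr_value : string option }

(* A variant constructor. Its argument type is kept as text and carries
   its own attributes, separate from those of the constructor itself. *)
type constructor = {
  c_name : string;
  c_arg : (string * attribute list) option;
  c_attrs : attribute list;
}

(* A type declaration with attributes attached to the declaration. *)
type type_decl = {
  t_name : string;
  t_constructors : constructor list;
  t_attrs : attribute list;
}

(* The example type from the text, written as a value of the toy AST. *)
let my_type = {
  t_name = "my_type";
  t_attrs = [ { attr_name = "with_xml"; attr_value = None } ];
  t_constructors = [
    { c_name = "A"; c_arg = Some ("int", []); c_attrs = [] };
    { c_name = "B";
      c_arg = Some ("string",
                    [ { attr_name = "xml"; attr_value = Some "base64" } ]);
      c_attrs = [] };
    { c_name = "C"; c_arg = None;
      c_attrs = [ { attr_name = "xml"; attr_value = Some "ignore" } ] };
  ];
}

(* A tool that ignores attributes can erase them everywhere. *)
let strip_attributes (d : type_decl) : type_decl =
  let strip_constructor c =
    let c_arg =
      match c.c_arg with
      | Some (ty, _) -> Some (ty, [])
      | None -> None
    in
    { c with c_arg; c_attrs = [] }
  in
  { d with
    t_attrs = [];
    t_constructors = List.map strip_constructor d.t_constructors }
```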

Preprocessors = AST-transforming functions

The OCaml compiler has an option -pp to plug an external preprocessor. A preprocessor is an arbitrary command which will be called by the compiler; it receives the name of the source file as a command-line argument and it is supposed to dump its result to its standard output, either as source code in text form, or as a marshaled version of the OCaml AST. I propose to add a new command-line option -ppx. It also comes with an external command to be executed, but this command does not take a file name as argument. Instead, it receives an already parsed OCaml AST (as a marshaled value) and it must return an OCaml AST as well. Several -ppx commands can be used at the same time. The compiler parses the source file itself and then pipes it through each external command specified with -ppx. Of course, the OCaml AST and parser have been extended to recognize attributes.

A typical use of Camlp4 is to generate some code based on type declarations. As an example, consider a syntax extension that generates XML parsers and pretty-printers from type declarations. This could be implemented quite simply as a -ppx preprocessor. Basically, the task is to write an OCaml function of type Parsetree.structure -> Parsetree.structure (one must of course give access to parsetree.cmi in order to compile this piece of code). The function would simply traverse the AST, detect some specific attributes that trigger or control its code generation (like the attributes in the example above) and produce new nodes in the AST. There is also some driver code to read a marshaled value from the standard input, convert it to a value of type Parsetree.structure, call the function above, and send a marshaled version of the resulting Parsetree.structure to the standard output. Probably a few lines of code to be robust (check magic numbers, etc). These lines and the traversal code are quite generic; I'm sure people will quickly define some reusable routines (as functions or classes).
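
The shape of such a preprocessor can be sketched as follows. This is only an illustration under stated assumptions: the structure type is a toy stand-in for Parsetree.structure, the magic string and the wire format (magic string followed by one marshaled value) are hypothetical, and the generated items are placeholders rather than real code. Only the Marshal and channel functions come from the standard library.

```ocaml
(* Toy model: NOT the real Parsetree.structure. *)
type item =
  | Type_decl of string * string list   (* type name, attribute names *)
  | Value of string * string            (* value name, body kept as text *)

type structure = item list

(* Hypothetical magic string guarding against incompatible AST versions. *)
let magic = "ToyAST001"

(* The rewriter: for each type carrying the with_xml attribute, keep the
   declaration and add placeholder items for a printer and a parser. *)
let expand_with_xml (str : structure) : structure =
  List.concat
    (List.map
       (function
         | Type_decl (name, attrs) as decl when List.mem "with_xml" attrs ->
             [ decl;
               Value ("xml_of_" ^ name, "<generated printer>");
               Value (name ^ "_of_xml", "<generated parser>") ]
         | item -> [ item ])
       str)

(* Generic driver: read a marshaled AST, check the magic string,
   apply the rewriter and write the marshaled result. *)
let run_filter (f : structure -> structure) ic oc =
  let header = really_input_string ic (String.length magic) in
  if header <> magic then failwith "unexpected AST magic number";
  let ast : structure = Marshal.from_channel ic in
  output_string oc magic;
  Marshal.to_channel oc (f ast) [];
  flush oc

(* A command built this way would end with something like:
   let () = run_filter expand_with_xml stdin stdout *)
```

Keeping run_filter independent of the rewriting function is what would let the driver be shared between preprocessors.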

And that's all.

I lied a little bit. The "XML parser and pretty-printer code generator" above does not really extend any syntax. It simply interprets some attributes already parsed but otherwise ignored by the regular OCaml compiler. Still, for many syntax extensions currently implemented with Camlp4, this would be good enough. And the ones that generate code from type declarations tend to be amongst the most interesting ones, in my opinion.

For a decent OCaml developer, reading parsetree.mli (from the OCaml distribution) and becoming familiar with the OCaml AST should be quite easy. An order of magnitude easier than learning Camlp4 anyway. No need to learn a new parsing technology and other advanced concepts either.

Quotations

A long time ago, before the local-open was available in OCaml, I wrote a tiny Camlp4 extension that would simulate this feature. The extension would rewrite let open M in e into something like:

    let module XXX = struct open M let r = e end in XXX.r
  
How would such an extension be implemented using the proposal above? Well, one simply needs to encode the new construction with attributes. For instance, instead of let open M in e, one could write (() @@ open_in M) e, using an attribute on an expression node. One cannot use open because it is a keyword and the syntax looks quite ugly.

Here is an extra little proposal. In addition to extending the OCaml syntax with attributes, one could also extend various kinds of syntactic items with a new "quotation" construction. Again, the syntax needs to be decided. For instance, let's use double curly braces. So the OCaml AST and parser would accept a new kind of expression (and other items), written {{...}}. The ... is not parsed by OCaml (one simply needs to agree on lexical conventions to detect its end), and it is represented simply as a string in the AST.

Contrary to attributes, the type-checker has no option but to reject quotations. But an external preprocessor could detect quotations and eliminate them (and part of their context). For instance, a preprocessor could detect {{open M}}; e and do the rewriting as above. Not as nice as a full fledged syntax extension, but maybe good enough.
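
As a rough sketch of that rewriting, assume a toy expression type in which a quotation is just a node holding the raw string. Everything below is hypothetical, including the Local_open node, which stands in for the let-module encoding shown earlier. The rewriter looks for a quotation of the form open M followed by a sequence, and replaces both with a single node.

```ocaml
(* Toy expression AST; NOT the real Parsetree.expression. *)
type expr =
  | Ident of string
  | Quotation of string            (* raw, unparsed text between the braces *)
  | Sequence of expr * expr        (* e1; e2 *)
  | Local_open of string * expr    (* stands for the let-module encoding *)

let prefix = "open "

(* If the quotation text is the word open followed by a module name,
   return that name. *)
let open_target (s : string) : string option =
  let s = String.trim s in
  let n = String.length prefix in
  if String.length s > n && String.sub s 0 n = prefix
  then Some (String.trim (String.sub s n (String.length s - n)))
  else None

(* Eliminate the quotations we understand, together with their context. *)
let rec rewrite (e : expr) : expr =
  match e with
  | Sequence (Quotation q, body) ->
      (match open_target q with
       | Some m -> Local_open (m, rewrite body)
       | None -> Sequence (Quotation q, rewrite body))
  | Sequence (a, b) -> Sequence (rewrite a, rewrite b)
  | Local_open (m, body) -> Local_open (m, rewrite body)
  | Ident _ | Quotation _ -> e
```

Quotations the rewriter does not recognise are left in place, so another preprocessor can handle them or the type-checker can reject them.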

Conclusion

A lot of questions still need to be addressed before this proposal could even be considered for inclusion into OCaml: what exactly are attributes (OCaml expressions, s-exp, ...), how they are attached to syntactic components and which syntactic categories are supported, whether quotations should be introduced in addition to attributes. The order in which preprocessors are applied is also an issue (one could imagine iterating until a fix point is reached).

Still, I wanted to describe these preliminary ideas. Any feedback will be appreciated!

What about hygiene?

Syntactic extension via AST-to-AST transformers is probably an improvement over Camlp4, but I wonder if you have any ideas about hygiene. The naive, gensym-based approach to hygiene is brittle and error-prone, so if you’re designing something new, it would be good to get that part of the story right.

Another use case

Here's another use case that is becoming quite popular at Jane Street. You can kick off a syntax extension from a type definition, without actually creating a new type definition. i.e.:
List.map ~f:<:to_sexp<int*string>> [1;2;3]
How would one propose writing this in the brave new annotations world?

Another use case

A solution that does not use attributes:
  List.map ~f:(to_sexp : int * string) [1; 2; 3]
The AST->AST rewriter would detect this special form (to_sexp : ...).

Something more robust, if attributes are defined as syntactically valid expressions:

  List.map ~f:(() @@ (to_sexp : int * string)) [1; 2; 3]
The form "() @@ attribute" could be replaced with something like quotations (but whose content would be parsed e.g. as an OCaml expression):
  List.map ~f:{@ (to_sexp : int * string) @} [1; 2; 3]
The advantage of this (and quotations) is that the compiler will fail with a clean error message if no preprocessor rewrites the special node (instead of type-checking the expression as ()).
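
Whatever the surface syntax, the rewriter ends up turning a type expression into a converter. A minimal sketch of that step, over a toy type-expression AST that covers only int, string and pairs: the ty type is hypothetical, and the sexp_of_int, sexp_of_string and Sexp.List names appearing in the generated text are assumed to be provided by whatever S-expression library is in use.

```ocaml
(* Toy type expressions; NOT the real Parsetree.core_type. *)
type ty = Int | String | Pair of ty * ty

(* Produce the source text of a converter for a type expression.
   The names used in the generated text are assumptions about the
   target library, not something this sketch defines. *)
let rec converter_source (t : ty) : string =
  match t with
  | Int -> "sexp_of_int"
  | String -> "sexp_of_string"
  | Pair (a, b) ->
      Printf.sprintf "(fun (x, y) -> Sexp.List [%s x; %s y])"
        (converter_source a) (converter_source b)
```

A real rewriter would build AST nodes rather than text, but the recursion over the type expression would have the same shape.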

Sounds good!

I like the general idea very much. Here are some of my initial ideas and reports on my personal experience.

Deriving code from type definitions

For this I've been developing a syntax for type definitions called ATD (Adjustable Type Definitions). The syntax is very close to OCaml but supports attributes (called annotations) on most types of nodes of the parse tree (type expression, type definition, record field name, ...).

The primary goal of ATD is to allow cross-language data exchange. For such applications we need to derive code other than OCaml based on the same type definitions without duplicating them.

Because the atd library does not use Camlp4, it always produces .mli and .ml files. These are large files and being able to review the generated code at no cost is a benefit. Parsing is done with ocamllex and menhir. Emitting indented code is done with a very simple module allowing me to create a tree of sufficiently-well indented lines of code (http://oss.wink.com/atd/atd-1.0.1/odoc/Atd_indent.html).

Here is an example of what ATD supports:

(* This is an ATD file *)
 
type 'a tree = [ Node of ('a tree * 'a tree) | Leaf of 'a ]
 
type record = {
  name : string;
    (* Required field *)
 
  ~friends : string list <ocaml repr="array">;
    (*
      Optional field with a default value, by default the empty list.
 
      <ocaml repr="array"> is an annotation for the OCaml code generators
      that only they need to interpret.
    *)
 
  ?descr : string option;
    (* Optional field without a default value. *)
 
  tree : int tree;
}
Annotations (attributes) are placed after the item they refer to. They form a whitespace-separated list of XML-like elements without children. For example we could have two annotations on a single type expression:
int <ocaml repr="int64" validator="fun x -> x >= 0L"> <biniou repr="int64">
This is just to say that for this application, allowing two levels of attributes such as "ocaml"/"biniou" and "repr"/"validator" works fine in practice. It is not suitable, however, for containing very large code because double-quotes must be escaped.

More information on the goals of ATD: http://martin.jambon.free.fr/atd-biniou-intro.html
ATD manual: http://oss.wink.com/atd/atd-1.0.1/manual/atd-manual.html

Insertion of a snippet in a foreign language

Currently the only camlp4 syntax extension I allow myself to use in new code is mikmatch (http://martin.jambon.free.fr/micmatch.html). It basically takes pattern-match cases (match-with, function) containing regexp patterns for matching strings. The whole pattern matching is rewritten into a more complex expression that takes care of calling PCRE and binding identifiers to captured substrings. For example we can write:
match l with
    [ / (alpha+ as first_name) space+ (alpha+ as last_name) eos / ]
  | [ first_name; last_name ] ->
       Some (first_name, last_name)
  | _ ->
       None
Using attributes, it would be nice to be able to write something like this:
match l with
    [ {{ (alpha+ as first_name) space+ (alpha+ as last_name) eos }} ]
  | [ first_name; last_name ] ->
       Some (first_name, last_name)
  | _ ->
       None
I am not sure about the best syntax for attributes and for quotations. I'm also thinking of a Lisp-like list of standard OCaml lexemes instead of quotations. Parentheses, square and curly brackets would have to match and could be used at will without escaping:
_ @@ re ((alpha+ as first_name) space+ (alpha+ as last_name) eos)
What follows @@ would have the form: NAME SEQ where NAME is the name of the attribute and SEQ is its value, which is either a simple lexeme (ident, int, string, ...) or something between (), [] or {}. This would allow the reuse of existing OCaml keywords, without the escaping issues we have with camlp4-like quotations or double-quoted strings.
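
The only constraint this puts on SEQ is that brackets match. A small sketch of that check, over a hypothetical token type that distinguishes nothing but brackets from other lexemes:

```ocaml
(* Hypothetical token type: only brackets matter for this check. *)
type token =
  | Open of char      (* one of ( [ { *)
  | Close of char     (* one of ) ] } *)
  | Other of string   (* any other lexeme: ident, keyword, literal, ... *)

(* The closing bracket expected for a given opening bracket. *)
let closing_of (c : char) : char =
  match c with
  | '(' -> ')'
  | '[' -> ']'
  | '{' -> '}'
  | _ -> invalid_arg "closing_of: not an opening bracket"

(* True if every bracket is closed by the matching one, in order. *)
let balanced (toks : token list) : bool =
  let rec go stack = function
    | [] -> stack = []
    | Open c :: rest -> go (closing_of c :: stack) rest
    | Close c :: rest ->
        (match stack with
         | expected :: stack' when expected = c -> go stack' rest
         | _ -> false)
    | Other _ :: rest -> go stack rest
  in
  go [] toks
```

Since keywords are just Other lexemes here, the value of an attribute can reuse them freely, which is the point made above.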

Extension of the OCaml grammar

Let's consider a try...finally construct that ideally looks like this:
try f () finally g ()
try f () with Foo -> h () finally g ()
With an attribute, we could write:
(f ()) @@ finally (g ())
(try f () with Foo -> h ()) @@ finally (g ())
which is not that bad at all!
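
One possible target for the rewriting is a plain helper function. The sketch below is an assumption about how the expansion could look, not part of any existing extension; with_finally is a hypothetical name.

```ocaml
(* Run body, then run cleanup whether body returned or raised. *)
let with_finally (body : unit -> 'a) (cleanup : unit -> unit) : 'a =
  match body () with
  | result -> cleanup (); result
  | exception e -> cleanup (); raise e
```

With this helper, (f ()) @@ finally (g ()) could be rewritten into with_finally (fun () -> f ()) (fun () -> g ()), and the second form into the same call with the whole try ... with expression as the body.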

This is the way to go

I have been discussing these ideas with Alain for some time now, and I want to express my wide support for these proposals. While this won't match all the features of Camlp4, it will provide a much better way to do some of them. I support:
  • Attributes
  • -ppx to have pre-processing from Parsetree to Parsetree
Some support for quotations would be nice as well but I would leave them for later. For instance, maybe having another lexical convention for strings would be enough. Indeed using "..."@@reg seems reasonable, except for the backslash escapes. I propose the following:
expression ::=
  | ...
  | '«' fr-elem* '»'
 
fr-elem ::=
  | any non '«', '»', '{', '}' character -- plain character
  | '{' expression '}'                   -- hole/antiquotation
  | '{«}'                                -- escaped '«'
  | '{»}'                                -- escaped '»'
  | '«{»'                                -- escaped '{'
  | '«}»'                                -- escaped '}'
Then the translation could work like this:
«bla»      -> ["bla"]
«bla\bli»  -> ["bla\\bli"]
«a{»}b{«}» -> ["a»b«"]
«a{b}c»    -> ["a"; b; "c"]
««}»«{»»   -> ["}{"]
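
This translation can be sketched as a small scanner. The code below is a simplified illustration, not a proposed implementation: it works on the text found between the outer « and », keeps holes as unparsed strings instead of expressions, and assumes that a hole contains no } character. The fragment type and function names are hypothetical.

```ocaml
(* A fragment of a «...» literal: plain text, or a hole kept as text. *)
type fragment = Lit of string | Hole of string

(* Escape sequences from the grammar above, with the text they stand for. *)
let escapes = [ ("{«}", "«"); ("{»}", "»"); ("«{»", "{"); ("«}»", "}") ]

(* Does s contain p starting at byte offset i? *)
let has_at (s : string) (i : int) (p : string) : bool =
  let l = String.length p in
  i + l <= String.length s && String.sub s i l = p

(* Split the text between the outer delimiters into fragments.
   Adjacent plain characters and escapes are merged into one Lit. *)
let fragments (s : string) : fragment list =
  let buf = Buffer.create 16 in
  let flush acc =
    if Buffer.length buf = 0 then acc
    else begin
      let lit = Lit (Buffer.contents buf) in
      Buffer.clear buf;
      lit :: acc
    end
  in
  let rec go i acc =
    if i >= String.length s then List.rev (flush acc)
    else
      match List.find_opt (fun (p, _) -> has_at s i p) escapes with
      | Some (p, text) ->
          Buffer.add_string buf text;
          go (i + String.length p) acc
      | None ->
          (match s.[i], String.index_from_opt s (i + 1) '}' with
           | '{', Some j ->
               let acc = flush acc in
               go (j + 1) (Hole (String.sub s (i + 1) (j - i - 1)) :: acc)
           | c, _ ->
               Buffer.add_char buf c;
               go (i + 1) acc)
  in
  go 0 []
```

Escapes are tried before holes, so {»} is read as an escaped » rather than as a hole. A backslash is an ordinary character here, which matches the second example above.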

I like the idea

P4 is powerful, but its costs (the learning cost for users and the maintenance cost for caml-dev) are pretty high. For example, as we have seen, some of the new syntax of OCaml 3.12.0 was not supported by the accompanying P4, and it took weeks to get a patch for it; this makes me feel that P4 no longer has a bright future. On the other hand, writing Parsetree transformers is much easier for people who have some experience hacking the OCaml compiler, and there should be about ten times as many such people as those who understand P4 internals. Even for novices, it gives a chance to learn OCaml internals. So I like your idea.

What is probably missing is a printer for Parsetree (it should be easy) and a way to change the behavior of the lexer. Sometimes I want to write PCRE easily, just like in Perl, without messy backslashes, and to define operators with my favorite precedence and associativity. This kind of thing could be done with a pluggable lexer that creates a Parser.token stream. I am currently playing with an OCaml lexer written with parser combinators, and I can contribute something.

Quotations?

Aren't quotations enough for your example (regexps without messy backslashes)?

You could write directly:

{{rex:\[0-9][0-9][0-9]}}

and the regexp package would come with an AST rewriter that deals with such quotations.
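
For instance, the rewriter could decide whether a quotation is its own by looking at the tag before the first colon. This convention and the function name are assumptions made for the sake of illustration:

```ocaml
(* Split the raw text of a quotation at its first colon into a tag and
   a body. Return None if there is no colon, i.e. no tag. *)
let split_quotation (s : string) : (string * string) option =
  match String.index_opt s ':' with
  | None -> None
  | Some i ->
      Some (String.sub s 0 i,
            String.sub s (i + 1) (String.length s - i - 1))
```

The regexp rewriter would then handle quotations whose tag is rex and leave all others untouched.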

Of course, we are just pushing the lexing issues farther; one still needs a way to escape }} within a quotation.

RE: Quotations ?

Ah, that's nice. But I would still like to write something like $/\[0-9][0-9][0-9]/ when I am doing my OCaml scripting. :-)

Changing the lexer : bad for IDEs ?

An advantage of the approach I'm describing over Camlp4-based extensions is that tools such as IDEs, editor modes, syntax highlighters need to be adapted only once to deal with attributes (and quotations).

If you give some way to change the lexing rules, you basically lose this advantage.

I'm glad to see I'm not the

I'm glad to see I'm not the only one who doesn't like camlp4. However I don't like what you propose either.

The lispers understood something a long time ago: meta programming facilities should be part of the core language and share its scoping rules.

The main benefits are :

1) No need for external tools. This makes syntax extensions easy to use, once you know how to use libraries you know how to use syntax extensions. It also means simpler build systems.

2) Sources are self-describing. Like libraries it is clear from the source which syntax extensions are used, that knowledge is not in the build system.

3) With a scoping rule you get better extension compositionality, it doesn't mean you won't have problems but at least the result on a given source doesn't depend on the order of -ppx flags specified on the command line.

Maybe attributes are the right mechanism to extend the language but IMHO how they are defined and applied should be handled by the core language and the compiler.

Best,

Daniel

Orthogonal and incremental

Daniel, you are right that having the processing tools in the language directly would simplify their use. However, this point is mostly orthogonal to Alain's proposal: he brings a new metaprogramming facility, while you want a different way to deploy all metaprogramming facilities.

You will also understand that Alain is looking for the least disruptive path here: no backward-compatibility issues, restricted changes to the tools, etc. Your proposal, on the contrary, means a significant redesign of the tools and the language (bringing a "moment of evaluation" notion directly into the language also brings new complexity) and, given the conservativeness in the language evolution, this is just not going to happen. That said, I think the proposed change actually goes in your direction, because you will have a much easier time selling in-language integration of syntax extensions if the extension mechanism is simple, solid and limited in scope.

Nice

I'm glad you finally proposed this very interesting idea in the open.

I've long wished we had something like attributes to provide an easy (and safe) way to do common meta-programming tasks such as what is currently provided by type-conv and deriving.

I have personally (nearly) stopped using Camlp4 to change the OCaml syntax, because it is too complex, error-prone, hard to test correctly, and not easily accepted by the user (but mostly because of the complexified build setup, which your proposal doesn't change).
Current attributes-like Camlp4 extensions are nice because they encapsulate those difficulties in a small extension, and provide a flexible way to enhance our programs, along their foreseen use cases, without having to do the dangerous business ourselves.

Here is how I see the underlying idea of your proposal: instead of having a full-fledged system that can do everything (at the syntactic level) but is too complex and fragile like Camlp4, try to isolate some restricted uses of metaprogramming that are common, useful and can actually be accomplished in a simpler system, and devise those simpler systems for those specific purposes.

I am therefore quite dubious about your "quotation" proposal in its current state. Camlp4 Quotations (and antiquotations) are imho one of the very nice parts of Camlp4 (along with the Camlp4MapGenerator/Camlp4FoldGenerator that I hope to see reimplemented in your system).
The general idea is that quotations are free-form strings (besides the delimiter syntax, on which we need a global agreement) that are transformed into a Camlp4 AST by a user-provided parser at preprocessing time. Their behavior is very regular and reliable, yet very rich: a quotation in some context really behaves like a piece of OCaml code in this context; in particular, you don't need to know the inner workings of the quotation processor to know if the code outside the quotation will type-check. In this respect, Camlp4 quotations are modular.

Your proposal regarding quotations is, in essence, to use them as markers for an external processor, in order to direct changes to the outer syntax. I think this is a misuse of the quotation concept that would break their modularity and therefore make them unsafe, in a similar way that current syntax-modifying extensions are fragile (minus the parsing behavior errands).

I think it should be expected that the specialized syntax facilities (attributes and quotations) are *not* as expressive as the general but complex facility (Camlp4). By trying to stretch them to regain the lost expressiveness, you may reduce their benefits.

On an unrelated note, is it practical to represent comments as annotations in your proposal? That's something I would be very interested in. Comments are notoriously difficult to handle for syntax processors (ocamldoc and camlp4 both handle comments in an unsatisfying way), because they're too free-form. By using annotations for comments we could attach them to specific AST nodes, which would make them much more robust regarding external processing. Python docstrings are a good example of this principle. This would also make it easier to envision more structured comments. Yet comments should be as easy to use as possible, and I fear (foo @@ comment "blah blah") may be a bit too heavy to gain wide adoption. What's your opinion on the question?

Quotations

I agree with you: the example I gave for quotations is not very nice. One could of course use quotations in the way you describe to expand sub-languages in a well-controlled way. Once we have the idea of having preprocessors be just AST rewriters, it is quite natural to let them process quotations as well and there is no way to enforce that they will simply expand quotations without looking at the context. It's more a matter of style.

Reimplementing something like ocamldoc on top of attributes is a very good idea. One needs to find a syntax for attributes which is light enough to support this case. If we keep attributes down to .cmi files (which should not be difficult), one could also have this ocamldoc replacement read only .cmi files (currently, ocamldoc needs to parse and type check interfaces again).