OCaml extensions at LexiFi, overidding record labels and constructors.

Alain Frisch

OCaml is a very important language at LexiFi. We use it as our primary internal development language. We also expose it to some of our power users: they can write their own applications or extend LexiFi Apropos (our flagship product) with custom addins written in OCaml.

Well, almost OCaml. Indeed, we are using our own version of the compilers. Compared to the official version from INRIA, it contains several handy extensions that we have implemented for our internal use. Being able to implement and use these extensions is extremely useful to us: controlling your own implementation language gives you a lot of power; several of our core technologies would have been much more difficult to implement without our extensions. We are always willing to share our work on the OCaml compilers. As a matter of fact, several of these extensions have already been integrated in OCaml 3.12: local opening of modules, stricter semantics for method overriding, first-class modules.

In this series of posts, we would like to describe some other extensions to the OCaml compilers that we are using on a daily basis. Our hope is that this can be useful to others, and maybe that we can discuss these ideas of extensions, how they can be improved, and so on.

The first extension I’m going to present makes it possible to use the same label name in several record types, or the same constructor name in several sum types. This was very much inspired by a proposal by Pierre Weis.

Let us consider the following fragment of code:

type s = {x: int; y: int; z: int}
type t = {x: int; y: int}

A well-known issue is that the second type declaration hides some labels that were already defined in the first one, which thus becomes unusable.

There are several typical work-arounds for this issue, but in my opinion, none is fully satisfactory. Let’s discuss them.

Word-around 1: don’t use the same names

Maybe the most popular work-around: an easy way to make names different is to introduce a common prefix to all labels in a given record type:

type s = {s_x: int; s_y: int; s_z: int}
type t = {t_x: int; t_y: int}

Frankly, this is not very nice. You need to repeat the prefix for each field when you write a record literal expression or pattern:

let r = {s_x = 1; s_y = 2; s_z = 3}

If the prefix is long enough (and it should be if you don’t want to be back intro trouble quickly), this becomes slightly annoying.

But more importantly, creating artificially different names makes the recent punning feature useless. For those who haven’t followed, punning allows you to write x instead x = x in a record expression or pattern. As soon as the names of the labels are more descriptive than x and y, punning is very cool. For instance, if labels remain unprefixed, one could write a nice function like:

let make_s ?(x = 0) ~y z = {x; y; z}

It would be sad to give up punning!

A final objection to this technique of prefixing labels to make them all different is that the label names are sometimes visible outside of your program: I’m thinking about LexiFi’s dynamic types, which I’ll cover in a later post, but this is also true e.g. if you generate automatically from type definitions XML, JSON or s-expression serialisers for your OCaml values.

Work-around 2: don’t use records

Instead of record types, one could define:

type s = < x : int; y : int; z : int >
type t = < x : int; y : int >

This actually has some advantages over record types as you can rely on row polymorphism and width subtyping. But there are a lot of reasons that makes this quite bad as a general work-around:

  • Syntax for object literal is quite verbose: compare {x = 1; y = 2} with object method x = 1 method y = 2 end.

  • No pattern matching! Runtime overhead.

  • You loose generic operations: comparison and hashing (objects are compared by identity), marshaling of data (between different processes).

  • There is no way to enforce that field accesses (which are now method calls) are really field accesses (that take no time and perform no side effect).

  • Error messages tend be harder to understand.

  • A somewhat minor point: structural types need to be regular, so forget about the type type ‘a t = [Leaf of 'a | Node of < x : (‘a * ‘a) t > ]! Too bad, given that OCaml now has good support for polymorphic recursion.

As for constructors and sum types, you could use polymorphic variants instead. The last two objections above also apply to them, but this is quite an acceptable solution (if you are not allergic to backquotes).

Work-around 3: hide yourself behind modules

If record types don’t play well together, put them in their own modules:

module S = struct
  type s = {x: int; y: int; z: int}
end
module T = struct
  type t = {x: int; y: int}
end

You can still use punning, as in:

let s_of_t {T.x; y} = {S.x; y; z = 0}

This relies on the fact that prefixing the first field with an explicit module path is enough to tell the compiler that all the labels need to be looked up in the that module.

This work-around is still quite heavy. If you use it systematically, it pollutes your interfaces and makes their ocamldoc-generated documentation less readable. Also, if you need to define recursive record types that share labels, you need to rely on recursive modules, and this gets even heavier:

module rec S : sig
  type t = {x: T.t option}
end = S
and T : sig
  type s = {x: S.s option; y: int}
end = T

Our solution: type names as qualifiers

In LexiFi’s custom version of OCaml, one can write:

type s = {x: int; y: int; z: int}
type t = {x: int; y: int}
let s_of_t {T.x; y} = {S.x; y; z = 0}

Our extension allows to refer explicitly to a given label or constructor by using the same dot notation as for module path qualifiers, except that the qualifier can now be also be a type name (itself, maybe, qualified with a module path). Since module names have to start with an uppercase and we did not want to introduce conflicts in the grammar, we simply require to use a capitalised version of the type name (first letter in uppercase).

We have been playing with this extension for a few months, and we like it. Being able to qualify only the first field of record literal expression or pattern really makes this syntactically lighter than explicitly prefixing every label name, and we don’t have any of the other disadvantages of work-around 1.

Our implementation of the solution is a little bit hackish because we need to preserve full compatibility with OCaml. So if a module named T exists in the current scope, a label T.x is looked in that module, not in a type t defined later on. So in some cases, it could be the case that even with our extension, there is no way to refer to a given label or constructor. This has not been an issue in practice. A variant of our solution (which was exactly what Pierre Weis originally proposed) would be to introduce a new syntax for type qualification to make it distinct from module qualification (Pierre proposed to use .^ instead of the dot). This is certainly cleaner, but the change is also more invasive; all the tools that depend on the concrete syntax of OCaml (Camlp4, maybe emacs mode) would need to be adapted, and the implementation is less trivial.

I can also mention another extension related to the same general issue, although it is technically orthogonal to the general solution I described above. This extra extension avoids the explicit qualification in many cases. The idea is that if you write {x = 1; y = 2; z = 3}, you clearly refer to the type s above, not t, because that type does not have a z field. More generally, instead of considering individually each label in a record literal expression or pattern, we treat all the labels at once and we look in the environment for the lastly defined type that contains all these labels. Record types that share some label names often differ in other label names, so as long as we use mostly the literal syntax to access fields (as opposed to the dot notation), we can disambiguate between record types this way. This trick does not work for sum types, though!

Instead of using type names as explicit qualifiers, one could try to use type information from the context in order to disambiguate between record types or sum types. One could give this idea a try, but it’s much more difficult that our syntactic solution, especially because we want to keep nice properties of the type system. Also, this solution might be turn out to be less predictable and less robust with respect to code refactoring (think about copy/pasting code from one function to another one; it might be the case that surrounding type annotations are enough in one context but not in the other one).