This post describes how unboxing for float values works in OCaml, focusing on recent evolutions. Most of the content here applies to boxed integer types as well (int32, int64 and nativeint).
Some background first. OCaml uses a uniform runtime representation of values, where each value can be stored in one machine word (32 or 64 bits). Some kind of values such a integers or Booleans are natively represented in this word. Other kinds are represented through a pointer: the word represents a pointer to the heap managed by the OCaml garbage collector (GC) and the actual data is found in the target block. The GC needs to distinguish between pointers and immediate values, and this is done by reserving one bit in each value words. This leaves only 31 or 63 bits for immediate values, and explain why OCaml integers are indeed 31 or 63 bit long, not 32 or 64. Since OCaml floats are 64-bit IEEE 754 numbers, they cannot be represented as an immediate value. The “regular” compilation strategy employed by OCaml is thus to wrap floats into allocated blocks. This is known as “boxing”. Because of it, a single float value occupies 3 words (on a 64-bit machine): the pointer, the block’s header word, and the data itself. This also mean that a block needs to be allocated everytime a new float value is created. Of course, this is quite bad for numerical intensive code.
Fortunately, OCaml uses a combination of techniques to alleviate the overhead of boxed floats: specialised data structures (records of floats and (big)arrays) that hold 64-bit floats in unboxed form, and local unboxed representation of float variables (and intermediate results) within a function body. (The compiler informs the runtime system about which local variables should be tracked by the GC. Local variables are thus not required to use the “uniform” representation of values.)
There is also a proposal by Pierre Chambart to support passing unboxed floats between functions (see (https://caml.inria.fr/mantis/view.php?id=5894)), but this has unfortunately not been merged yet. The impact on the ability to write numerical code in a higher-level style without degrading performances would be quite high.
The rest of this post will describe in more details the unboxing technique within a function body. For the purpose of explaining how it works, I will use a syntax close to “surface” OCaml even though this optimisation happens at a rather late stage of the compilation pipeline (namely, when mapping from the Clambda intermediate representation to Cmm). I will also make the boxing and unboxing operations explicit: box e means that the expression e evaluates to a native (unboxed) float and this operation allocates a block, stores the float in it, and returns the pointer to the block; unbox e returns the unboxed float found by dereferencing the pointer produced by e (i.e. a memory load). Float literals will stand for their native unboxed version, and float operators (such as +.) are assumed to operate on unboxed floats. This means that the regular translation of a source expression e1 +. e2 results in box (unboxed e1’ +. unboxed e2’), i.e. we first retrieve the float from the block returned by e1’ and e2’ (the translation of e1 and e2), compute the sum, and finally box the result. Similarly, a float literal 42. would be translated into box 42.. Following these rules, a source function fun x -> (x *. x) +. 42. would be translated to:
fun x -> box (unbox (box (unbox x *. unbox x)) +. unbox (box 42.))
The first optimisation is the immediate cancellation of box and unbox: both box (unbox e) and unbox (box e) can be simplified away. (This is justified because the physical identity == of float blocks is not specified.) With this rewriting, the code above becomes:
fun x -> box (unbox x *. unbox x +. 42.)
If you are interested to see this by yourself, you can use ocamlopt -dcmm to dump the Cmm representation (where the boxing is first materialised in the compilation pipeline):
(function camlFoo__fun_1322 (x/1320: val) (alloc 2301 (+f (*f (load float64u x/1320) (load float64u x/1320)) 42.)))
We can clearly see the two memory reads (unbox) and the allocation (box).
So, in a big arithmetic expression involving variables, float literals and operators, there is typically one unbox for each variable reference and one final box for the result.
Similarly for reading from an unboxed representation. Consider for instance a record with only float fields; reading from a field r.l would be translated to box r.l and if the value is used directly as an argument of the float operator, the boxing is avoided: r.l +. 42. is translated to box (r.l +. 42.). Writing to the field r.l <- e is translated to r.l <- unbox e’; and again, the unboxing is eliminated if the expression would return an explicitly boxed result. So r.l <- r.l +. 42. does not involve any box or unbox operation.
The unbox operator can actually be pushed inside some expressions. For instance, unbox (if … then e1 else e2) is rewritten to if … then unbox e1 else unbox e2. This was introduced by a patch from Pascal Zimmer (see Mantis 4558 and the corresponding commit.
Let’s now consider the more challenging question of unboxed local variables. With only the rules above, we would compile
fun x -> let y = x +. 1. in y *. 2.
fun x -> let y = box (unbox x +. 1.) in box (unbox y *. 2.)
The variable is used as a “channel” from the result of +. to the first argument of *.. If we inlined the expression bound to y on its use site, we would avoid this intermediate boxing. Luckily, the OCaml compiler can decide to compile a let-binding let x = e1 in e2 into an unboxed form let x_unboxed = unbox e1’ in e2’ where references to x in e2 are translated to box x_unboxed. This mean that instead of boxing on the binding site, we box on the use site. Hopefully, the boxing site will immediately unbox the variable (e.g. to use it as an argument of a float operator, or to store it in a unboxed float data structure). If we apply this strategy, the code above would be translated into:
fun x -> let y = unbox x +. 1. in box (y *. 2.)
Numerical code is often written in imperative style with references. Earlier in the compilation pipeline, such references can be turned into “mutable variables”. This happens when all accesses to the references are assignment or loads (i.e. the reference is not used as a “first-class” value, e.g. stored itself in a data structure or passed to another function), and all such accesses happen in the same function body (i.e. the reference is not captured by a local inner function). Let’s introduce some syntax to talk about such mutable local variables:
let mutable x = 10. in for i = 1 to 1000 do x <- x *. 2. done; x
which could be obtained from:
let x = ref 10. in for i = 1 to 1000 do x := !x *. 2. done; !x
(One could very well imagine that OCaml would be extended with such an explicit syntax at the source level. After all, there is already mutable identifiers, namely for-loop indices.)
The compiler can combine the two optimisations: “unboxing” of references (into local mutable variables) and unboxing of floats. So a “float ref” accumulator can be translated to an unboxed mutable float variable, and the code above becomes:
let mutable x = 10. in for i = 1 to 1000 do x <- x *. 2. done; box x
The only boxing is at the end. Without the unboxing strategy for compiling the let binding, one would get:
let mutable x = box 10. in for i = 1 to 1000 do x <- box (unbox x *. 2.) done; x
which would obviously be quite bad (1001 boxing instead of 1).
The question is now: when does the compiler decide to compile let bindings with this strategy? This strategy actually evolved quite a bit recently. But first, let’s think about which criteria would make sense:
Of course, we can only unbox if the bound expression returns a float. (Unfortunately, the unboxing decision is done on an intermediate code representation which does not hold explicit type information.)
If the bound expression naturally returns a boxed float (e.g. a value extracted from a boxed data structure), it does not ever seem to be a good idea to unbox it, at the risk of having to box it again later on each use site.
The argument above does not really hold for mutable variables: even if the initial value comes naturally in boxed form, subsequent assignments can be on unboxed values and having to box the new values to store them in the reference can be very costly.
One could also consider the number of use sites for the let-bound variable, how often they can be evaluated, and whether these use sites immediately unbox the variable (in which case no boxing is created on the use site): if there are multiple use sites and or a single use site under loop, delaying the boxing to the use site might increase the amount of boxing (except if it is in an “unboxing” context).
Now, let me describe the strategy actually implemented by the compiler and how it evolved in the previous months.
Before OCaml 4.03, the strategy was:
First, check whether the bound expression trivially yields a freshly boxed float (see below for details). If not, then no unboxing is performed.
Translate the let body, and look at all use sites. If one of them does not immediately unbox the variable, then:
If the variable is ever assigned, then no unboxing is performed for the let-bound variable.
Otherwise, we bind an unboxed variable to the bound expression and also compute a boxed version of it. Each use site picks the form it is interested in.
If all use site immediately unbox the variable, we bind only an unboxed variable and use it everywhere. This applies also to mutable variables.
The criterion to decide that the bound expression was a candidate for unboxing was rather restrictive: only float operators and float literals would be considered (inner sequences or let-binding in the bound expression would be traversed as well). In particular, the following cases would be excluded: simple identifiers, branching constructs (e.g. an if-then-else), function calls. Basically, the criterion was restricted to cases that would explicitly box a float. But this is really overly restrictive for mutable variables, where even if no boxing is avoided on the binding site, a lot of them can be avoided later.
So the following cases would not allow unboxing the let-bound variable:
let x = ref (if cond then 1. else 0.) in (* or: ref x0 or: ref (f 10.) *) for i = 1 to 1000 do x := !x +. f i done; ...
To force unboxing in these cases, the programmer could force the bound expression to look like a float operator:
let x = ref (0. +. ...) in ...
Moreover, with this strategy, a single use site was sufficient to break the unboxing. A particularly common case is with a mutable variable whose final value is returned by the function:
let f () = let r = ref 0. in for i = 1 to 1000 do ... done; !r
Since the result of functions are (for now) always in boxed form, the final use of the mutable variable would force the mutable float variable to be boxed. Again, the programmer could work around that by writing:
let f () = let r = ref 0. in for i = 1 to 1000 do ... done; !r +. 0.
It is not enough to track this specific case, but also all potential use sites that could break the unboxing. Example:
let f () = let r = ref 0. in for i = 1 to 1000 do if i mod 100 = 0 then Printf.printf "%f\n" !r; ... done; !r +. 0.
Here the function takes a boxed float, and again one would need to write !r +. 0. instead of !r to keep the nice boxing. It should be clear that this strategy was rather fragile.
Vladimir Brankov proposed to change the strategy in his pull request #109, merged in this commit. The new strategy is to always unbox the variable when the bound expression is a “valid candidate” (same definition as before), without looking at the use sites. This addresses the problem with a single use site breaking the unboxing.
Not having to turn !r into !r +. 0. any more was a big relief!
The new strategy deliberately ignores the use sites. The unboxing decision is now purely local, which makes it more robust to changes in the let body. But yes, we forget about some potentially useful information. Perhaps one should now think about adding a way to prevent unboxing, although I’ve never felt a concrete need for it.
With the new strategy, each use site can potentially box the unboxed variable. One can easily build an example where a variable is unboxed and then used within a long loop, forcing much more boxing than with the previous strategy. See below for a proposal to address that risk.
Pull request #107 (merged in this commit), again by Vladimir Brankov, allowed more cases of bound expressions to be unboxed: namely, branching constructions where all branches finish with an explicit boxing.
My pull request #259 (see also Mantis ticket #7022) changes the traversal strategy, exploiting the fact that the unboxing decision did no longer depend on the use sites, only on the bound expression. Instead of translating the let body and then only looking at use sites (to rewrite references to the now unboxed variable), we now keep an environment of unboxed variables in the recursive descent that implements the Clambda->Cmm mapping. This is more efficient (compile-time wise), simpler (avoid some rewriting pass), but, more importantly, this allows us to remove all boxing in a case such as:
let x = ref 1. in for i = 1 to 100000 do let y = if i mod 2 = 0 then !x else !x +. 1. in x := !x +. y done
Previously, the body of the let x = … in … binding was compiled without knowing that x would be unboxed, and so the if-then-else construction was not considered to always return a freshly boxed float and thus was not considered for unboxing.
So this was a first step towards relaxing the criterion to decide if a bound expression would be a candidate for unboxing. For the first time, one could unbox an expression even if it would not explicitly (always) create a freshly boxed value.
This is where OCaml 4.03 was branched.
The next natural step was to allow even more expressions to be unboxed. In particular, the following two bad situations could easily happen:
Mutable variable not being unboxed because the initial expression was not trivially returning a boxed value (e.g. let x = ref nan in…). All stores in the references would then force some boxing if the new value comes from a float operation!
Extra boxing in branching constructs. E.g. in let x = (if … then 1. +. x. else f ()) in …, the result of the +. operator would always be boxed since the binding was not unboxed (because of the other branch).
My pull request #336 (which has been merged and will be part of OCaml 4.04) addresses these two problems.
The general approach is the following:
For mutable variables known to hold floats, we always use the unboxed translation for the let binding. Even if this does not avoid boxing for the initial value, we avoid the risk of adding boxing for each assignment in a loop.
For immutable variables known to hold floats, we unbox if this might avoid boxing in the bound expression. In particular, if the bound expression is an if-then-else, we only require one branch to box its result in order to trigger the unboxing (and thus avoid that boxing).
If we are unsure about the type of the bound expression, we use the previous strategy: we require that the bound expression necessarily generates a boxing to trigger the unboxing strategy.
Note that even if one branch returns a freshly boxed float result, other branches of an if-then-else might return a value of a different type. This is because of GADTs (which can specialise a type variable in a branch), and there is currently no way to detect this situation at the Clambda level. This is why we need to distinguish the second and third cases above.
How do we determine if a let binding creates a float variable? The information is obviously available in the Typedtree representation, and the PR added some code to detect this situation at this level and pass this information down to Clambda. Similarly, we kept track of which variables correspond to eliminated references (i.e. mutable variables). Alternatively, one could have done an extra pass to detect variables which are being assigned, but this would be more costly.
Note that since unboxing happens after inlining, it does not currently benefit from extra type information obtained after specialising a polymorphic function at type float.
As explained above, the unboxing strategy since OCaml 4.03 does not consider use sites to decide if a let-bound variable should be kept in boxed or unboxed form. This can lead to multiple boxing of the same float value.
This section describes one possible approach to address this issue. I had proposed it in Mantis ticket 5204, but it is not currently integrated.
The idea is to “cache the boxing”. Whenever the body of the let has some use site that boxes the variable x, one would create a pair of variables x_boxed and x_unboxed. The unboxed form stores the “official” value for the variable, and the boxed form is used as a cache for it.
Assignment is done directly to x_unboxed without touching x_boxed. Loading from the variable in a context that requires the unboxed form is similarly done by loading directly from x_unboxed. This means that as long as the variable is used in unboxing contexts, everything works as with the current strategy.
The interesting part is loading from the variable in a context that requires the boxed form. This is done by checking if the value pointed by x_boxed is equal to x_unboxed and if so returning x_boxed directly; otherwise refreshing first x_boxed by boxing the current value of t_unboxed. So the pseudo code for getting the value in boxed form is:
if unboxed x_boxed != x_boxed then x_boxed <- box x_unboxed; x_boxed
With this approach, boxing happens at most once after each assignment (that actually changes the value).
The way the initial values for x_boxed and x_unboxed are computed would depend on the bound expression. If this expression necessarily forces a boxing, one can remove this boxing and compute x_unboxed directly, storing a dummy value in x_boxed (e.g. a global boxed 0. number). We thus avoid one boxing which might be useless (if the variable is never read in a boxing context before being assigned). Otherwise, for instance if the expression is a function call, we bind x_boxed to the result of the expression and we load x_unboxed from it. This allows us to avoid one extra boxing for the first use in a boxed context.
This scheme would ensure that no more boxing happen than either the current strategy or the previous one. But it can generates bigger code with some conditional branching and memory loads, so the actual impact on performances would need to be investigated. Static heuristics could be used to remove some of the dynamic tests (e.g. to remove the useless check in case of successive loads in boxing context with no possibility of an assignment between them). Anyway, more work is needed before this can be integrated.
Another aspect related to unboxing where OCaml was recently improved is the topic of external primitives (those declared with external foo: …), typically C or Fortran functions to be called from OCaml. Before OCaml 4.03, one could annotate these external declarations to inform the compiler that all their argument and also their return value are native unboxed floats. This allowed the compiler to avoid boxing the arguments, and the external code to avoid boxing the result (boxing of the result is on the OCaml side if required).
Starting with version 4.03, OCaml understands special attributes on external declarations to specify which arguments and/or the result are unboxed floats: this was pull request #167 by Jérémie Dimino. This is very good, since one can now avoid boxing for much more primitives (some examples: Pervasives.classify_float, Sys.time, the specialised version of the generic comparison).