This post describes how unboxing for float values works in OCaml,
focusing on recent evolutions. Most of the content here applies
to boxed integer types as well (`int32`, `int64`
and `nativeint`).

Some background first. OCaml uses a uniform runtime representation of values, where each value can be stored in one machine word (32 or 64 bits). Some kind of values such a integers or Booleans are natively represented in this word. Other kinds are represented through a pointer: the word represents a pointer to the heap managed by the OCaml garbage collector (GC) and the actual data is found in the target block. The GC needs to distinguish between pointers and immediate values, and this is done by reserving one bit in each value words. This leaves only 31 or 63 bits for immediate values, and explain why OCaml integers are indeed 31 or 63 bit long, not 32 or 64. Since OCaml floats are 64-bit IEEE 754 numbers, they cannot be represented as an immediate value. The “regular” compilation strategy employed by OCaml is thus to wrap floats into allocated blocks. This is known as “boxing”. Because of it, a single float value occupies 3 words (on a 64-bit machine): the pointer, the block’s header word, and the data itself. This also mean that a block needs to be allocated everytime a new float value is created. Of course, this is quite bad for numerical intensive code.

Fortunately, OCaml uses a combination of techniques to alleviate
the overhead of boxed floats: specialised data structures (records
of floats and
(big)**arrays**)
that hold 64-bit floats in unboxed form, and local unboxed
representation of float variables (and intermediate results)
within a function body. (The compiler informs the runtime system
about which local variables should be tracked by the GC. Local
variables are thus not required to use the “uniform”
representation of values.)

There is also a proposal by Pierre Chambart to support passing
unboxed floats between functions (see
(**https://caml.inria.fr/mantis/view.php?id=5894**)), but this has
unfortunately not been merged yet. The impact on the ability to
write numerical code in a higher-level style without degrading
performances would be quite high.

The rest of this post will describe in more details the unboxing
technique within a function body. For the purpose of explaining
how it works, I will use a syntax close to “surface” OCaml even
though this optimisation happens at a rather late stage of the
compilation pipeline (namely, when mapping from the Clambda
intermediate representation to Cmm). I will also make the boxing
and unboxing operations explicit: `box e` means that the
expression `e` evaluates to a native (unboxed) float and
this operation allocates a block, stores the float in it, and
returns the pointer to the block; `unbox e` returns the
unboxed float found by dereferencing the pointer produced
by `e` (i.e. a memory load). Float literals will stand for
their native unboxed version, and float operators (such
as `+.`) are assumed to operate on unboxed floats. This
means that the regular translation of a source expression `e1
+. e2` results in `box (unboxed e1’ +. unboxed e2’)`,
i.e. we first retrieve the float from the block returned
by `e1’` and `e2’` (the translation of `e1`
and `e2`), compute the sum, and finally box the result.
Similarly, a float literal `42.` would be translated
into `box 42.`. Following these rules, a source
function `fun x -> (x *. x) +. 42.` would be translated to:

```
fun x -> box (unbox (box (unbox x *. unbox x)) +. unbox (box 42.))
```

The first optimisation is the immediate cancellation of `box`
and `unbox`: both `box (unbox e)` and `unbox (box
e)` can be simplified away. (This is justified because the
physical identity `==` of float blocks is not specified.) With this
rewriting, the code above becomes:

```
fun x -> box (unbox x *. unbox x +. 42.)
```

If you are interested to see this by yourself, you can
use `ocamlopt -dcmm` to dump the Cmm representation (where the
boxing is first materialised in the compilation pipeline):

```
(function camlFoo__fun_1322 (x/1320: val)
(alloc 2301 (+f (*f (load float64u x/1320) (load float64u x/1320)) 42.)))
```

We can clearly see the two memory reads (`unbox`) and the
allocation (`box`).

So, in a big arithmetic expression involving variables, float literals
and operators, there is typically one `unbox` for each variable
reference and one final `box` for the result.

Similarly for reading from an unboxed representation. Consider for
instance a record with only float fields; reading from a
field `r.l` would be translated to `box r.l` and if the
value is used directly as an argument of the float operator, the
boxing is avoided: `r.l +. 42.` is translated to `box (r.l +. 42.)`.
Writing to the field `r.l <- e` is translated to
`r.l <- unbox e’`; and again, the unboxing is eliminated if the
expression would return an explicitly boxed result. So `r.l <- r.l
+. 42.` does not involve any `box` or `unbox`
operation.

The `unbox` operator can actually be pushed inside some
expressions. For instance, `unbox (if … then e1 else e2)` is
rewritten to `if … then unbox e1 else unbox e2`. This was
introduced by a patch from Pascal Zimmer
(see **Mantis
4558** and the
corresponding **commit**.

Let’s now consider the more challenging question of unboxed local variables. With only the rules above, we would compile

```
fun x ->
let y = x +. 1. in
y *. 2.
```

into

```
fun x ->
let y = box (unbox x +. 1.) in
box (unbox y *. 2.)
```

The variable is used as a “channel” from the result of `+.` to
the first argument of `*.`. If we inlined the expression bound
to `y` on its use site, we would avoid this intermediate
boxing. Luckily, the OCaml compiler can decide to compile a
let-binding `let x = e1 in e2` into an unboxed form
`let x_unboxed = unbox e1’ in e2’` where references
to `x` in `e2` are translated to `box x_unboxed`.
This mean that instead of boxing on the binding site, we box on the
use site. Hopefully, the boxing site will immediately unbox the
variable (e.g. to use it as an argument of a float operator, or to
store it in a unboxed float data structure). If we apply this strategy,
the code above would be translated into:

```
fun x ->
let y = unbox x +. 1. in
box (y *. 2.)
```

Numerical code is often written in imperative style with references. Earlier in the compilation pipeline, such references can be turned into “mutable variables”. This happens when all accesses to the references are assignment or loads (i.e. the reference is not used as a “first-class” value, e.g. stored itself in a data structure or passed to another function), and all such accesses happen in the same function body (i.e. the reference is not captured by a local inner function). Let’s introduce some syntax to talk about such mutable local variables:

```
let mutable x = 10. in
for i = 1 to 1000 do
x <- x *. 2.
done;
x
```

which could be obtained from:

```
let x = ref 10. in
for i = 1 to 1000 do
x := !x *. 2.
done;
!x
```

(One could very well imagine that OCaml would be extended with such an explicit syntax at the source level. After all, there is already mutable identifiers, namely for-loop indices.)

The compiler can combine the two optimisations: “unboxing” of references (into local mutable variables) and unboxing of floats. So a “float ref” accumulator can be translated to an unboxed mutable float variable, and the code above becomes:

```
let mutable x = 10. in
for i = 1 to 1000 do
x <- x *. 2.
done;
box x
```

The only boxing is at the end. Without the unboxing strategy for compiling the let binding, one would get:

```
let mutable x = box 10. in
for i = 1 to 1000 do
x <- box (unbox x *. 2.)
done;
x
```

which would obviously be quite bad (1001 boxing instead of 1).

The question is now: when does the compiler decide to compile let bindings with this strategy? This strategy actually evolved quite a bit recently. But first, let’s think about which criteria would make sense:

Of course, we can only unbox if the bound expression returns a float. (Unfortunately, the unboxing decision is done on an intermediate code representation which does not hold explicit type information.)

If the bound expression naturally returns a boxed float (e.g. a value extracted from a boxed data structure), it does not ever seem to be a good idea to unbox it, at the risk of having to box it again later on each use site.

The argument above does not really hold for mutable variables: even if the initial value comes naturally in boxed form, subsequent assignments can be on unboxed values and having to box the new values to store them in the reference can be very costly.

One could also consider the number of use sites for the let-bound variable, how often they can be evaluated, and whether these use sites immediately unbox the variable (in which case no boxing is created on the use site): if there are multiple use sites and or a single use site under loop, delaying the boxing to the use site might increase the amount of boxing (except if it is in an “unboxing” context).

Now, let me describe the strategy actually implemented by the compiler and how it evolved in the previous months.

Before OCaml 4.03, the strategy was:

First, check whether the bound expression trivially yields a freshly boxed float (see below for details). If not, then no unboxing is performed.

Translate the let body, and look at all use sites. If one of them does not immediately unbox the variable, then:

If the variable is ever assigned, then no unboxing is performed for the let-bound variable.

Otherwise, we bind an unboxed variable to the bound expression and also compute a boxed version of it. Each use site picks the form it is interested in.

If all use site immediately unbox the variable, we bind only an unboxed variable and use it everywhere. This applies also to mutable variables.

The criterion to decide that the bound expression was a candidate for
unboxing was rather restrictive: only float operators and float
literals would be considered (inner sequences or let-binding in the
bound expression would be traversed as well). In particular, the
following cases would be excluded: simple identifiers, branching
constructs (e.g. an `if-then-else`), function calls.
Basically, the criterion was restricted to cases that would explicitly
box a float. But this is really overly restrictive for mutable variables,
where even if no boxing is avoided on the binding site, a lot of them
can be avoided later.

So the following cases would not allow unboxing the let-bound variable:

```
let x = ref (if cond then 1. else 0.) in
(* or: ref x0
or: ref (f 10.)
*)
for i = 1 to 1000 do
x := !x +. f i
done;
...
```

To force unboxing in these cases, the programmer could force the bound expression to look like a float operator:

```
let x = ref (0. +. ...) in
...
```

Moreover, with this strategy, a single use site was sufficient to break the unboxing. A particularly common case is with a mutable variable whose final value is returned by the function:

```
let f () =
let r = ref 0. in
for i = 1 to 1000 do
...
done;
!r
```

Since the result of functions are (for now) always in boxed form, the final use of the mutable variable would force the mutable float variable to be boxed. Again, the programmer could work around that by writing:

```
let f () =
let r = ref 0. in
for i = 1 to 1000 do
...
done;
!r +. 0.
```

It is not enough to track this specific case, but also all potential use sites that could break the unboxing. Example:

```
let f () =
let r = ref 0. in
for i = 1 to 1000 do
if i mod 100 = 0 then Printf.printf "%f\n" !r;
...
done;
!r +. 0.
```

Here the function takes a boxed float, and again one would need
to write `!r +. 0.` instead of `!r` to keep the nice
boxing. It should be clear that this strategy was rather fragile.

Vladimir Brankov proposed to change the strategy in
his **pull request
#109**, merged in
this **commit**.
The new strategy is to always unbox the variable when the bound
expression is a “valid candidate” (same definition as before), without
looking at the use sites. This addresses the problem with a single use
site breaking the unboxing.

Not having to turn `!r` into `!r +. 0.` any more was a
big relief!

The new strategy deliberately ignores the use sites. The unboxing
decision is now purely local, which makes it more robust to changes
in the `let` body. But yes, we forget about some potentially
useful information. Perhaps one should now think about adding a way
to *prevent* unboxing, although I’ve never felt a concrete need
for it.

With the new strategy, each use site can potentially box the unboxed
variable. One can easily build an example where a variable is unboxed
and then used within a long loop, forcing much more boxing than with
the previous strategy. See **below for a proposal to
address that risk**.

**Pull request
#107** (merged in
this **commit**),
again by Vladimir Brankov, allowed more cases of bound expressions to
be unboxed: namely, branching constructions where all
branches finish with an explicit boxing.

My **pull request #259** (see also
**Mantis ticket
#7022**) changes the traversal strategy, exploiting the fact that
the unboxing decision did no longer depend on the use sites, only on
the bound expression. Instead of translating the let body and then
only looking at use sites (to rewrite references to the now unboxed
variable), we now keep an environment of unboxed variables in the
recursive descent that implements the Clambda->Cmm mapping. This is
more efficient (compile-time wise), simpler (avoid some rewriting
pass), but, more importantly, this allows us to remove all boxing in a case such as:

```
let x = ref 1. in
for i = 1 to 100000 do
let y = if i mod 2 = 0 then !x else !x +. 1. in
x := !x +. y
done
```

Previously, the body of the `let x = … in …` binding was
compiled without knowing that `x` would be unboxed, and so
the `if-then-else` construction was not considered
to *always* return a freshly boxed float and thus was not
considered for unboxing.

So this was a first step towards relaxing the criterion to decide if a bound expression would be a candidate for unboxing. For the first time, one could unbox an expression even if it would not explicitly (always) create a freshly boxed value.

This is where OCaml 4.03 was branched.

The next natural step was to allow even more expressions to be unboxed. In particular, the following two bad situations could easily happen:

Mutable variable not being unboxed because the initial expression was not trivially returning a boxed value (e.g.

`let x = ref nan in…`). All stores in the references would then force some boxing if the new value comes from a float operation!Extra boxing in branching constructs. E.g. in

`let x = (if … then 1. +. x. else f ()) in …`, the result of the`+.`operator would always be boxed since the binding was not unboxed (because of the other branch).

My **pull
request #336** (which has been merged and will be part of OCaml 4.04)
addresses these two problems.

The general approach is the following:

For mutable variables known to hold floats, we always use the unboxed translation for the

`let`binding. Even if this does not avoid boxing for the initial value, we avoid the risk of adding boxing for each assignment in a loop.For immutable variables known to hold floats, we unbox if this

*might*avoid boxing in the bound expression. In particular, if the bound expression is an`if-then-else`, we only require one branch to box its result in order to trigger the unboxing (and thus avoid that boxing).If we are unsure about the type of the bound expression, we use the previous strategy: we require that the bound expression

*necessarily*generates a boxing to trigger the unboxing strategy.

Note that even if one branch returns a freshly boxed float result,
other branches of an `if-then-else` might return a value of a
different type. This is because of GADTs (which can specialise a type
variable in a branch), and there is currently no way to detect this
situation at the Clambda level. This is why we need to distinguish
the second and third cases above.

How do we determine if a let binding creates a float variable? The information is obviously available in the Typedtree representation, and the PR added some code to detect this situation at this level and pass this information down to Clambda. Similarly, we kept track of which variables correspond to eliminated references (i.e. mutable variables). Alternatively, one could have done an extra pass to detect variables which are being assigned, but this would be more costly.

Note that since unboxing happens after inlining, it does not currently
benefit from extra type information obtained after specialising a
polymorphic function at type `float`.

As explained above, the unboxing strategy since OCaml 4.03 does not consider use sites to decide if a let-bound variable should be kept in boxed or unboxed form. This can lead to multiple boxing of the same float value.

This section describes one possible approach to address this issue.
I had proposed it in **Mantis
ticket 5204**, but it is not currently integrated.

The idea is to “cache the boxing”. Whenever the body of
the `let` has some use site that boxes the
variable `x`, one would create a pair of variables
`x_boxed` and `x_unboxed`. The unboxed form
stores the “official” value for the variable, and the boxed
form is used as a cache for it.

Assignment is done directly to `x_unboxed` without
touching `x_boxed`. Loading from the variable in a context
that requires the unboxed form is similarly done by loading directly
from `x_unboxed`. This means that as long as the variable is
used in unboxing contexts, everything works as with the current
strategy.

The interesting part is loading from the variable in a context that
requires the boxed form. This is done by checking if the value
pointed by `x_boxed` is equal to `x_unboxed` and if so
returning `x_boxed` directly; otherwise refreshing
first `x_boxed` by boxing the current value
of `t_unboxed`. So the pseudo code for getting
the value in boxed form is:

```
if unboxed x_boxed != x_boxed then x_boxed <- box x_unboxed;
x_boxed
```

With this approach, boxing happens at most once after each assignment (that actually changes the value).

The way the initial values for `x_boxed`
and `x_unboxed` are computed would depend on the bound
expression. If this expression necessarily forces a boxing, one can
remove this boxing and compute `x_unboxed` directly, storing
a dummy value in `x_boxed` (e.g. a global boxed `0.`
number). We thus avoid one boxing which might be useless (if the
variable is never read in a boxing context before being assigned).
Otherwise, for instance if the expression is a function call, we
bind `x_boxed` to the result of the expression and we
load `x_unboxed` from it. This allows us to avoid one extra
boxing for the first use in a boxed context.

This scheme would ensure that no more boxing happen than either the current strategy or the previous one. But it can generates bigger code with some conditional branching and memory loads, so the actual impact on performances would need to be investigated. Static heuristics could be used to remove some of the dynamic tests (e.g. to remove the useless check in case of successive loads in boxing context with no possibility of an assignment between them). Anyway, more work is needed before this can be integrated.

Another aspect related to unboxing where OCaml was recently improved
is the topic of external primitives (those declared with `external
foo: …`), typically C or Fortran functions to be called from
OCaml. Before OCaml 4.03, one could annotate
these `external` declarations to inform the compiler
that *all* their argument and also their return value are
native unboxed floats. This allowed the compiler to avoid boxing
the arguments, and the external code to avoid boxing the result
(boxing of the result is on the OCaml side if required).

Starting with version 4.03, OCaml understands special attributes
on `external` declarations to specify which arguments and/or
the result are unboxed floats: this
was **pull request
#167** by Jérémie Dimino. This is very good, since one can now
avoid boxing for much more primitives (some
examples: `Pervasives.classify_float`, `Sys.time`, the
specialised version of the generic comparison).