Alain Frisch 2015-01-18

This post is about the special representation that OCaml uses for arrays of floating point values. My claim is that this special representation is useful, but also harmful in some contexts. There are better alternatives that achieve the same benefits without the drawbacks.

If you are familiar about the OCaml representation of value and the special representation for arrays of floats, you can skip the first two sections.

## A quick reminder about OCaml uniform value representation

OCaml follows a uniform runtime representation: all values can be kept in a single machine word (32- or 64-bit); a value is either an unboxed integer or a pointer, and the two cases are distinguished by the lower bit: 0 for pointers – which are thus assumed to be 2-aligned; and 1 for integers. Unboxed integers are used to represent OCaml values of type int (encoded as 2*X+1, which restricts them to 3163 bits), but also values of type bool, char and constant constructors (in sum types and polymorphic variants). Pointers usually refer to blocks in the OCaml heap, which have one header word followed by one or several data words. The header encodes the block’s size, 2 bits reserved for the Garbage Collector, and a 8-bit tag, which categorizes the nature of the block (e.g. 0 for tuples, records, arrays; 247 for function closures; 252 for strings) or identifies non-constant constructors.

This representation has many advantages. In particular, it allows simple support for data abstraction, separate compilation, polymorphic code and generic operations (such as the Garbage Collector, generic equality/hash/marshaling functions).

## Floats and float arrays

OCaml has a single type for floating point values (called float) that are double-precision (64-bit) values. Since 64-bit cannot be represented as an unboxed integer even on 64-bit machines (because unbox integers are only 63-bit), a float value is represented by a pointer to a block (tag 253) holding one data word (64-bit) or two (32-bit). Floats are thus boxed, and a single float value requires three or four words (one for representing the pointer, one for the block header, and one or two for the float bits). Note that this representation is only required when floats are stored in data structures, or passed from one function to another. Within a single function’s body, local float values can be represented unboxed in memory or in registers, and the OCaml compiler apply such optimizations. There are also plans to allow unboxed calling conventions between functions.

OCaml arrays are normally represented by a pointer to a block with tag 0 holding one data word per array element. The length of the array is thus obtained directly from the size field of the block header, and accessing the nth-element of the array is very simple.

With this representation, an array of N floats requires around 3*N words on the heap (on 64-bit machines, assuming no sharing between boxed floats in the array), which is quite bad. Also, a simple operation such as adding 1 to an element of the array requires one extra memory indirection to read the float value, one memory allocation to hold the result of the arithmetic operation, and several memory writes. Numerical code would suffer quite a bit from such a representation.

To avoid such memory and runtime overhead, OCaml uses a special representation for arrays of floats. They are represented by blocks with a dedicated tag (254) that hold the unboxed floats. So an array of N floats requires around N words (on 64-bit machines), and a simple operation such as adding 1 to an element does not require any allocation.

While introducing the special unboxed representation for arrays of floats might seem a clever and useful hack, it has some negative consequences.

When a function accesses an array (to read or write an element), it typically needs to account for the two possible representations of the array. For instance, the OCaml functions

let f arr = arr.(0)
let g arr = arr.(0) <- arr.(1)


are compiled into this cmm code (I’m using ‘-unsafe’ mode to avoid cluttering the code with bound checks):

(function camlFoo__f_1215 (arr/1216: addr)

(let
newval/1224
(if (!= (load unsigned int8 (+a arr/1218 -8)) 254)
(if (!= (load unsigned int8 (+a arr/1218 -8)) 254)
(extcall "caml_modify" arr/1218 newval/1224 unit)
(store float64u arr/1218 (load float64u newval/1224))))
1a)


You can see the explicit checks on the array header. They have a runtime performance hit, and the generated code size greatly increases.

The checks are removed if the compiler knows statically whether the elements in the array are floats or not. Unfortunately, it is quite hard to know statically that something is not a float, since the type system cannot express the fact that an abstract type is not an alias for float or that a type variable cannot be instantiated to float. So in some cases, even adding type annotations is not enough to get rid of the dynamic checks, even if you know that the array doesn’t contain floats.

If the compiler knows that the array contains floats, it shortcuts the dynamic check. This means that an array of floats is not allowed to use the standard representation: it has to use the unboxed form. The consequence is that when a new array is created, one must decide it is a float array or not. Again, type abstract makes it generally impossible to have this information statically, so one needs to rely on a runtime check. Concretely, Array.make observes the initial value and bases its decision on whether this value is a float or not (looking at the header tag).

The consequence of the paragraph above is a very strong constraint on the runtime system: a value which is represented at runtime by a float (i.e. a block with tag 253) must be of static type float (or an alias of it). This interacts badly with other (potential) optimizations. The existing optimized representation of lazy values is an example. When a lazy value value is forced, the thunk is replaced to a forward pointer (tag 250) to the result and when the GC runs, it short-circuits such forward pointers (i.e. the forced lazy value becomes physically equal to the result). This adds a tiny bit of extra work when forcing a value, because several cases need to be accounted for (an unforced thunk, with tag 256; a forward pointer, with tag 250; a block with any other tag, or an immediate integer: the result after short-circuiting); but this avoids one memory read (when the value is forced and after short-circuiting – checking the tag has to be done anyway) and reduces the number of blocks in the heap (and thus time spent in the GC). Anyway, this blog post is not about this optimization on lazy values. The point is that this optimization doesn’t work well when the lazy result is a float: short-circuiting the forward pointer would mean that the value of type float Lazy.t would be represented as a float, and this would break the invariant described above. Not a big deal, as it is easy enough to take this into account and disable short-circuiting when the lazy result is a float. But we don’t benefit from the nice optimization on lazy floats. And more importantly, this shows that we need to be careful when introducing clever representation for seemingly unrelated types (here, Lazy.t).

For instance, one might want, some day, to let developers specify inline attributes on sum type constructors, as in:

type value =
| Str [@inline] of string
| Int [@inline] of int

type tree =
| Leaf [@inline] of int
| Node of int * tree * tree


The meaning of this attribute is that the constructor behaves as the identity on runtime values: no allocation is required to wrap the argument inside the constructor.

The compiler would accept such declarations if enough information remains at runtime to distinguish between constructors (e.g. defining two different inline constructors with a string argument in a single sum type would be rejected). Perhaps not the kind of things you want to expose to beginners, but certainly something that could bring interesting performance gains in tight data structures and algorithms. Unfortunately, even if floats have their own runtime tag, this nice optimization wouldn’t be available for them, only because of unboxed float arrays. Too bad if you write an interpreter for a language whose value type has a “Float of float” case, or if you manipulate binary trees with float leaves.

Another related interesting optimization would be a special representation for option types that would avoid the wrapping/allocation for the Some constructor (with some special representation for None, for Some None, etc). This could give huge gains in some contexts, and many people would love to see this implemented. But again, the story becomes more complex because of floats. One would have to plan for extra runtime checks and extra code paths to disable the optimization on floats.

The fact that all kinds of potential nice optimizations are made more complex and less effective because of the automatic special representation for float arrays is my main criticism against that special representation (even more than the existing overhead on array accesses).

## Why only floats?

People doing numerical code certainly appreciate that their float arrays are unboxed. But other element types would also benefit from a custom array representation:

• Arrays of ints: the potential gain here is not about avoiding boxing (because ints are not boxed), but about avoiding useless GC scans over large int arrays. Concretely, one could use the same header tag as for strings/bytes.

• Arrays of bools: since a bool requires only one bit, using a full word for each bool in an array is a clear waste of space; and one could also let the GC know that it doesn’t need to scan the array (as for ints).

• Arrays of chars: using the normal array representation means that storing N bytes takes around N machine words (OCaml chars are really bytes).

Of course, nobody uses char array in OCaml, when there is a much better alternative: bytes (previously, string). And does anyone complain that char array is not optimized automatically? I don’t think so!

## Proposal

Considering the bad consequences of the automatic custom representation for float arrays, the long-term goals of having more clever representations for e.g. option types and other inlined constructors, and the usefulness of custom representation for other array types (ints, bools, chars), here is my proposal:

• Define a module type:
module type ARRAY = sig
type elt
type t

val make: int -> elt -> t
val length: t -> int

val get: t -> int -> elt
val (.()): t -> int -> elt

val set: t -> int -> elt -> unit
val (.()<-): t -> int -> elt -> unit

val sub: t -> int -> int -> t
val max_length: int
(* ... *)
end

• Ensure that Bytes satisfies the ARRAY with type elt = char interface.

• Define a module FloatArray : ARRAY with type elt = float (unboxed representation), and similarly for int and bool (bit field), ensuring that int and bool arrays are not scanned by the GC.

• Similarly, one could expose bigarrays as another available implementation for these modules.

• Expose a functor Array.MakeImpl: functor (X : sig type t end) -> ARRAY with type elt = X.t which generates array implementations using the standard representation.

• In order to support polymorphic algorithms that require arrays, and without turning them into functors, define:

type ('a, 'b) array_impl =
(module ARRAY with type elt = 'a and type t = 'b)


so that functions can be given types such as:

val sort: ('a, 'b) array_impl -> ('a -> 'a -> int) -> 'b -> 'b
val array_map: ('a, 'b) array_impl -> ('c, 'd) array_impl -> ('a -> 'c) -> 'b -> 'd


It could also be useful to have another abbreviation:

type 'a array_impl' = (module ARRAY with type elt = 'a)


This would be useful for algorithms that use arrays only internally, and also for:

val mk_impl: unit -> 'a array_impl'  (* same as Array.MakeImpl *)
`
• Other nice functors/functions can be implemented e.g. for arrays on t1 * t2 from implementations of arrays on type t1 and on type t2 (i.e. keeping a pair of arrays).

• Of course, get rid of the current automatic representation of float arrays,and simplify the code generator and runtime system accordingly.

Thanks to the new array access operators, one can simply write FloatArray.(a.(i) <- b.(i)), or more realistically let open FloatAray in … in functions which are mostly dealing with float arrays.

The main consequence of such a move is that numerical code that manipulates float arrays and need to remain efficient would need to switch from float array to FloatArray.t. I don’t think this would be too bad in practice. It will also avoid that a forgotten type annotation (required to turn a polymorphic function into a function on float array) results in a huge performance loss.

The parametric ‘a array type constructors remains, and it’s perfectly fine to use it directly in any code where you don’t need special representations. As a matter of fact, I don’t think there will be a lot of uses for the Array.MakeImpl functor or for the array_impl type. In my experience, arrays are most often used in monomorphic contexts, and when they don’t, performance might not matter too much (so we don’t need the custom representations).

## Migration path

If we skip the last bullet point of the proposal above (dropping the automatic representation of float arrays), all of it can be implemented now, and quite easily.

Then, the legacy automatic representation of float arrays could be controlled through a configure-time switch, enabled by default for some versions of OCaml (to give time to people to go through their code base and use the new style), then disabled by default, and finally completely removed.

To help people find occurrence of generic float array in their code base, one could combine static and dynamic approaches:

• Parsing .cmt files, it is easy enough to spot occurrences of float array.

• If needed, a configure-time switch could trigger a warning on stderr e.g. when Array.make is called on float.

I’d like to hear the opinion of the community of OCaml developers about that proposal!

LexiFi • 892 rue Yves Kermen • F-92100 Boulogne-Billancourt • France