Sliding-window max (and other associative operators)

Consider a sequence of $n$ numbers and imagine you need to compute the maximum of all adjacent sub-sequences of length $k$. How would you do that efficiently?

Let’s write $x_0, \ldots, x\_{n-1}$ for the original sequence. We want to compute $y\_0, \ldots, y\_{n-k}$ with $y_i = \max(x\_i,\ldots,x\_{i+k-1})$. Such k-ary $\max$ needs to be computed in terms its binary version $\max(x\_1,x\_2)$. To simplify notations, we will write it $[x_1 x_2]$.

The naive approach

A naive way to compute $y_i$ is by “folding” the binary operator. For $k = 5$, we could write:

$$y\_0 = [ [ [ [ x\_0 x\_1 ] x\_2 ] x\_3 ] x\_4 ]$$ $$y\_1 = [ [ [ [ x\_1 x\_2 ] x\_3 ] x\_4 ] x\_5 ]$$ $$y\_2 = [ [ [ [ x\_2 x\_3 ] x\_4 ] x\_5 ] x\_6] $$ $$\ldots$$

Each $y_i$ requires computing $k-1$ binary max, and there is no sharing between sub-expressions between two $y_i$ and $y_j$. So, to compute all $y_i$, we need $(n-k+1)(k-1)$ binary operations, i.e. roughly $nk$ for large $n$ and $k$. We could “fold in the other direction”:

$$y\_0 = [ x\_0 [ x\_1 [ x\_2 [ x\_3 x\_4 ] ] ] ]$$ $$y\_1 = [ x\_1 [ x\_2 [ x\_3 [ x\_4 x\_5 ] ] ] ]$$ $$y\_2 = [ x\_2 [ x\_3 [ x\_4 [ x\_5 x\_6 ] ] ] ]$$ $$\ldots$$

Of course, we get the same number of operations. Can we do better?

The problem with the naive approach above is that we cannot reuse any intermediate computation. Even though $[x\_1 x\_2]$ is “part” of both $y_0$ and $y_1$, the way the computations are organised does not allow us to benefit from such sharing.

A better algorithm

Let’s move brackets around, relying on the associativity of the $\max$ operator. We can rewrite:

$$y\_0 = [ x\_0 [ x\_1 [ x\_2 [ x\_3 x\_4 ] ] ] ]$$ $$y\_1 = [ [ x\_1 [ x\_2 [ x\_3 x\_4 ] ] ] x\_5 ]$$ $$y\_2 = [ [ x\_2 [ x\_3 x\_4 ] ] [ x\_5 x\_6 ] ]$$ $$y\_3 = [ [ x\_3 x\_4 ] [ [ x\_5 x\_6 ] x\_7 ] ]$$ $$y\_4 = [ x\_4 [ [ [ x\_5 x\_6 ] x\_7 ] x\_8 ] ]$$ $$y\_5 = [ [ [ [ x\_5 x\_6 ] x\_7 ] x\_8 ] x\_9 ]$$

When computing $y\_1$, one can reuse the internal bracket $[ x\_1 [ x\_2 [ x\_3 x\_4 ] ] ]$ we had already computed as part of $y\_0$. So computing $y\_1$ involves just one more binary $\max$. If we keep track of such sharing, we can see that computing $y\_0, \ldots, y\_5$ involves only 12 operations, which is twice better than the 24 for the native algorithm (with $N=10, k=5$). If we want to go further in the sequence above, one can just “shift” the indices:

$$y\_6 = [ x\_6 [ x\_7 [ x\_8 [ x\_9 x\_{10} ] ] ] ]$$ $$y\_7 = [ [ x\_7 [ x\_8 [ x\_9 x\_{10} ] ] ] x\_{11} ]$$ $$y\_8 = [ [ x\_8 [ x\_9 x\_{10} ] ] [ x\_{11} x\_{12} ] ]$$ $$y\_9 = [ [ x\_9 x\_{10} ] [ [ x\_{11} x\_{12} ] x\_{13} ] ]$$ $$y\_{10} = [ x\_{10} [ [ [ x\_{11} x\_{12} ] x\_{13} ] x\_{14} ] ]$$ $$y\_{11} = [ [ [ [ x\_{11} x\_{12} ] x\_{13} ] x\_{14} ] x\_{15} ]$$

Note that we do not reuse any previously computed bracket for that new block $y_6, \ldots, y\_{11}$. Let’s count the number of operations in general. For the first block $y_0, \ldots, y_k$, we need:

The left and right folding $[x\_1 [ x\_2 [ x\_3 \ldots [ x\_{k-2} x\_{k-1} ] \ldots ] ] ]$ and $[ \ldots [ [x\_k x\_{k+1}] x\_{k+2} ] \ldots x_{2k-2} ]$, remembering their internal brackets. This requires $2(k-2)$ binary operations.
Then, each $y_i$ can be computed with one more binary operation (combining two of the results obtained above, or of of them and some $x_i$), hence $k + 1$ more operations.

To compute the first $k+1$ $y_i$, we thus need $3(k-1)$ operations. So, when the total number of $y_i$ to compute (i.e. $n-k+1$) is a multiple of $k+1$, we need $3(n-k+1)(1-\frac{2}{k+1})$ operations, i.e. roughly $3n$ for large $n$ and $k$, which is of course much better than $nk$ for the naive algorithm. The algorithm has some nice properties:

We can produce the $y_i$ series sequentially, remembering only at most $k$ values at each step.
We can also parallelize the calculation across independent blocks (of length $k+1$), since no intermediate result is shared between two such blocks.
We only used the associativity property of our “bracket” binary operator. The same algorithm would work for any associative operator. For instance, we can use it to compute the $i^{th}$ largest number in each window (working on a domain of multi-sets of at most $i$ elements).
The algorithm produces a static computation graph, which can be mapped directly to low level instructions with no comparison, no branching, no pointer, just applications the binary $\max$ operator to input or temporary variables assigned to outputs or other temporary variables (with at most $k$ live temporary variables at any instruction).

Implementation

This post is part of our “OCaml Blog”, so let’s write some (completely unidiomatic) OCaml :-)

Trying to be careful with index manipulations, the algorithm can be implemented as below:

val sliding_associative: 'a array -> ('a -> 'a -> 'a) -> int -> 'a array

let sliding_associative a f k =
  if k <= 0 then invalid_arg "sliding_associative";
  let n = Array.length a in
  if n < k then [||]
  else if k = 1 then a
  else if k = 2 then Array.init (n - 1) (fun i -> f a.(i) a.(i + 1))
  else
    let r = Array.make (n - k + 1) a.(0) in
    let b = Array.make (k - 2) a.(0) in
    begin try
      for row = 0 to (n - k) / (k + 1) do
        let ofs = row * (k + 1) in
        let ofs_k = ofs + k in

        b.(k - 3) <- f a.(ofs_k - 2) a.(ofs_k - 1);
        for i = k - 4 downto 0 do
          b.(i) <- f a.(ofs + i + 1) b.(i + 1)
        done;
        r.(ofs) <- f a.(ofs) b.(0);
        if ofs_k < n then begin
          let acc = ref a.(ofs + k) in
          for i = 1 to k - 2 do
            r.(ofs + i) <- f b.(i - 1) !acc;
            if ofs_k + i = n then raise_notrace Exit;
            acc := f !acc a.(ofs_k + i)
          done;
          r.(ofs_k - 1) <- f a.(ofs_k - 1) !acc;
          if ofs_k + k - 1 < n then r.(ofs_k) <- f !acc a.(ofs_k + k - 1);
        end
      done
    with Exit -> ()
    end;
    r

Optimality?

Is this algorithm optimal, in the sense that it requires the smallest possible number of applications of the binary $\max$ operator? I don’t know!

I’ve tried to find empirical optimal solutions, applying some restrictions to which solutions are considered.

Namely, let’s consider computation graphs where each intermediate result represents the $\max$ of a consecutive segment $x_i,\ldots,x_j$ ($i < j$). This excludes computing e.g. $[x_1 x_3]$. If we look at all intermediate results, i.e. all such segments $x_i,\ldots,x_j$ involved in the computation, a graph can be represented by a set $X$ of pairs $(i,j)$ with $0 \le i < j < n$, with the property that for any $(i,j) \in X$ there exists some $i \le l < j$ such that: (i) either $l=i$ or $(i,l) \in X$, and (ii) either $l+1=j$ or $(l+1,j) \in X$. This property means that any intermediate result $\max(x_i,\ldots,x_j)$ is obtained by splitting the segment into two sub-segments (possibly reduced to a single $x_i$), whose $\max$ has also been computed. This excludes relying on the idempotence of $\max$ to compute e.g. $\max(x_1,x_2,x_3,x_4,x_5)$ as $\max(\max(x_1,x_2,x_3),\max(x_3,x_4,x_5))$.

These restrictions correspond to the fact that the only property of the binary operator that we use is its associativity.

Moreover, we require that all pairs $(i, i+k-1)$ for $0 \le i \le n-k$ are in $X$, i.e. the graph actually computes the required outputs.

It’s rather straightforward to implement a brute-force search of such set $X$ with minimal size. For $k=4$ and $k=5$, I’ve confirmed that we get the same result as the algorithm presented above for all $n \le 18$; and also for $k=6$ for all $n \le 16$; for $k=7$ for all $n \le 15$.

I haven’t tried to prove that the algorithm is indeed optimal, at least with the restrictions above. I’d be interested to hear if somebody gives it a try, or find better solutions lifting the restrictions. Also, I’d be surprised if the algorithm wasn’t already known in some form or another. Let me know if you had already heard about it!

Application

The curious reader might be interested to know how we got interested with this problem at LexiFi. As you might know, our software allows modelling and pricing ad hoc financial products. A few weeks ago, we have been submitted a product with a rather complex definition, involving computing on a daily basis the $\max$ of some quoted underlying over the last 60 market observations (and then doing some more complex crazy stuff over this series). Pricing such product involves “running” this payoff formula many thousands of times over different simulated market trajectories (a.k.a. a Monte Carlo simulation), and it is thus of paramount importance that the code to evaluate the payoff be as fast as possible. As a matter of fact, we compile (just-in-time) our internal payoff description into some native code which we execute many times. For a product with such sliding-max feature, we could observe a 10 fold reduction in RAM usage for the internal product representation, and a similar gain in terms of pricing code generation and execution time

It is rather common that changing how a given product is represented can impact significantly the performance of our software, but it is very infrequent that we need to resort to such a non-trivial technique (albeit quite simple once you know about it!).

Another algorithm

Googling for “sliding window maximum algorithm”, one can find another efficient and very different algorithm, which consists in using a “deque” data structure to keep track, while “sliding the k-window” over the initial sequence, of at most $k$ elements from the current window in decreasing order. When the window moves to the next position, the new element is inserted in the data structure, removing all elements which are smaller than it from the deque (they are found at the beginning) and also the element which just moved out of the window. See e.g. this website or this one. This algorithm uses a similar number of operations, but a more advanced data structure, which cannot be expressed as a simple static computation graph (the counting of operations relies on an amortized analysis: each element is inserted and removed at most once from the deque, but when a new element is added, the number of elements dropped from the deque can be as large as $k$). Moreover, it does not seem that this algorithm extends to arbitrary associative operators.