# 5. A Parallel Cost Model for Futhark Programs¶

In this chapter we develop a more formal model for Futhark and provide an ideal cost model for the language in terms of the concepts of work and span. Before we present the cost model for the language, we present a simple type system for Futhark and an evaluation semantics. In the initial development, we shall not consider Futhark’s more advanced features such as loops and uniqueness types, but we shall return to these constructs later in the chapter.

Futhark supports certain kinds of nested parallelism. For instance, Futhark can in many cases map two nested maps into fully parallel code. Consider the following Futhark function:

```
def multable (n : i32) : [n][n]i32 =
map (\i ->
map (\j -> i * j) (iota n))
(iota n)
```

In the case of this program, Futhark will flatten the code to make a single flat kernel. We shall return to the concept of flattening in a later chapter.

When we shall understand how efficient an algorithm is, we shall build
our analysis around the two concepts of work and span. These
concepts are defined inductively over the various Futhark language
constructs and we may therefore argue about work and span in a
compositional way. For instance, if we want to know about the work
required to execute the `multable`

function, we need to know about
how to compute the work for a call to the `map`

SOAC, how to compute
the work for the `iota`

operation, how to compute the work for the
multiply operation, and, finally, how to combine the work. The way to
determine the work for a `map`

SOAC instance is to multiply the size
of the argument array with the work of the body of the argument
function. Thus, we have

Applying a similar argument to the outer map, we get

Most often we are interested in finding the asymptotical complexity of the algorithm we are analyzing, in which case we will simply write

In a similar way we can derive that the span of a call `multable n`

,
written \(S(\kw{multable n})\), is \(O(1)\).

## 5.1. Futhark - the Language¶

In this section we present a simplified version of the Futhark language in terms of syntax, a type system for the language, and a strict evaluation semantics.

We assume a countable infinite number of program variables, ranged over by \(x\) and \(f\). Binary infix scalar operators, first-order built-in operations, and second order array combinators are given as follows:

In the grammar for the Futhark language below, we have eluded both the required explicit type annotations and the optional explicit type annotations. Also for simplicity, we are considering only “unnested” pattern matching and we do not, in this section, consider uniqueness types.

## 5.2. Futhark Type System¶

Without considering Futhark’s uniqueness type system, Futhark’s type system is simple. Types (\(\tau\)) follow the following grammar-slightly simplified:

We shall refer to the types `i32`

, `f32`

, and `bool`

as *basic types*. Futhark supports more basic types than those
presented here; consult Section 2.1 for a complete list.

In practice, Futhark requires a programmer to provide explicit
parameter types and an explicit result type for top-level function
declarations. Similarly, in practice, Futhark requires explicit types
for top-level `let`

bindings. In such explicit types, type variables
are not allowed; at present, Futhark does not allow for a programmer
to declare polymorphic functions.

Futhark’s second order array combinators and some of its primitive
operations do have *polymorphic* types, which we specify by
introducing the concept of *type schemes*, ranged over by
\(\sigma\), which are basically quantified types with
\(\alpha\) and \(\beta\) ranging over ordinary types. When
\(\sigma=\forall\vec{\alpha}.\tau\) is some type scheme, we say
that \(\tau'\) is an instance of \(\sigma\), written
\(\sigma \geq \tau'\) if there exists a substitution
\([\vec{\tau}/\vec{\alpha}]\) such that
\(\tau[\vec{\tau}/\vec{\alpha}] = \tau'\). We require all
substitutions to be *simple* in the sense that substitutions do not
allow for function types, product types, or type variables to be
substituted. Other restrictions may apply, which will be specified
using a *type variable constraint* \(\alpha \triangleright T\),
where \(T\) is a set of basic types.

The type schemes for Futhark’s second-order array combinators are as follows:

The type schemes for Futhark’s built-in first-order operations are as follows:

The type schemes for Futhark’s built-in infix scalar operations are as follows:

We use \(\Gamma\) to range over *type environments*, which are
finite maps mapping variables to types. We use \(\{\}\) to denote
the empty type environment and \(\{x:\tau\}\) to denote a singleton
type environment. When \(\Gamma\) is some type environment, we write
\(\Gamma,x:\tau\) to denote the type environment with domain
\(\Dom(\Gamma) \cup \{x\}\) and values
\((\Gamma,x:\tau)(y) = \tau\) if \(y=x\) and \(\Gamma(y)\),
otherwise. Moreover, when \(\Gamma\) and \(\Gamma'\) are type
environments, we write \(\Gamma + \Gamma'\) to denote the type
environment with domain \(\Dom(\Gamma) \cup \Dom(\Gamma')\) and
values \((\Gamma + \Gamma')(x) = \Gamma'(x)\) if
\(x \in \Dom(\Gamma')\) and \(\Gamma(x)\), otherwise.

Type judgments for values take the form \(\vd v : \tau\), which are read “the value \(v\) has type \(\tau\).” Type judgments for expressions take the form \(\Gamma \vd e : \tau\), which are read “in the type environment \(\Gamma\), the expression \(e\) has type \(\tau\).” Finally, type judgments for programs take the form \(\Gamma \vd P : \Gamma'\), which are read “in the type environment \(\Gamma\), the program \(P\) has type environment \(\Gamma'\).”

Values \(~~\sembox{ \vd v : \tau}\)

Expressions \(~~\sembox{\Gamma \vd e : \tau}\)

Functions \(~~\sembox{\Gamma \vd F : \tau}\)

Programs \(~~\sembox{\Gamma \vd P : \Gamma'}\)

For brevity, we have eluded some of the typing rules and we leave it
to the reader to create typing rules for `rearrange`

, `shape`

,
`reshape`

, `loop-for`

, `loop-while`

, and array ranging (`e[i:j:o]`

).

## 5.3. Futhark Evaluation Semantics¶

In this section we develop a simple evaluation semantics for Futhark
programs. The semantics is presented as a *big step* evaluation function
that takes as parameter an expression and gives as a result a value. A
*soundness property* states that if a program \(P\) is well-typed
and contains a function *main* of type \(() \rightarrow
\tau\), then, if evaluation of the program results in a value \(v\),
the value \(v\) has type \(\tau\).

To ease the presentation, we treat the evaluation function as being implicitly parameterised by the program \(P\).

The semantics of types yields their natural set interpretations:

For ease of presentation, we consider a syntactic vector value \(\kw{[}v_1,\cdots,v_n\kw{]}\) equal to the projection function on the vector, returning a default value of the underlying type for indexes greater than \(n-1\) (zero-based interpretation).

For built-in operators \(\id{op}_\tau\), annotated with their type instance \(\tau\) according to the typing rules, we assume a semantic function \(\Eval{\id{op}_\tau} : \Eval{\tau}\). As an examples, we assume \(\Eval{\kw{+}_{\kw{i32}\rightarrow\kw{i32}\rightarrow\kw{i32}}} : \Z \rightarrow \Z \rightarrow \Z\).

When \(e\) is some expression, we write \(e[v_1/x_1,\cdots,v_n/x_n]\) to denote the simultaneous substitution of \(v_1,\cdots,v_n\) for \(x_1,\cdots,x_n\) (after appropriate renaming of bound variables) in \(e\).

Evaluation of an expression \(e\) is defined by an evaluation function \(\Eval{\cdot} : \mathrm{Exp} \rightarrow \mathrm{Val}\). The function is defined in a mutually recursive fashion with an auxiliary utility function \(\extractF{F}\) for extracting SOAC function parameters. We first give the definition for \(\Eval{\cdot}\):

Given a SOAC function parameter \(F\), we define the utility
*extraction function*, \(\extractF{F}\), as follows:

Type soundness is expressed by the following proposition:

Proposition: Futhark Type Soundness

If \(~\vd P : \Gamma\) and \(\Gamma(\id{main}) = () \rightarrow \tau\) and \(\Eval{\id{main}~\kw{()}} = v\) then \(~\vd v : \tau\).

Notice that we have glanced over the concept of bounds checking by assuming that arrays with elements of type \(\tau\) are implemented as total functions from \(N\) to \(\Eval{\tau}\).

## 5.4. Work and Span¶

In this section we give a cost model for Futhark in terms of functions
for determining the total *work* done by a program, in terms of
operations done by the big-step evaluation semantics, and the *span*
of the program execution, in terms of the maximum depth of the
computation, assuming an infinite amount of parallelism in the SOAC
computations. The functions for work and span, denoted by \(W :
\mathrm{Exp} \rightarrow \N\) and \(S : \mathrm{Exp} \rightarrow
\N\) are given below. The functions are defined independently,
although they make use of the evaluation function
\(\Eval{\cdot}\). We have given the definitions for the essential
SOAC functions, namely `map`

and `reduce`

. The definitions for
the remaining SOACs follow the same lines as the definitions for
`map`

and `reduce`

.

Work (\(W\))

Span (\(S\))

## 5.5. Reduction by Contraction¶

In this section, we shall investigate an implementation of reduction
using the general concept of *contraction*, which is the general
algorithmic trick of solving a particular problem by first making a
*contraction step*, which simplifies the problem size, and then
repeating the contraction algorithm until a final result is reached
[Org16].

The reduction algorithm that we shall implement assumes an associative reduction operator \(\oplus : A \rightarrow A \rightarrow A\), a neutral element of type \(A\), and a vector \(v\) of size \(2^n\), containing elements of type \(A\). If \(\mathrm{size}(v) = 1\), the algorithm returns the single element. Otherwise, the algorithm performs a contraction by splitting the vector in two and applies the reduction operator elementwise on the two subvectors, thereby obtaining a contracted vector, which is then used as input to a recursive call to the algorithm. In Futhark, the function can be implemented as follows:

```
def red (xs : []i32) : i32 =
let xs = loop xs=padpow2 0 xs while length xs > 1 do
let n = length xs / 2
in map2 (+) (take n xs) (take n (drop n xs))
```

The function specializes the reduction operator \(\oplus\) to be
`+`

and the neutral element to be `0`

. The function first pads the
argument vector `xs`

with neutral elements to ensure that its size
is a power of two. It then implements a sequential loop with the
contraction step as its loop body, implemented by a parallel
\(\fop{map}\) over an appropriately split input vector.

The auxiliary function for padding the input vector is implemented by the following code:

```
-- Find the smallest power of two greater than n
def nextpow2 (n:i32) : i32 =
loop a=2 while a < n do 2*a
-- Pad a vector to make its size a power of two
def padpow2 [n] (ne: i32) (v:[n]i32) : []i32 =
concat v (replicate (nextpow2 n - n) ne)
```

### 5.5.1. Determining Work and Span¶

To determine the work and span of the algorithm `red`

, we first
determine the work and span for `padpow2`

, for which we again need
to determine the work and span for `nextpow2`

. From simple
inspection we have \(W(\kw{nextpow2 n}) = S(\kw{nextpow2 n}) =
O(\mathrm{log}\,\kw{n})\). Now, from the definition of \(W\) and
\(S\) and because \(\kw{nextpow2 n} \leq 2\,\kw{n}\), we have

and

where \(\kw{n} = \size~\kw{v}\).

Each loop iteration in has span \(O(1)\). Because the loop is iterated at-most \(\log(2\,\kw{n})\) times, we have (where \(\kw{n} = \size\,\kw{v}\))

It is an exercise for the reader to compare the performance of the
reduction code to the performance of Futhark’s built-in `reduce`

SOAC (see Section 3.2).

## 5.6. Radix-Sort by Contraction¶

Another example of a contraction-based algorithm is radix-sort. Radix-sort is a non-comparison based sorting routine, which implements sorting by iteratively moving elements with a particular bit set to the beginning (or end) in the array. It turns out that this move of elements with the same bit set can be parallelised. Thus, for arrays containing 32-bit unsigned integers, the sorting routine needs only 32 loop-iterations to sort the array. A central property of each step is that elements with identical bit values will not shift position. Depending on whether the algorithm consistently moves elements with the bit set to the end of the array or to the beginning of the array results in the array being sorted in either ascending or descending order.

### 5.6.1. Radix-Sort in Futhark¶

A radix-sort algorithm that sorts the argument vector in ascending order is shown below:

```
def rsort_step [n] (xs: [n]u32, bitn: i32): [n]u32 =
let bits1 = map (\x -> (i32.u32 (x >> u32.i32 bitn)) & 1) xs
let bits0 = map (1-) bits1
let idxs0 = map2 (*) bits0 (scan (+) 0 bits0)
let idxs1 = scan (+) 0 bits1
let offs = reduce (+) 0 bits0
let idxs1 = map2 (*) bits1 (map (+offs) idxs1)
let idxs = map2 (+) idxs0 idxs1
let idxs = map (\x->x-1) idxs
in scatter (copy xs) idxs xs
-- Radix sort algorithm, ascending
def rsort [n] (xs: [n]u32): [n]u32 =
loop (xs) for i < 32 do rsort_step(xs,i)
```

The function `rsort_step`

implements the contraction step that takes
care of moving all elements with the `bitn`

set to the end of the
array. The main function `rsort`

takes care of iterating the
contraction step until the array is sorted (i.e., when the contraction
step has been executed for all bits.) To appreciate the purpose of
each data-parallel operation in the function, the table below
illustrates how `rsort_step`

takes care of moving elements with a
particular bit set (bit 1) to the end of the array. The example
assumes the current array (`xs`

) contains the array
`[2,0,6,4,2,1,5,9]`

. Notice that the last three values all have
their 0-bit set whereas the first five values have not. The values
of `xs`

marked with \(\dagger\) are the ones with bit 1 set.

Variable | ||||||||
---|---|---|---|---|---|---|---|---|

`xs` |
\(2^\dagger\) | 0 | \(2^\dagger\) | 4 | \(2^\dagger\) | 1 | 5 | 9 |

`bits1` |
1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |

`bits0` |
0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 |

`scan (+) 0 bits0` |
0 | 1 | 1 | 2 | 2 | 3 | 4 | 5 |

`idxs0` |
0 | 1 | 0 | 2 | 0 | 3 | 4 | 5 |

`idxs1` |
1 | 1 | 2 | 2 | 3 | 3 | 3 | 3 |

`idxs1'` |
6 | 6 | 7 | 7 | 8 | 8 | 8 | 8 |

`idxs1''` |
6 | 0 | 7 | 0 | 8 | 0 | 0 | 0 |

`idxs` |
6 | 1 | 7 | 2 | 8 | 3 | 4 | 5 |

`map (-1) idxs` |
5 | 0 | 6 | 1 | 7 | 2 | 3 | 4 |

By a straightforward analysis, we can argue that
\(W(\kw{rsort}~\kw{v})
= O(\kw{n})\), where \(n = \length\,\kw{v}\); each of the operations in
has work \(O(\kw{n})\) and `rsort_step`

is called a
constant number of times (i.e., 32 times). Similarly, we can argue that
\(S(\kw{rsort}~\kw{v}) = O(\log\,\kw{n})\), dominated by the SOAC
calls in `rsort_step`

.

## 5.7. Counting Primes¶

A variant of a contraction algorithm is an algorithm that first solves a smaller problem, recursively, and then uses this result to provide a solution to the larger problem. One such algorithm is a version of the Sieve of Eratosthenes that, to find the primes smaller than some \(n\), first calculates the primes smaller than \(\sqrt n\). It then uses this intermediate result for sieving away the integers in the range \(\sqrt n\) up to \(n\) that are multiples of the primes smaller than \(\sqrt n\).

Unfortunately, Futhark does not presently support recursion, thus, one needs to use a \(\fw{loop}\) construct instead to implement the sieve. A Futhark program calculating the number of primes below some number \(n\), also denoted in the literature as the \(\pi\) function, is shown below:

```
-- Find the first n primes
def primes (n:i32) : []i32 =
let (acc, _) = loop (acc,c) = ([],2) while c < n+1 do
let c2 = i32.min (c * c) (n+1)
let is = map (+c) (iota(c2-c))
let fs = map (\i ->
let xs = map (\p -> if i%p==0 then 1
else 0) acc
in reduce (+) 0 xs) is
-- apply the sieve
let new = filter (\i -> 0 == fs[i-c]) is
in (concat acc new, c2)
in acc
-- Return the number of primes less than n
def main (n:i32) : i32 =
let ps = primes n in length ps
```

Notice that the algorithm applies a parallel sieve for each step, using a combination of maps and reductions. The best known sequential algorithm for finding the number of primes below some \(n\) takes time \(O(n\,\log\,\log\,n)\). Although the present algorithm is quite efficient in practice, it is not work effcient, since the work inside the loop is super-linear. The loop itself introduces a span with a \(\log\,\log\,n\) factor because the size of the problem is squared at each step, which is identical to doubling the exponent size at each step (i.e., the sequence \(2^2, 2^4, 2^8, 2^{16}, \ldots, n\), where \(n=2^{2^m}\), for some positive \(m\), has \(m = \log\,\log\,n\) elements.)

In Section 9.4 we discuss the possibility of using a flattening approach to implement a work-efficient parallel Sieve-of-Erastothenes algorithm.