6. A Parallel Cost Model for Futhark Programs¶
In this chapter we develop a more formal model for Futhark and provide an ideal cost model for the language in terms of the concepts of work and span. Before we present the cost model for the language, we present a simple type system for Futhark and an evaluation semantics. In the initial development, we shall not consider Futhark’s more advanced features such as loops and uniqueness types, but we shall return to these constructs later in the chapter.
Futhark supports certain kinds of nested parallelism. For instance, Futhark can in many cases map two nested maps into fully parallel code. Consider the following Futhark function:
def multable (n : i32) : [n][n]i32 =
map (\i >
map (\j > i * j) (iota n))
(iota n)
In the case of this program, Futhark will flatten the code to make a single flat kernel. We shall return to the concept of flattening in a later chapter.
When we shall understand how efficient an algorithm is, we shall build
our analysis around the two concepts of work and span. These
concepts are defined inductively over the various Futhark language
constructs and we may therefore argue about work and span in a
compositional way. For instance, if we want to know about the work
required to execute the multable
function, we need to know about
how to compute the work for a call to the map
SOAC, how to compute
the work for the iota
operation, how to compute the work for the
multiply operation, and, finally, how to combine the work. The way to
determine the work for a map
SOAC instance is to multiply the size
of the argument array with the work of the body of the argument
function. Thus, we have
Applying a similar argument to the outer map, we get
Most often we are interested in finding the asymptotical complexity of the algorithm we are analyzing, in which case we will simply write
In a similar way we can derive that the span of a call multable n
,
written \(S(\kw{multable n})\), is \(O(1)\).
6.1. Futhark  the Language¶
In this section we present a simplified version of the Futhark language in terms of syntax, a type system for the language, and a strict evaluation semantics.
We assume a countable infinite number of program variables, ranged over by \(x\) and \(f\). Binary infix scalar operators, firstorder builtin operations, and second order array combinators are given as follows:
In the grammar for the Futhark language below, we have eluded both the required explicit type annotations and the optional explicit type annotations. Also for simplicity, we are considering only “unnested” pattern matching and we do not, in this section, consider uniqueness types.
6.2. Futhark Type System¶
Without considering Futhark’s uniqueness type system, Futhark’s type system is simple. Types (\(\tau\)) follow the following grammarslightly simplified:
We shall refer to the types i32
, f32
, and bool
as basic types. Futhark supports more basic types than those
presented here; consult Section 2.1 for a complete list.
In practice, Futhark requires a programmer to provide explicit
parameter types and an explicit result type for toplevel function
declarations. Similarly, in practice, Futhark requires explicit types
for toplevel let
bindings. In such explicit types, type variables
are not allowed; at present, Futhark does not allow for a programmer
to declare polymorphic functions.
Futhark’s second order array combinators and some of its primitive operations do have polymorphic types, which we specify by introducing the concept of type schemes, ranged over by \(\sigma\), which are basically quantified types with \(\alpha\) and \(\beta\) ranging over ordinary types. When \(\sigma=\forall\vec{\alpha}.\tau\) is some type scheme, we say that \(\tau'\) is an instance of \(\sigma\), written \(\sigma \geq \tau'\) if there exists a substitution \([\vec{\tau}/\vec{\alpha}]\) such that \(\tau[\vec{\tau}/\vec{\alpha}] = \tau'\). We require all substitutions to be simple in the sense that substitutions do not allow for function types, product types, or type variables to be substituted. Other restrictions may apply, which will be specified using a type variable constraint \(\alpha \triangleright T\), where \(T\) is a set of basic types.
The type schemes for Futhark’s secondorder array combinators are as follows:
The type schemes for Futhark’s builtin firstorder operations are as follows:
The type schemes for Futhark’s builtin infix scalar operations are as follows:
We use \(\Gamma\) to range over type environments, which are finite maps mapping variables to types. We use \(\{\}\) to denote the empty type environment and \(\{x:\tau\}\) to denote a singleton type environment. When \(\Gamma\) is some type environment, we write \(\Gamma,x:\tau\) to denote the type environment with domain \(\Dom(\Gamma) \cup \{x\}\) and values \((\Gamma,x:\tau)(y) = \tau\) if \(y=x\) and \(\Gamma(y)\), otherwise. Moreover, when \(\Gamma\) and \(\Gamma'\) are type environments, we write \(\Gamma + \Gamma'\) to denote the type environment with domain \(\Dom(\Gamma) \cup \Dom(\Gamma')\) and values \((\Gamma + \Gamma')(x) = \Gamma'(x)\) if \(x \in \Dom(\Gamma')\) and \(\Gamma(x)\), otherwise.
Type judgments for values take the form \(\vd v : \tau\), which are read “the value \(v\) has type \(\tau\).” Type judgments for expressions take the form \(\Gamma \vd e : \tau\), which are read “in the type environment \(\Gamma\), the expression \(e\) has type \(\tau\).” Finally, type judgments for programs take the form \(\Gamma \vd P : \Gamma'\), which are read “in the type environment \(\Gamma\), the program \(P\) has type environment \(\Gamma'\).”
Values \(~~\sembox{ \vd v : \tau}\)
Expressions \(~~\sembox{\Gamma \vd e : \tau}\)
Functions \(~~\sembox{\Gamma \vd F : \tau}\)
Programs \(~~\sembox{\Gamma \vd P : \Gamma'}\)
For brevity, we have eluded some of the typing rules and we leave it
to the reader to create typing rules for rearrange
, shape
,
reshape
, loopfor
, loopwhile
, and array ranging (e[i:j:o]
).
6.3. Futhark Evaluation Semantics¶
In this section we develop a simple evaluation semantics for Futhark programs. The semantics is presented as a big step evaluation function that takes as parameter an expression and gives as a result a value. A soundness property states that if a program \(P\) is welltyped and contains a function main of type \(() \rightarrow \tau\), then, if evaluation of the program results in a value \(v\), the value \(v\) has type \(\tau\).
To ease the presentation, we treat the evaluation function as being implicitly parameterised by the program \(P\).
The semantics of types yields their natural set interpretations:
For ease of presentation, we consider a syntactic vector value \(\kw{[}v_1,\cdots,v_n\kw{]}\) equal to the projection function on the vector, returning a default value of the underlying type for indexes greater than \(n1\) (zerobased interpretation).
For builtin operators \(\id{op}_\tau\), annotated with their type instance \(\tau\) according to the typing rules, we assume a semantic function \(\Eval{\id{op}_\tau} : \Eval{\tau}\). As an examples, we assume \(\Eval{\kw{+}_{\kw{i32}\rightarrow\kw{i32}\rightarrow\kw{i32}}} : \Z \rightarrow \Z \rightarrow \Z\).
When \(e\) is some expression, we write \(e[v_1/x_1,\cdots,v_n/x_n]\) to denote the simultaneous substitution of \(v_1,\cdots,v_n\) for \(x_1,\cdots,x_n\) (after appropriate renaming of bound variables) in \(e\).
Evaluation of an expression \(e\) is defined by an evaluation function \(\Eval{\cdot} : \mathrm{Exp} \rightarrow \mathrm{Val}\). The function is defined in a mutually recursive fashion with an auxiliary utility function \(\extractF{F}\) for extracting SOAC function parameters. We first give the definition for \(\Eval{\cdot}\):
Given a SOAC function parameter \(F\), we define the utility extraction function, \(\extractF{F}\), as follows:
Type soundness is expressed by the following proposition:
Proposition: Futhark Type Soundness
If \(~\vd P : \Gamma\) and \(\Gamma(\id{main}) = () \rightarrow \tau\) and \(\Eval{\id{main}~\kw{()}} = v\) then \(~\vd v : \tau\).
Notice that we have glanced over the concept of bounds checking by assuming that arrays with elements of type \(\tau\) are implemented as total functions from \(N\) to \(\Eval{\tau}\).
6.4. Work and Span¶
In this section we give a cost model for Futhark in terms of functions
for determining the total work done by a program, in terms of
operations done by the bigstep evaluation semantics, and the span
of the program execution, in terms of the maximum depth of the
computation, assuming an infinite amount of parallelism in the SOAC
computations. The functions for work and span, denoted by \(W :
\mathrm{Exp} \rightarrow \N\) and \(S : \mathrm{Exp} \rightarrow
\N\) are given below. The functions are defined independently,
although they make use of the evaluation function
\(\Eval{\cdot}\). We have given the definitions for the essential
SOAC functions, namely map
and reduce
. The definitions for
the remaining SOACs follow the same lines as the definitions for
map
and reduce
.
Work (\(W\))
Span (\(S\))
6.5. Reduction by Contraction¶
In this section, we shall investigate an implementation of reduction using the general concept of contraction, which is the general algorithmic trick of solving a particular problem by first making a contraction step, which simplifies the problem size, and then repeating the contraction algorithm until a final result is reached [Org16].
The reduction algorithm that we shall implement assumes an associative reduction operator \(\oplus : A \rightarrow A \rightarrow A\), a neutral element of type \(A\), and a vector \(v\) of size \(2^n\), containing elements of type \(A\). If \(\mathrm{size}(v) = 1\), the algorithm returns the single element. Otherwise, the algorithm performs a contraction by splitting the vector in two and applies the reduction operator elementwise on the two subvectors, thereby obtaining a contracted vector, which is then used as input to a recursive call to the algorithm. In Futhark, the function can be implemented as follows:
def red (xs : []i32) : i32 =
let xs = loop xs=padpow2 0 xs while length xs > 1 do
let n = length xs / 2
in map2 (+) (take n xs) (take n (drop n xs))
The function specializes the reduction operator \(\oplus\) to be
+
and the neutral element to be 0
. The function first pads the
argument vector xs
with neutral elements to ensure that its size
is a power of two. It then implements a sequential loop with the
contraction step as its loop body, implemented by a parallel
\(\fop{map}\) over an appropriately split input vector.
The auxiliary function for padding the input vector is implemented by the following code:
 Find the smallest power of two greater than n
def nextpow2 (n:i32) : i32 =
loop a=2 while a < n do 2*a
 Pad a vector to make its size a power of two
def padpow2 [n] (ne: i32) (v:[n]i32) : []i32 =
concat v (replicate (nextpow2 n  n) ne)
6.5.1. Determining Work and Span¶
To determine the work and span of the algorithm red
, we first
determine the work and span for padpow2
, for which we again need
to determine the work and span for nextpow2
. From simple
inspection we have \(W(\kw{nextpow2 n}) = S(\kw{nextpow2 n}) =
O(\mathrm{log}\,\kw{n})\). Now, from the definition of \(W\) and
\(S\) and because \(\kw{nextpow2 n} \leq 2\,\kw{n}\), we have
and
where \(\kw{n} = \size~\kw{v}\).
Each loop iteration in has span \(O(1)\). Because the loop is iterated atmost \(\log(2\,\kw{n})\) times, we have (where \(\kw{n} = \size\,\kw{v}\))
It is an exercise for the reader to compare the performance of the
reduction code to the performance of Futhark’s builtin reduce
SOAC (see Section 3.2).
6.6. RadixSort by Contraction¶
Another example of a contractionbased algorithm is radixsort. Radixsort is a noncomparison based sorting routine, which implements sorting by iteratively moving elements with a particular bit set to the beginning (or end) in the array. It turns out that this move of elements with the same bit set can be parallelised. Thus, for arrays containing 32bit unsigned integers, the sorting routine needs only 32 loopiterations to sort the array. A central property of each step is that elements with identical bit values will not shift position. Depending on whether the algorithm consistently moves elements with the bit set to the end of the array or to the beginning of the array results in the array being sorted in either ascending or descending order.
6.6.1. RadixSort in Futhark¶
A radixsort algorithm that sorts the argument vector in ascending order is shown below:
def rsort_step [n] (xs: [n]u32, bitn: i32): [n]u32 =
let bits1 = map (\x > (i32.u32 (x >> u32.i32 bitn)) & 1) xs
let bits0 = map (1) bits1
let idxs0 = map2 (*) bits0 (scan (+) 0 bits0)
let idxs1 = scan (+) 0 bits1
let offs = reduce (+) 0 bits0
let idxs1 = map2 (*) bits1 (map (+offs) idxs1)
let idxs = map2 (+) idxs0 idxs1
let idxs = map (\x>x1) idxs
in scatter (copy xs) idxs xs
 Radix sort algorithm, ascending
def rsort [n] (xs: [n]u32): [n]u32 =
loop (xs) for i < 32 do rsort_step(xs,i)
The function rsort_step
implements the contraction step that takes
care of moving all elements with the bitn
set to the end of the
array. The main function rsort
takes care of iterating the
contraction step until the array is sorted (i.e., when the contraction
step has been executed for all bits.) To appreciate the purpose of
each dataparallel operation in the function, the table below
illustrates how rsort_step
takes care of moving elements with a
particular bit set (bit 1) to the end of the array. The example
assumes the current array (xs
) contains the array
[2,0,6,4,2,1,5,9]
. Notice that the last three values all have
their 0bit set whereas the first five values have not. The values
of xs
marked with \(\dagger\) are the ones with bit 1 set.
Variable 



\(2^\dagger\) 
0 
\(2^\dagger\) 
4 
\(2^\dagger\) 
1 
5 
9 

1 
0 
1 
0 
1 
0 
0 
0 

0 
1 
0 
1 
0 
1 
1 
1 

0 
1 
1 
2 
2 
3 
4 
5 

0 
1 
0 
2 
0 
3 
4 
5 

1 
1 
2 
2 
3 
3 
3 
3 

6 
6 
7 
7 
8 
8 
8 
8 

6 
0 
7 
0 
8 
0 
0 
0 

6 
1 
7 
2 
8 
3 
4 
5 

5 
0 
6 
1 
7 
2 
3 
4 
By a straightforward analysis, we can argue that
\(W(\kw{rsort}~\kw{v})
= O(\kw{n})\), where \(n = \length\,\kw{v}\); each of the operations in
has work \(O(\kw{n})\) and rsort_step
is called a
constant number of times (i.e., 32 times). Similarly, we can argue that
\(S(\kw{rsort}~\kw{v}) = O(\log\,\kw{n})\), dominated by the SOAC
calls in rsort_step
.
6.7. Counting Primes¶
A variant of a contraction algorithm is an algorithm that first solves a smaller problem, recursively, and then uses this result to provide a solution to the larger problem. One such algorithm is a version of the Sieve of Eratosthenes that, to find the primes smaller than some \(n\), first calculates the primes smaller than \(\sqrt n\). It then uses this intermediate result for sieving away the integers in the range \(\sqrt n\) up to \(n\) that are multiples of the primes smaller than \(\sqrt n\).
Unfortunately, Futhark does not presently support recursion, thus, one needs to use a \(\fw{loop}\) construct instead to implement the sieve. A Futhark program calculating the number of primes below some number \(n\), also denoted in the literature as the \(\pi\) function, is shown below:
 Find the first n primes
def primes (n:i32) : []i32 =
let (acc, _) = loop (acc,c) = ([],2) while c < n+1 do
let c2 = i32.min (c * c) (n+1)
let is = map (+c) (iota(c2c))
let fs = map (\i >
let xs = map (\p > if i%p==0 then 1
else 0) acc
in reduce (+) 0 xs) is
 apply the sieve
let new = filter (\i > 0 == fs[ic]) is
in (concat acc new, c2)
in acc
 Return the number of primes less than n
def main (n:i32) : i32 =
let ps = primes n in length ps
Notice that the algorithm applies a parallel sieve for each step, using a combination of maps and reductions. The best known sequential algorithm for finding the number of primes below some \(n\) takes time \(O(n\,\log\,\log\,n)\). Although the present algorithm is quite efficient in practice, it is not work effcient, since the work inside the loop is superlinear. The loop itself introduces a span with a \(\log\,\log\,n\) factor because the size of the problem is squared at each step, which is identical to doubling the exponent size at each step (i.e., the sequence \(2^2, 2^4, 2^8, 2^{16}, \ldots, n\), where \(n=2^{2^m}\), for some positive \(m\), has \(m = \log\,\log\,n\) elements.)
In Section 10.4 we discuss the possibility of using a flattening approach to implement a workefficient parallel SieveofErastothenes algorithm.