6. A Parallel Cost Model for Futhark Programs¶

In this chapter we develop a more formal model for Futhark and provide an ideal cost model for the language in terms of the concepts of work and span. Before we present the cost model for the language, we present a simple type system for Futhark and an evaluation semantics. In the initial development, we shall not consider Futhark’s more advanced features such as loops and uniqueness types, but we shall return to these constructs later in the chapter.

Futhark supports certain kinds of nested parallelism. For instance, Futhark can in many cases map two nested maps into fully parallel code. Consider the following Futhark function:

def multable (n : i64) : [n][n]i64 =
  map (\i ->
    map (\j -> i * j) (iota n))
      (iota n)

In the case of this program, Futhark will flatten the code to make a single flat kernel. We shall return to the concept of flattening in a later chapter.

When we shall understand how efficient an algorithm is, we shall build our analysis around the two concepts of work and span. These concepts are defined inductively over the various Futhark language constructs and we may therefore argue about work and span in a compositional way. For instance, if we want to know about the work required to execute the multable function, we need to know about how to compute the work for a call to the map SOAC, how to compute the work for the iota operation, how to compute the work for the multiply operation, and, finally, how to combine the work. The way to determine the work for a map SOAC instance is to multiply the size of the argument array with the work of the body of the argument function. Thus, we have

\[W(\fop{map}~(\lamK{j}{i * j})~(\fop{iota}~n)) = n+1\]

Applying a similar argument to the outer map, we get

\[W(\fop{map}~(\lamK{i}{$\cdots$})~(\fop{iota}~\kw{$n$})) = (n+1)^2\]

Most often we are interested in finding the asymptotical complexity of the algorithm we are analyzing, in which case we will simply write

\[W(\fop{map}~(\lamK{i}{$\cdots$}) (\fop{iota}~n) = O(n^2)\]

In a similar way we can derive that the span of a call multable n, written $S(\kw{multable n})$, is $O(1)$.

6.1. Futhark - the Language¶

In this section we present a simplified version of the Futhark language in terms of syntax, a type system for the language, and a strict evaluation semantics.

We assume a countable infinite number of program variables, ranged over by $x$ and $f$. Binary infix scalar operators, first-order built-in operations, and second order array combinators are given as follows:

\[\begin{split}\id{binop} &::=~~ \fop{+} ~~|~~ \fop{-} ~~|~~ \fop{*} ~~|~~ \fop{/} ~~|~~ \cdots \\ \\ \id{op} &::=~~ \fop{-} ~~|~~ \fop{abs} ~~|~~ \fop{copy} ~~|~~ \fop{concat} ~~|~~ \fop{empty} \\ &~~|~~ \fop{iota} ~~|~~ \fop{partition} ~~|~~ \fop{rearrange} \\ &~~|~~ \fop{replicate} ~~|~~ \fop{reshape} \\ &~~|~~ \fop{rotate} ~~|~~ \fop{shape} ~~|~~ \fop{scatter} \\ &~~|~~ \fop{split} ~~|~~ \fop{transpose} ~~|~~ \fop{unzip} ~~|~~ \fop{zip} \\ \\ \id{soac} &::=~~ \fop{map} ~~|~~ \fop{reduce} \\ &~~|~~ \fop{scan} ~~|~~ \fop{filter} ~~|~~ \fop{partition}\end{split}\]

In the grammar for the Futhark language below, we have eluded both the required explicit type annotations and the optional explicit type annotations. Also for simplicity, we are considering only “unnested” pattern matching and we do not, in this section, consider uniqueness types.

\[\begin{split}p &::=~~ x ~~|~~ \kw{(}x_1,...,x_n\kw{)} \\ \\ \id{ps} &::=~~ p_1 \cdots p_n \\ \\ F &::=~~ \lam{ps}{e} ~~|~~ e~\id{binop} ~~|~~ \id{binop}~e \\ \\ P &::=~~ \fw{let}~f~\id{ps}~\kw{=}~e ~~|~~ P_1 P_2 ~~|~~ \fw{let}~p~\kw{=}~e \\ \\ v &::=~~ \fop{true} ~~|~~ \fop{false} ~~|~~ n ~~|~~ r \\ &~~|~~ \kw{[}v_1,...,v_n\kw{]} ~~|~~ \kw{(}v_1,...,v_n\kw{)} \\ \\ e &::=~~ x ~~|~~ v ~~|~~ \fw{let}~\id{ps}~\kw{=}~e~\fw{in}~e' \\ &~~|~~ e\kw{[}e'\kw{]} ~~|~~ e\kw{[}e'\kw{:}e''\kw{]} \\ &~~|~~ \kw{[}e_1,...,e_n\kw{]} ~~|~~ \kw{(}v_1,...,v_n\kw{)} \\ &~~|~~ f e_1 ... e_n ~~|~~ \id{op}~e_1 ... e_n ~~|~~ e_1~\id{binop}~e_2 \\ &~~|~~ \fw{loop}~ p_1\kw{=}e_1,\cdots,p_n\kw{=}e_n ~\fw{for}~ x \kw{<} e ~\fw{do}~ e' \\ &~~|~~ \fw{loop}~ p_1\kw{=}e_1,\cdots,p_n\kw{=}e_n ~\fw{while}~ e ~\fw{do}~ e' \\ &~~|~~ \id{soac}~F~e_1~\cdots~e_n\end{split}\]

6.2. Futhark Type System¶

Without considering Futhark’s uniqueness type system, Futhark’s type system is simple. Types ($\tau$) follow the following grammar-slightly simplified:

\[\begin{split}\tau & ::=~~ \kw{i32} ~~|~~ \kw{f32} ~~|~~ \kw{bool} ~~|~~ \kw{[]}\tau \\ & ~~|~~ \kw{(}\tau_1,\cdots,\tau_n\kw{)} ~~|~~ \tau \rarr \tau' ~~|~~ \alpha\end{split}\]

We shall refer to the types i32, f32, and bool as basic types. Futhark supports more basic types than those presented here; consult Section 2.1 for a complete list.

In practice, Futhark requires a programmer to provide explicit parameter types and an explicit result type for top-level function declarations. Similarly, in practice, Futhark requires explicit types for top-level let bindings. In such explicit types, type variables are not allowed; at present, Futhark does not allow for a programmer to declare polymorphic functions.

Futhark’s second order array combinators and some of its primitive operations do have polymorphic types, which we specify by introducing the concept of type schemes, ranged over by $\sigma$, which are basically quantified types with $\alpha$ and $\beta$ ranging over ordinary types. When $\sigma=\forall\vec{\alpha}.\tau$ is some type scheme, we say that $\tau'$ is an instance of $\sigma$, written $\sigma \geq \tau'$ if there exists a substitution $[\vec{\tau}/\vec{\alpha}]$ such that $\tau[\vec{\tau}/\vec{\alpha}] = \tau'$. We require all substitutions to be simple in the sense that substitutions do not allow for function types, product types, or type variables to be substituted. Other restrictions may apply, which will be specified using a type variable constraint $\alpha \triangleright T$, where $T$ is a set of basic types.

The type schemes for Futhark’s second-order array combinators are as follows:

\[\begin{split}\id{soac} & ~~:~~ \mathrm{TypeOf}(\id{soac}) \\ \hline \fop{filter} & ~~:~~ \forall \alpha. (\alpha \rarr \mathtt{bool}) \rarr []\alpha \rarr []\alpha\\ \fop{map} & ~~:~~ \forall \alpha_1\cdots\alpha_n\beta. (\alpha_1\rarr\cdots\rarr\alpha_n \rarr \beta) \\ & ~~~~~ \rarr []\alpha_1 \rarr\cdots\rarr []\alpha_n \rarr []\beta\\ \fop{reduce} & ~~:~~ \forall \alpha. (\alpha \rarr \alpha \rarr \alpha) \rarr \alpha \rarr []\alpha \rarr \alpha\\ \fop{scan} & ~~:~~ \forall \alpha. (\alpha \rarr \alpha \rarr \alpha) \rarr \alpha \rarr []\alpha \rarr []\alpha\\\end{split}\]

The type schemes for Futhark’s built-in first-order operations are as follows:

\[\begin{split} \id{op} & ~~:~~ \mathrm{TypeOf}(\id{op}) \\ \hline \fop{concat} & ~~:~~ \forall \alpha. []\alpha \rarr \cdots \rarr []\alpha \rarr []\alpha \\ \fop{empty} & ~~:~~ \forall \alpha. []\alpha \\ \fop{iota} & ~~:~~ \kw{int} \rarr []\kw{int} \\ \fop{replicate} & ~~:~~ \forall \alpha. \kw{int} \rarr \alpha \rarr []\alpha\\ \fop{rotate} & ~~:~~ \forall \alpha. \kw{int} \rarr []\alpha \rarr []\alpha\\ \fop{transpose} & ~~:~~ \forall \alpha. [][]\alpha \rarr [][]\alpha\\ \fop{unzip} & ~~:~~ \forall \alpha_1\cdots\alpha_n. [](\alpha_1,\cdots,\alpha_n) \\ & ~~~~~ \rarr ([]\alpha_1,\cdots,[]\alpha_n) \\ \fop{scatter} & ~~:~~ \forall \alpha. []\alpha \rarr []\kw{int} \rarr []\alpha \rarr []\alpha \\ \fop{zip} & ~~:~~ \forall \alpha_1\cdots\alpha_n. []\alpha_1\rarr\cdots\rarr[]\alpha_n \\ & ~~~~~ \rarr [](\alpha_1,\cdots,\alpha_n)\end{split}\]

The type schemes for Futhark’s built-in infix scalar operations are as follows:

\[\begin{split}\id{binop} & ~~:~~ \mathrm{TypeOf}(\id{binop}) \\ \hline \kw{+},\kw{-},\kw{*},\kw{/},\cdots & ~~:~~ \forall \alpha \triangleright \{\kw{i32},\kw{f32}\}. \alpha \rarr \alpha \rarr \alpha \\ \kw{==},\kw{!=},\kw{<},\kw{<=},\kw{>},\kw{>=} & ~~:~~ \forall \alpha \triangleright \{\kw{i32},\kw{f32}\}. \alpha \rarr \alpha \rarr \kw{bool}\end{split}\]

We use $\Gamma$ to range over type environments, which are finite maps mapping variables to types. We use $\{\}$ to denote the empty type environment and $\{x:\tau\}$ to denote a singleton type environment. When $\Gamma$ is some type environment, we write $\Gamma,x:\tau$ to denote the type environment with domain $\Dom(\Gamma) \cup \{x\}$ and values $(\Gamma,x:\tau)(y) = \tau$ if $y=x$ and $\Gamma(y)$, otherwise. Moreover, when $\Gamma$ and $\Gamma'$ are type environments, we write $\Gamma + \Gamma'$ to denote the type environment with domain $\Dom(\Gamma) \cup \Dom(\Gamma')$ and values $(\Gamma + \Gamma')(x) = \Gamma'(x)$ if $x \in \Dom(\Gamma')$ and $\Gamma(x)$, otherwise.

Type judgments for values take the form $\vd v : \tau$, which are read “the value $v$ has type $\tau$.” Type judgments for expressions take the form $\Gamma \vd e : \tau$, which are read “in the type environment $\Gamma$, the expression $e$ has type $\tau$.” Finally, type judgments for programs take the form $\Gamma \vd P : \Gamma'$, which are read “in the type environment $\Gamma$, the program $P$ has type environment $\Gamma'$.”

Values $~~\sembox{ \vd v : \tau}$

\[\begin{split}\fraccc{}{\vd r : \kw{f32}} \\[2mm]\end{split}\]

\[\begin{split}\fraccc{}{\vd n : \kw{i32}} \\[2mm]\end{split}\]

\[\begin{split}\fraccc{}{\vd \fop{true} : \kw{bool}} \\[2mm]\end{split}\]

\[\begin{split}\fraccc{}{\vd \fop{false} : \kw{bool}} \\[2mm]\end{split}\]

\[\begin{split}\fraccc{\vd v_i : \tau_i ~~~i = [1;n]}{\vd \kw{(}v_1,\cdots,v_n\kw{)} : \kw{(}\tau_1,\cdots,\tau_n\kw{)}} \\[2mm]\end{split}\]

\[\fraccc{\vd v_i : \tau ~~~i = [1;n]}{\vd \kw{[}v_1,\cdots,v_n\kw{]} : \kw{[]}\tau}\]

Expressions $~~\sembox{\Gamma \vd e : \tau}$

\[\begin{split}\fraccc{\Gamma(x) = \tau}{\Gamma \vd x : \tau} \\[2mm]\end{split}\]

\[\begin{split}\fraccc{\Gamma \vd e : []\tau ~~~~\Gamma \vd e_i : \kw{int}~~i=[1,2]}{ \Gamma \vd e\kw{[}e_1:e_2\kw{]} : []\tau} \\[2mm]\end{split}\]

\[\begin{split}\fraccc{\Gamma \vd e : \tau ~~~~ \Gamma, x:\tau \vd e' : \tau'}{\Gamma \vd \fw{let}~x~\kw{=} ~e~ \fw{in}~e' : \tau'} \\[2mm]\end{split}\]

\[\begin{split}\fraccc{\Gamma \vd e : \kw{(}\tau_1,\cdots,\tau_n\kw{)} \\ \Gamma, x_1:\tau_1,\cdots,x_n:\tau_n \vd e' : \tau}{\Gamma \vd \fw{let}~\kw{(}x_1,\cdots,x_n\kw{)} ~\kw{=} ~e~ \fw{in}~e' : \tau} \\[2mm]\end{split}\]

\[\begin{split}\fraccc{\Gamma \vd e_i : \tau_i ~~~i = [1;n]}{\Gamma \vd \kw{(}e_1,\cdots,e_n\kw{)} : \kw{(}\tau_1,\cdots,\tau_n\kw{)}} \\[2mm]\end{split}\]

\[\begin{split}\fraccc{\Gamma \vd e_i : \tau ~~~i = [1;n]}{\Gamma \vd \kw{[}e_1,\cdots,e_n\kw{]} : \kw{[]}\tau} \\[2mm]\end{split}\]

\[\begin{split}\fraccc{\vd v : \tau}{\Gamma \vd v : \tau} \\[2mm]\end{split}\]

\[\begin{split}\fraccc{\Gamma(f) = \kw{(}\tau_1,\cdots,\tau_n\kw{)} \rarr \tau ~~~~ \Gamma \vd e_i : \tau_i ~~ i = [1;n]}{\Gamma \vd f~e_1\cdots e_n : \tau} \\[2mm]\end{split}\]

\[\begin{split}\fraccc{\Gamma \vd e_i : \tau_i ~~ i = [1;2] \\ \mathrm{TypeOf}(\id{binop}) \geq \tau ~~~~ \tau = \tau_1 \rarr \tau_2 \rarr \tau'}{\Gamma \vd e_1 ~\id{binop}_\tau~ e_2 : \tau' } \\[2mm]\end{split}\]

\[\begin{split}\fraccc{\Gamma \vd e_i : \tau_i ~~ i = [1;n] \\ \mathrm{TypeOf}(\id{op}) \geq \tau \\ \tau = \tau_1 \rarr \cdots \rarr \tau_n \rarr \tau'}{\Gamma \vd \id{op}_\tau~ e_1~\cdots~e_n : \tau' } \\[2mm]\end{split}\]

\[\begin{split}\fraccc{\Gamma \vd e : []\tau ~~~~~~\Gamma \vd e' : \kw{int}}{ \Gamma \vd e\kw{[}e'\kw{]} : \tau} \\[2mm]\end{split}\]

\[\begin{split}\fraccc{\Gamma \vd F : \tau_\mathrm{f} ~~~ \Gamma \vd e_i : \tau_i ~~ i = [1;n] \\ \mathrm{TypeOf}(\id{soac}) \geq \tau_\mathrm{f} \rarr \tau_1 \rarr \cdots \rarr \tau_n \rarr \tau}{ \Gamma \vd \id{soac} ~F~e_1~\cdots~e_n : \tau }\end{split}\]

Functions $~~\sembox{\Gamma \vd F : \tau}$

\[\begin{split}\fraccc{\Gamma,x_1:\tau_1~\cdots~x_n:\tau_n \vd e : \tau}{ \Gamma \vd \lam{x_1:\tau_1~\cdots~x_n:\tau_n}{e} : \tau_1 \rarr \cdots \rarr \tau_n \rarr \tau } \\[2mm]\end{split}\]

\[\begin{split}\fraccc{\Gamma \vd e : \tau_1 \\ \mathrm{TypeOf}(\id{binop}) \geq \tau_1 \rarr \tau_2 \rarr \tau}{\Gamma \vd e ~\id{binop} : \tau_2 \rarr \tau } \\[2mm]\end{split}\]

\[\begin{split}\fraccc{\Gamma \vd e : \tau_2 \\ \mathrm{TypeOf}(\id{binop}) \geq \tau_1 \rarr \tau_2 \rarr \tau}{\Gamma \vd \id{binop}~e : \tau_1 \rarr \tau }\end{split}\]

Programs $~~\sembox{\Gamma \vd P : \Gamma'}$

\[\begin{split}\fraccc{\Gamma \vd e : \tau ~~~~~ x \not \in \Dom(\Gamma)}{\Gamma \vd \fw{let}~x~\kw{=} ~e : \{x:\tau\}} \\[2mm]\end{split}\]

\[\begin{split}\fraccc{\Gamma \vd P_1 : \Gamma_1 ~~~~ \Gamma + \Gamma_1 \vd P_2 : \Gamma_2} {\Gamma \vd P_1~P_2 : \Gamma_1 + \Gamma_2} \\[2mm]\end{split}\]

\[\fraccc{\Gamma,x_1:\tau_1,\cdots,x_n:\tau_n \vd e : \tau ~~~~~ f \not \in \Dom(\Gamma)} {\Gamma \vd \fw{let}~f~\kw{(}x_1,\cdots,x_n\kw{)}~\kw{=} ~e : \{f:\kw{(}\tau_1,\cdots,\tau_n\kw{)} \rarr \tau\}}\]

For brevity, we have eluded some of the typing rules and we leave it to the reader to create typing rules for rearrange, shape, reshape, loop-for, loop-while, and array ranging (e[i:j:o]).

6.3. Futhark Evaluation Semantics¶

In this section we develop a simple evaluation semantics for Futhark programs. The semantics is presented as a big step evaluation function that takes as parameter an expression and gives as a result a value. A soundness property states that if a program $P$ is well-typed and contains a function main of type $() \rightarrow \tau$, then, if evaluation of the program results in a value $v$, the value $v$ has type $\tau$.

To ease the presentation, we treat the evaluation function as being implicitly parameterised by the program $P$.

The semantics of types yields their natural set interpretations:

\[\begin{split}\Eval{\kw{i32}} & =~~ \Z \\ \Eval{\kw{f32}} & =~~ \R \\ \Eval{\kw{bool}} & =~~ \{\fop{true},\fop{false}\} \\ \Eval{\kw{(}\tau_1,\cdots,\tau_n\kw{)}} & =~~ \Eval{\tau_1} \times \cdots \times \Eval{\tau_n} \\ \Eval{\kw{[]}\tau} & =~~ \N \rightarrow \Eval{\tau} \\ \Eval{\tau_1 \rightarrow \tau_2} & =~~ \Eval{\tau_1} \rightarrow \Eval{\tau_2}\end{split}\]

For ease of presentation, we consider a syntactic vector value $\kw{[}v_1,\cdots,v_n\kw{]}$ equal to the projection function on the vector, returning a default value of the underlying type for indexes greater than $n-1$ (zero-based interpretation).

For built-in operators $\id{op}_\tau$, annotated with their type instance $\tau$ according to the typing rules, we assume a semantic function $\Eval{\id{op}_\tau} : \Eval{\tau}$. As an examples, we assume $\Eval{\kw{+}_{\kw{i32}\rightarrow\kw{i32}\rightarrow\kw{i32}}} : \Z \rightarrow \Z \rightarrow \Z$.

When $e$ is some expression, we write $e[v_1/x_1,\cdots,v_n/x_n]$ to denote the simultaneous substitution of $v_1,\cdots,v_n$ for $x_1,\cdots,x_n$ (after appropriate renaming of bound variables) in $e$.

Evaluation of an expression $e$ is defined by an evaluation function $\Eval{\cdot} : \mathrm{Exp} \rightarrow \mathrm{Val}$. The function is defined in a mutually recursive fashion with an auxiliary utility function $\extractF{F}$ for extracting SOAC function parameters. We first give the definition for $\Eval{\cdot}$:

\[\begin{split}\Eval{f~e_1~\cdots~e_n} & =~~ \Eval{e[\Eval{e_1}/x_1\cdots \Eval{e_n}/x_n]} \\ & ~ ~~~~~ \mathrm{where}~\fw{let}~f~x_1~ \cdots x_n ~\kw{=}~ e \in P \\ \Eval{v} & =~~ v \\ \Eval{e\kw{[}e'\kw{]}} & =~~ \Eval{e}(\Eval{e'}) \\ \Eval{\fw{let}~x~\kw{=}~e~\fw{in}~e'} & =~~ \Eval{e'[\Eval{e}/x]} \\ \Eval{\fw{let}~\kw{(}x_1,\cdots,x_n\kw{)}~\kw{=}~e~\fw{in}~e'} & =~~ \Eval{e'[v_1/x_1\cdots v_n/x_n]} \\ & ~ ~~~~~ \mathrm{where}~ \Eval{e} = \kw{(}v_1,\cdots,v_n\kw{)} \\ \Eval{\kw{[}e_1,\cdots,e_n\kw{]}} & =~~ \kw{[}\Eval{e_1},\cdots,\Eval{e_n}\kw{]} \\ \Eval{\kw{(}e_1,\cdots,e_n\kw{)}} & =~~ \kw{(}\Eval{e_1},\cdots,\Eval{e_n}\kw{)} \\ \Eval{e_1~\id{binop}_\tau~e_2} & =~~ \sem{\id{binop}_\tau}~\Eval{e_1}~\Eval{e_2} \\ \Eval{\id{op}_\tau~e_1\cdots e_n} & =~~ \sem{\id{op}_\tau}~\Eval{e_1}~\cdots~\Eval{e_n} \\ \Eval{\fop{map}~F~e_1\cdots e_m} & =~~ \Eval{\kw{[}e'[v_1^1/x_1\cdots v_1^m/x_m],\cdots,e'[v_n^1/x_n\cdots v_n^m/x_m]\kw{]}} \\ & ~ ~~~~~\mathrm{where}~\lambda x_1\cdots x_m . e' = \extractF{F} \\ & ~ ~~~~~\mathrm{and}~ \kw{[}v_1^i,\cdots,v_n^i\kw{]} = \Eval{e_i} ~~~ i=[1..m]\end{split}\]

Given a SOAC function parameter $F$, we define the utility extraction function, $\extractF{F}$, as follows:

\[\begin{split}\extractF{\lam{x_1\cdots x_n}{e}} & =~~ \lambda x_1 \cdots x_n . e \\ \extractF{\id{binop}~e} & =~~ \lambda x . x~\id{binop}~v \\ & ~ ~~~~~\mathrm{where}~v = \Eval{e} \\ \extractF{e~\id{binop}} & =~~ \lambda x . v~\id{binop}~x \\ & ~ ~~~~~\mathrm{where}~v = \Eval{e}\end{split}\]

Type soundness is expressed by the following proposition:

Proposition: Futhark Type Soundness

If $~\vd P : \Gamma$ and $\Gamma(\id{main}) = () \rightarrow \tau$ and $\Eval{\id{main}~\kw{()}} = v$ then $~\vd v : \tau$.

Notice that we have glanced over the concept of bounds checking by assuming that arrays with elements of type $\tau$ are implemented as total functions from $N$ to $\Eval{\tau}$.

6.4. Work and Span¶

In this section we give a cost model for Futhark in terms of functions for determining the total work done by a program, in terms of operations done by the big-step evaluation semantics, and the span of the program execution, in terms of the maximum depth of the computation, assuming an infinite amount of parallelism in the SOAC computations. The functions for work and span, denoted by $W : \mathrm{Exp} \rightarrow \N$ and $S : \mathrm{Exp} \rightarrow \N$ are given below. The functions are defined independently, although they make use of the evaluation function $\Eval{\cdot}$. We have given the definitions for the essential SOAC functions, namely map and reduce. The definitions for the remaining SOACs follow the same lines as the definitions for map and reduce.

Work ($W$)

\[\begin{split}W(v) &=~~ 1 \\ W(\fw{let}~x~\kw{=}~e~\fw{in}~e') &=~~ W(e) + W(e'[\Eval{e}/x]) + 1 \\ W(\fw{let}~(x_1,...,x_n)~\kw{=}~e~\fw{in}~e') &=~~ \Let~[v_1,...,v_n] = \Eval{e} \\ & ~ ~~ \In~W(e) + W(e'[v_1/x_1,\cdots,v_n/x_n]) + 1 \\ W(\kw{[}e_1,\cdots,e_n\kw{]}) &=~~ W(e_1) + \ldots + W(e_n) + 1 \\ W(\kw{(}e_1,\cdots,e_n\kw{)}) &=~~ W(e_1) + \ldots + W(e_n) + 1 \\ W(f~e_1 \cdots e_n) &=~~ W(e_1) + \ldots + W(e_n) + W(e[\Eval{e_1}/x_1,\cdots \Eval{e_n}/x_n]) + 1 \\ & ~ ~~ \quad \mathrm{where} (\fw{let}~f~x_1~\cdots~x_n~=~e) ~\in~ P \\ W(e_1 \id{binop} e_2) &=~~ W(e_1) + W(e_2) + 1 \\ W(\fop{map}~F~e) &=~~ \Let~[v_1,\cdots,v_n] = \Eval{e} \\ & ~ ~~ \quad \lambda x. e' = \extractF{F} \\ & ~ ~~ \In~W(e) + W(e'[v_1/x]) + \ldots + W(e'[v_n/x]) \\ W(\fop{reduce}~F~e'~e'') &=~~ \Let~[v_1,\cdots,v_n] = \Eval{e''} \\ & ~ ~~ \quad \lambda x~x'. e = \extractF{F} \\ & ~ ~~ \In~W(e') + W(e'') + W(e[v_1/x,v_n/x']) \times n + 1 \\ & ~ ~~ \quad \mathrm{assuming} ~ W(e[v_1/x,v_n/x']) ~\mathrm{indifferent~to} ~v_1~ \mathrm{and} ~v_n \\ W(\fop{iota}~e) &=~~ W(e) + n \quad \mathrm{where} ~ n = \Eval{e}\end{split}\]

Span ($S$)

\[\begin{split}S(v) &=~~ 1 \\ S(\fw{let}~x~=~e~\fw{in}~e') &=~~ S(e) + S(e'[\Eval{e}/x]) + 1 \\ S(\fw{let}~(x_1,...,x_n)~=~e~\fw{in}~e') &=~~ \Let~[v_1,...,v_n] = \Eval{e} \\ & ~ ~~ \In~S(e) + S(e'[v_1/x_1,\cdots,v_n/x_n]) \\ S([e_1,\cdots,e_n]) &=~~ S(e_1) + \ldots + S(e_n) + 1 \\ S((e_1,\cdots,e_n)) &=~~ S(e_1) + \ldots + S(e_n) + 1 \\ S(f e_1 \cdots e_n) &=~~ S(e_1) + \ldots + S(e_n) + S(e[\Eval{e_1}/x_1,\cdots \Eval{e_n}/x_n]) + 1 \\ & ~ ~~ \quad \mathrm{where}~(\fw{let}~f~x_1~\cdots~x_n~=~e) ~\in~ P \\ S(e_1~\id{binop}~e_2) &=~~ S(e_1) + S(e_2) + 1 \\ S(\fop{map}~F~e) &=~~ \Let~[v_1,\cdots,v_n] = \Eval{e} \\ &~ ~~ \quad \lambda x. e' = \extractF{F} \\ &~ ~~ \In~S(e) + \id{max}(S(e'[v_1/x]), \ldots , S(e'[v_n/x])) + 1 \\ S(\fop{reduce}~F~e'~e'') &=~~ \Let~[v_1,\cdots,v_n] = \Eval{e''} \\ & ~ ~~ \quad \lambda x~x'. e = \extractF{F} \\ & ~ ~~ \In~S(e') + S(e'') + W(e[v_1/x,v_n/x']) \times \mathrm{ln}\,n + 1 \\ & ~ ~~ \quad \mathrm{assuming} ~ W(e[v_1/x,v_n/x']) ~\mathrm{indifferent~to} ~v_1~ \mathrm{and} ~v_n \\ S(\fop{iota}~ e) &=~~ S(e) + 1\end{split}\]

Note how the rule for reduce makes it explicit that Futhark does not promise to exploit any parallelism inside the reduction operator.

6.5. Reduction by Contraction¶

In this section, we shall investigate an implementation of reduction using the general concept of contraction, which is the general algorithmic trick of solving a particular problem by first making a contraction step, which simplifies the problem size, and then repeating the contraction algorithm until a final result is reached [Org16].

The reduction algorithm that we shall implement assumes an associative reduction operator $\oplus : A \rightarrow A \rightarrow A$, a neutral element of type $A$, and a vector $v$ of size $2^n$, containing elements of type $A$. If $\mathrm{size}(v) = 1$, the algorithm returns the single element. Otherwise, the algorithm performs a contraction by splitting the vector in two and applies the reduction operator elementwise on the two subvectors, thereby obtaining a contracted vector, which is then used as input to a recursive call to the algorithm. In Futhark, the function can be implemented as follows:

def red (xs : []i32) : i32 =
  let xs = loop xs=padpow2 0 xs while length xs > 1 do
    let n = length xs / 2
    in map2 (+) (take n xs) (take n (drop n xs))
  in xs[0]

The function specializes the reduction operator $\oplus$ to be + and the neutral element to be 0. The function first pads the argument vector xs with neutral elements to ensure that its size is a power of two. It then implements a sequential loop with the contraction step as its loop body, implemented by a parallel $\fop{map}$ over an appropriately split input vector.

The auxiliary function for padding the input vector is implemented by the following code:

-- Find the smallest power of two greater than n
def nextpow2 (n:i64) : i64 =
  loop a=2 while a < n do 2*a

-- Pad a vector to make its size a power of two
def padpow2 [n] (ne: i32) (v:[n]i32) : []i32 =
  concat v (replicate (nextpow2 n - n) ne)

6.5.1. Determining Work and Span¶

To determine the work and span of the algorithm red, we first determine the work and span for padpow2, for which we again need to determine the work and span for nextpow2. From simple inspection we have $W(\kw{nextpow2 n}) = S(\kw{nextpow2 n}) = O(\mathrm{log}\,\kw{n})$. Now, from the definition of $W$ and $S$ and because $\kw{nextpow2 n} \leq 2\,\kw{n}$, we have

\[W(\kw{padpow2 ne v}) = W(\fop{concat}~v~\kw{(}\fop{replicate}~\kw{(nextpow2 n - n) ne)}) = O(\kw{n})\]

and

\[S(\kw{padpow2 ne v}) = O(\log\,\kw{n})\]

where $\kw{n} = \size~\kw{v}$.

Each loop iteration in has span $O(1)$. Because the loop is iterated at-most $\log(2\,\kw{n})$ times, we have (where $\kw{n} = \size\,\kw{v}$)

\[\begin{split}W(\kw{red v}) & =~~ O(\kw{n}) + O(\kw{n/2}) + O(\kw{n/4}) + \cdots + O(1) ~~=~~ O(\kw{n}) \\ S(\kw{red v}) & =~~ O(\log\,\kw{n})\end{split}\]

It is an exercise for the reader to compare the performance of the reduction code to the performance of Futhark’s built-in reduce SOAC (see Section 3.2).

6.6. Radix-Sort by Contraction¶

Another example of a contraction-based algorithm is radix-sort. Radix-sort is a non-comparison based sorting routine, which implements sorting by iteratively moving elements with a particular bit set to the beginning (or end) in the array. It turns out that this move of elements with the same bit set can be parallelised. Thus, for arrays containing 32-bit unsigned integers, the sorting routine needs only 32 loop-iterations to sort the array. A central property of each step is that elements with identical bit values will not shift position. Depending on whether the algorithm consistently moves elements with the bit set to the end of the array or to the beginning of the array results in the array being sorted in either ascending or descending order.

6.6.1. Radix-Sort in Futhark¶

A radix-sort algorithm that sorts the argument vector in ascending order is shown below:

def rsort_step [n] (xs: [n]u32, bitn: i32): [n]u32 =
  let bits1 = map (\x -> (i64.u32 (x >> u32.i32 bitn)) & 1) xs
  let bits0 = map (1-) bits1
  let idxs0 = map2 (*) bits0 (scan (+) 0 bits0)
  let idxs1 = scan (+) 0 bits1
  let offs  = reduce (+) 0 bits0
  let idxs1 = map2 (*) bits1 (map (+offs) idxs1)
  let idxs  = map2 (+) idxs0 idxs1
  let idxs  = map (\x->x-1) idxs
  in scatter (copy xs) idxs xs

-- Radix sort algorithm, ascending
def rsort [n] (xs: [n]u32): [n]u32 =
  loop (xs) for i < 32 do rsort_step(xs,i)

The function rsort_step implements the contraction step that takes care of moving all elements with the bitn set to the end of the array. The main function rsort takes care of iterating the contraction step until the array is sorted (i.e., when the contraction step has been executed for all bits.) To appreciate the purpose of each data-parallel operation in the function, the table below illustrates how rsort_step takes care of moving elements with a particular bit set (bit 1) to the end of the array. The example assumes the current array (xs) contains the array [2,0,6,4,2,1,5,9]. Notice that the last three values all have their 0-bit set whereas the first five values have not. The values of xs marked with $\dagger$ are the ones with bit 1 set.

Variable
`xs`	$2^\dagger$	0	$2^\dagger$	4	$2^\dagger$	1	5	9
`bits1`	1	0	1	0	1	0	0	0
`bits0`	0	1	0	1	0	1	1	1
`scan (+) 0 bits0`	0	1	1	2	2	3	4	5
`idxs0`	0	1	0	2	0	3	4	5
`idxs1`	1	1	2	2	3	3	3	3
`idxs1'`	6	6	7	7	8	8	8	8
`idxs1''`	6	0	7	0	8	0	0	0
`idxs`	6	1	7	2	8	3	4	5
`map (-1) idxs`	5	0	6	1	7	2	3	4

By a straightforward analysis, we can argue that $W(\kw{rsort}~\kw{v}) = O(\kw{n})$, where $n = \length\,\kw{v}$; each of the operations in has work $O(\kw{n})$ and rsort_step is called a constant number of times (i.e., 32 times). Similarly, we can argue that $S(\kw{rsort}~\kw{v}) = O(\log\,\kw{n})$, dominated by the SOAC calls in rsort_step.

6.7. Counting Primes¶

A variant of a contraction algorithm is an algorithm that first solves a smaller problem, recursively, and then uses this result to provide a solution to the larger problem. One such algorithm is a version of the Sieve of Eratosthenes that, to find the primes smaller than some $n$, first calculates the primes smaller than $\sqrt n$. It then uses this intermediate result for sieving away the integers in the range $\sqrt n$ up to $n$ that are multiples of the primes smaller than $\sqrt n$.

Unfortunately, Futhark does not presently support recursion, thus, one needs to use a $\fw{loop}$ construct instead to implement the sieve. A Futhark program calculating the number of primes below some number $n$, also denoted in the literature as the $\pi$ function, is shown below:

-- Find the first n primes
def primes (n:i64) : []i64 =
  let (acc, _) = loop (acc,c) = ([],2) while c < n+1 do
	let c2 = i64.min (c * c) (n+1)
	let is = map (+c) (iota (c2-c))
	let fs = map (\i ->
		      let xs = map (\p -> if i%p==0 then 1
					  else 0) acc
		      in reduce (+) 0 xs) is
	-- apply the sieve
	let new = filter (\i -> 0 == fs[i-c]) is
	in (concat acc new, c2)
  in acc

-- Return the number of primes less than n
def main (n:i64) =
  length (primes n)

Notice that the algorithm applies a parallel sieve for each step, using a combination of maps and reductions. The best known sequential algorithm for finding the number of primes below some $n$ takes time $O(n\,\log\,\log\,n)$. Although the present algorithm is quite efficient in practice, it is not work effcient, since the work inside the loop is super-linear. The loop itself introduces a span with a $\log\,\log\,n$ factor because the size of the problem is squared at each step, which is identical to doubling the exponent size at each step (i.e., the sequence $2^2, 2^4, 2^8, 2^{16}, \ldots, n$, where $n=2^{2^m}$, for some positive $m$, has $m = \log\,\log\,n$ elements.)

In Section 10.4 we discuss the possibility of using a flattening approach to implement a work-efficient parallel Sieve-of-Erastothenes algorithm.