8. Regular Flattening

In this chapter, we introduce the concept of regular moderate flattening [HSE+17], which is the essential technique used for making regular nested parallel Futhark programs run efficiently in practice on parallel hardware such as GPUs.

We first introduce a number of parallel segmented operations, which are essential for dealing with nested parallelism. The segmented operations, it turns out, can be implemented using Futhark’s standard SOAC parallel array combinators. In particular, it turns out that the scan operator is of critical importance in that it can be used to develop the notion of a segmented scan operation, an operation that, in its own right, is essential to many parallel algorithms. Based on the segmented scan operation and the other Futhark SOAC operations, we present a set of utility functions as well as their parallel implementations. The functions are used by the moderate flattening transformation presented in Section 8.6, but are also useful, as we shall see in Section 10, for the programmer to manage irregular parallelism through flattening transformations, performed manually by the programmer.

8.1. Segmented Scan

As mentioned, the segmented scan operation is quite essential for Futhark to flatten nested regular parallelism and for the programmer to flatten irregular nested parallel problems. The operation can be implemented with a simple scan using an associative function that operates on pairs of values [Ble90, Sch80]. Here is the definition of the segmented scan operation, hardcoded to work with addition:

-- Segmented scan with integer addition
def segmented_scan_add [n] (flags:[n]bool) (vals:[n]i32) : [n]i32 =
  let pairs = scan ( \(v1,f1) (v2,f2) ->
                       let f = f1 || f2
                       let v = if f2 then v2 else v1+v2
                       in (v,f) ) (0,false) (zip vals flags)
  let (res,_) = unzip pairs
  in res

We can make use of Futhark’s support for higher-order functions and polymorphism to define a generic version of segmented scan that will work for other monoidal structures than addition on i32 values:

def segmented_scan 't [n] (g:t->t->t) (ne: t) (flags: [n]bool) (vals: [n]t): [n]t =
  let pairs = scan ( \ (v1,f1) (v2,f2) ->
                       let f = f1 || f2
                       let v = if f2 then v2 else g v1 v2
                       in (v,f) ) (ne,false) (zip vals flags)
  let (res,_) = unzip pairs
  in res

We leave it up to the reader to prove that, given an associative function g, (1) the operator passed to scan is associative and (2) (ne, false) is a neutral element for the operator.

8.1.1. Finding the Longest Streak Using Segmented Scan

In this section we revisit the problem of Section 7.4 for finding the longest streak of increasing numbers. We show how we can make direct use of a segmented scan operation for solving the problem:

-- Longest streak of increasing numbers
def segmented_streak [n] (xs: [n]i32) : i32  =
  let ys = rotate 1 xs
  let is = (map2 (\x y -> if x < y then 1 else 0) xs ys)[0:n-1]
  let fs = map (==0) is
  let ss = segmented_scan_add fs is
  let res = reduce max 0 ss
  in res

The algorithm first constructs the is array, as in the previous algorithm, and then uses a segmented scan over a negation of this array over the unit-array to create the ss3 vector directly. Here is a derivation of how the segmented-scan based algorithm works:

Variable

xs

=

1

5

3

4

2

6

7

8

ys

=

5

3

4

2

6

7

8

1

is

=

1

0

1

0

1

1

1

fs

=

0

1

0

1

0

0

0

ss

=

1

0

1

0

1

2

3

res

=

3

The morale here is that the segmented scan operation provides us with a great abstraction.

8.2. Replicated Iota

The first utility function that we will present is called replicated_iota. Given an array of natural numbers specifying repetitions, the function returns an array of weakly increasing indices (starting from 0) and with each index repeated according to the repetition array. As an example, replicated_iota [2,3,1,1] returns the array [0,0,1,1,1,2,3]. The function is defined in terms of other parallel operations, including scan, map, scatter, and segmented_scan:

def replicated_iota [n] (reps:[n]i64) : []i64 =
  let s1 = scan (+) 0 reps
  let s2 = map (\i -> if i==0 then 0 else s1[i-1]) (iota n)
  let tmp = scatter (replicate (reduce (+) 0 reps) 0) s2 (iota n)
  let flags = map (>0) tmp
  in segmented_scan (+) 0 flags tmp

An example evaluation of a call to the function replicated_iota is provided below.

Args/Result

reps

=

2

3

1

1

s1

=

2

5

6

7

s2

=

0

2

5

6

replicate

=

0

0

0

0

0

0

0

tmp

=

0

0

1

0

0

2

3

flags

=

0

0

1

0

0

1

1

segmented_scan

=

0

0

1

1

1

2

3

8.3. Segmented Replicate

Another useful utility function is called segmented_replicate. Given a one-dimensional replication array containing natural numbers and a data array of the same shape, segmented_replicate returns an array of size equal to the sum of the values in the replication array with values from the data array replicated according to the corresponding replication values. As an example, a call segmented_replicate [2,1,0,3,0] [5,6,9,8,4] result in the array [5,5,6,8,8,8]. Here is the code that implements the function segmented_replicate:

def segmented_replicate [n] (reps:[n]i64) (vs:[n]i64) : []i64 =
  let idxs = replicated_iota reps
  in map (\i -> vs[i]) idxs

The segmented_replicate function makes use of the previously defined function replicated_iota.

8.4. Segmented Iota

Another useful utility function is the function segmented_iota that, given a array of flags (i.e., booleans), returns an array of index sequences, each of which is reset according to the booleans in the array of flags. As an example, the expression:

segmented_iota [false,false,false,true,false,false,false]

returns the array [0,1,2,0,1,2,3]. The segmented_iota function can be implemented with the use of a simple call to segmented_scan followed by a call to map:

def segmented_iota [n] (flags:[n]bool) : [n]i64 =
  let iotas = segmented_scan (+) 0 flags (replicate n 1)
  in map (\x -> x-1) iotas

8.5. Indexes to Flags

Many segmented operations, such as segmented_scan takes as argument an array of boolean flags for specifying when new segments start. Often, only the sizes of segments are known, which means that it may come in useful to be able to transform an array of segment sizes to a corresponding array of boolean flags. Here is one possible parallel implementation of such an idxs_to_flags function:

def idxs_to_flags [n] (is : [n]i64) : []bool =
  let vs = segmented_replicate is (iota n)
  let m = length vs
  in map2 (!=) (vs :> [m]i64) ([0] ++ vs[:m-1] :> [m]i64)

As an example use of the function, the expression idxs_to_flags [2,1,3] evaluates to the flag array [false,false,true,true,false,false]. Notice that the implementation also works in case some segments are of size zero.

8.6. Moderate Flattening

The flattening rules that we shall introduce here allow the Futhark compiler to generate parallel kernels for various code block patterns. In contrast to the general concept of flattening as introduced by Blelloch [BHS+94], Futhark applies a technique called moderate flattening [HSE+17], which does not cover arbitrary nested parallelism, but does cover well many regular nested parallel patterns. We shall come back to the issue of flattening irregular nested parallelism in Section 10.

In essence, moderate flattening works by matching compositions of fused constructs against a number of flattening rules. The aim is to merge (i.e., flatten) nested parallel operations into sequences of parallel operations. Although, such flattening is often possible, in particular due to an integrated transformation called vectorisation, there are situations where choices needs to be made. In particular, when a map is nested on top of a loop, we may choose to parallelise the outer map and sequentialise the inner loop, which on the GPU will amount to all threads running sequential loops in parallel. An alternative, when possible, will be to interchange the outer map and the loop and then sequentialise the outer loop (on the host) and parallelise the inner map, which will then be executed multiple times. It turns out that Futhark can make some guesses about which strategy to pursue based on possible information about the sizes of the arrays. An extension to the static concept moderate flattening, Futhark also supports a notion of flattening that generates multiple versions of flattened code, guarded by parameters that may be autotuned to achieve good performance for a range of different data sets [HTEO19].

In the following we shall focus on the transformations performed by moderate flattening.

8.6.1. Vectorisation

Assuming e' contains SOACs, transform the expression

map (\x -> let y = e in e') xs

into the expression

let ys = map (\x -> e) xs
in map (\(x,y) -> e') (zip xs ys)

This transformation does not itself capture any nested parallelism but may enable other transformations by eliminating the inner let-expression.

8.6.2. Map-Map Nesting

Nested applications of map constructs are in essence transformed into a single map construct by (1) flattening the argument array, (2) applying the inner function on the flattened array, and (3) unflattening the concatenated results. This process can be repeated for multiple nested map constructs. It turns out that the administrative operations can be implemented with zero overhead.

8.6.3. Map-Scan Nesting

In case of an expression made up from a map construct appearing on top of a scan operation, the expression is transformed into a regular segmented scan operation. That is, the expression:

map (\xs -> scan f ne xs) xss

is transformed into the expression:

regular_segmented_scan f ne xss

Notice here that we assume the availability of a regular segmented scan operation of type:

val regular_segmented_scan 't [n] [m]: (t->t->t) -> t -> [n][m]t -> [n][m]t

Internally, this function will use the inner size of the multi-dimensional argument array (i.e., m) to construct an appropriate flag vector suitable for the segmented scan. Again, for an in-depth discussion of how to implement a segmented scan operation on top of an ordinary scan operation, please consult Section 8.1.

8.6.4. Map-Reduce Nesting

In case of a map construct appearing on top of a reduce operation, this expression is transformed into a regular segmented reduction [LH17]. That is, the expression:

map (\xs -> reduce f ne xs) xss

is transformed into the expression:

regular_segmented_reduce f ne xss

Notice here that we assume the availability of a regular segmented reduction operation of type:

val regular_segmented_reduce 't [n] : (t->t->t) -> t -> [n][]t -> [n]t

Internally, this function can be implemented based on the function regular_segmented_scan discussed above. Here is a simple definition::

def regular_segmented_reduce = map last <-< regular_segmented_scan

8.6.5. Map-Iota Nesting

A map over an iota expression can be transformed to the composition of the segmented_iota function defined in Section 8.4 and a function ìdxs_to_flags, which converts an array of indices to an array fs of boolean flags of size equal to the sum of the values in xs and with true-values in indexes specified by the prefix sums of the index values.

As an example, the expression idxs_to_flags [2,1,3] evaluates to the flag array [false,false,true,true,false,false]. Notice that the expression idxs_to_flags [2,0,4] evaluates to the same boolean vector as idxs_to_flags [2,4]. We shall not here give a definition of the idxs_to_flags function, but refer the reader to Section 8.5.

All in all, an expression of the form:

map iota xs

is transformed into:

(segmented_iota <-< idxs_to_flags) xs

8.6.6. Map-Replicate Nesting

Recall that replicate has the type:

val replicate 't : (n:i32) -> t -> [n]t

A map over a replicate expression takes the form:

map (\x -> replicate n x) xs

where n is invariant to x. Such an expression can be transformed into the expression:

segmented_replicate (replicate (length xs) n) xs

As an example, consider the expression map (replicate 2) [8,5,1]. This expression is transformed into the expression:

segmented_replicate (replicate 3 2) [8,5,1]

which evaluates to [8,8,5,5,1,1]. Notice that the subexpression replicate 3 2 evaluates to [2,2,2].