# 9. Irregular Flattening¶

In this chapter, we investigate a number of challenging irregular algorithms, which cannot be dealt with directly using Futhark’s moderate flattening technique discussed in Section 7.6.

## 9.1. Flattening by Expansion¶

For dealing with large non-regular problems, we need ways to
regularise the problems so that they become tractable with the regular
parallel techniques that we have seen demonstrated previously. One way
to regularise a problem is by *padding* data such that the data fits a
regular parallel schema. However, by doing so, we run the risk that
the program will use too many parallel resources for computations on
the padding data. This problem will arise, in particular, if the data
is very irregular. As a simple, and also visualisable, example,
consider the task of determining the points that make up a number of
line segments given by sets of two points in a 2D grid. Whereas we may
easily devise an algorithm for determining the grid points that make
up a single line segment, it is not immediately obvious how we can
efficiently regularise the problem of drawing multiple line segments,
as each line segment will end up being represented by a different
number of points. If we choose to implement a padding regularisation
scheme by introducing a notion of ‘’an empty point’’, each line can be
represented as the same number of points, which will allow us to map
over an array of such line points for processing the lines using
regular parallelism. However, the cost we pay is that even the
smallest line will be represented as the same number of points as the
longest line.

Another strategy for regularisation is to *flatten* the irregular
parallelism into regular parallelism and use segmented operations to
process each particular object. It turns out that there, in many
cases, is a simple approach to implement such flattening, using, as we
shall see, a technique called *expansion*, which will take care of all
the knitty gritty details of the flattening. The expansion approach is
centered around a function that we shall call `expand`

, which, as
the name suggests, expands a source array into a longer target array,
by expanding each individual source element into multiple target
elements, which can then be processed in parallel.

For implementing the `expand`

function using only parallel
operations, we shall make use of the segmented helper functions
defined in Section 7. In particular, we shall make use of
the functions `replicated_iota`

, `segmented_replicate`

, and
`segmented_iota`

.

Here is the generic type of the `expand`

function:

```
val expand 'a 'b : (sz: a -> i32) -> (get: a -> i32 -> b) -> []a -> []b
```

The function expands a source array into a target array given (1) a
function that determines, for each source element, how many target
elements it expands to and (2) a function that computes a particular
target element based on a source element and the target element number
associated with the source. As an example, the expression ```
expand
(\x->x) (*) [2,3,1]
```

returns the array `[0,2,0,3,6,0]`

. The
function is defined as follows:

```
def expand 'a 'b (sz: a -> i32) (get: a -> i32 -> b) (arr:[]a) : []b =
let szs = map sz arr
let idxs = replicated_iota szs
let iotas = segmented_iota (map2 (!=) idxs (rotate (i32.negate 1) idxs))
in map2 (\i j -> get arr[i] j) idxs iotas
```

## 9.2. Drawing Lines¶

In this section we demonstrate how to apply the
flattening-by-expansion technique for obtaining a work efficient line
drawing routine that draws lines fully in parallel. The technique
resembles the development by Blelloch [Ble90] with
the difference that it makes use of the `expand`

function defined in
the previous section. Given a number of line segments, each defined by its
end points \((x_1,y_1)\) and \((x_2,y_2)\), the algorithm will
find the set of all points constituting all the line segments.

We first present an algorithm that will find all points that constitutes a single line segment. For computing this set, observe that the number of points that make up the constituting set is the maximum of \(|x_2-x_1|\) and \(|y_2-y_1|\), the absolute values of the difference in \(x\)-coordinates and \(y\)-coordinates, respectively. Using this observation, the algorithm can independently compute the constituting set by first calculating the proper direction and slope of a line, relative to a particular starting point.

The simple line drawing routine is given as follows:

```
-- Finding points on a line
type point = (i32,i32)
type line = (point,point)
type points [n] = [n]point
def compare (v1:i32) (v2:i32) : i32 =
if v2 > v1 then 1 else if v1 > v2 then -1 else 0
def slope ((x1,y1):point) ((x2,y2):point) : f32 =
if x2==x1 then if y2>y1 then 1f32 else -1f32
else r32(y2-y1) / f32.abs(r32(x2-x1))
def linepoints ((x1,y1):point, (x2,y2):point) : points [] =
let len = 1 + i32.max (i32.abs(x2-x1)) (i32.abs(y2-y1))
let xmax = i32.abs(x2-x1) > i32.abs(y2-y1)
let (dir,sl) =
if xmax then (compare x1 x2, slope (x1,y1) (x2,y2))
else (compare y1 y2, slope (y1,x1) (y2,x2))
in map (\i -> if xmax
then (x1+i*dir,
y1+i32.f32(f32.round(sl*r32(i))))
else (x1+i32.f32(f32.round(sl*r32(i))),
y1+i*dir)) (iota len)
```

Futhark code that uses the `linepoints`

function for drawing
concrete lines is shown below:

```
-- Write to grid
def update [h] [w] [n] (grid: [h][w]i32)(xs:[n]i32)(ys:[n]i32): [h][w]i32 =
let is = map2 (\x y -> w*y+x) xs ys
let flatgrid = flatten grid
let ones = map (\ _ -> 1) is
in unflatten h w (scatter (copy flatgrid) is ones)
-- Sequential algorithm for drawing multiple lines
def drawlines [h] [w] [n] (grid: *[h][w]i32) (lines:[n]line) : [h][w]i32 =
loop (grid) for i < n do -- find points for line i
let (xs,ys) = unzip (linepoints (lines[i]))
in update grid xs ys
-- Draw lines on a 70 by 30 grid
def main : [][]i32 =
let height:i32 = 30
let width:i32 = 70
let grid : *[][]i32 = replicate height (replicate width 0)
let lines = [((58,20),(2,3)),((27,3),(2,28)),((5,20),(20,20)),
((4,10),(6,25)),((26,25),(26,2)),((58,20),(52,3))]
in drawlines grid lines
```

The function `main`

sets up a grid and calls the function
`drawlines`

, which takes care of sequentially updating the grid with
constituting points for each line, computed using the `linepoints`

function. The resulting points look like this:

An unfortunate problem with the line drawing routine shown above is
that it draws the lines sequentially, one by one, and therefore makes
only very limited use of a GPU’s parallel cores. There are various
ways one may mitigate this problem. One way could be to use `map`

to
draw lines in parallel. However, such an approach will require some
kind of padding to ensure that the map function will compute data of
the same length, no matter the length of the line. A more resource
aware approach will apply a flattening technique for computing all
points defined by all lines simultaneously. Using the `expand`

function defined in the previous section, all we need to do to
implement this approach is to provide (1) a function that determines
for a given line, the number of points that make up the line and (2) a
function that determines the `n`

’th point of a particular line, given
the index `n`

. The code for such an approach looks as follows:

```
-- Parallel flattened algorithm for turning lines into
-- points, using expansion.
def points_in_line ((x1,y1),(x2,y2)) =
i32.(1 + max (abs(x2-x1)) (abs(y2-y1)))
def get_point_in_line ((p1,p2):line) (i:i32) =
if i32.abs(p1.0-p2.0) > i32.abs(p1.1-p2.1)
then let dir = compare (p1.0) (p2.0)
let sl = slope p1 p2
in (p1.0+dir*i,
p1.1+i32.f32(f32.round(sl*r32 i)))
else let dir = compare (p1.1) (p2.1)
let sl = slope (p1.1,p1.0) (p2.1,p2.0)
in (p1.0+i32.f32(f32.round(sl*r32 i)),
p1.1+i*dir)
def drawlines [h][w][n] (grid:*[h][w]i32)
(lines:[n]line) :[h][w]i32 =
let (xs,ys) = expand points_in_line get_point_in_line lines
|> unzip
in update grid xs ys
```

Notice that the function `get_point_in_line`

distinguishes between
whether the number of points in the line is counted by the x-axis or
the y-axis. Notice also that the flattening technique can be applied
only because all lines have the same color. Otherwise, when two lines
intersect, the result would be undefined, due to the fact that
`scatter`

results in undefined behaviour when multiple values are
written into the same location of an array.

## 9.3. Drawing Triangles¶

Another example of an algorithm worthy of flattening is an algorithm
for drawing triangles. The algorithm that we present here is based on
the assumption that we already have a function for drawing multiple
horizontal lines in parallel. Luckily, we have such a function! The
algorithm is based on the property that any triangle can be split into
an *upper triangle* with a horizontal baseline and a *lower triangle*
with a horizontal ceiling. Just as the algorithm for drawing lines
makes use of the `expand`

function defined earlier, so will the
flattened algorithm for drawing triangles. A triangle is defined by
the three points representing the corners of the triangle:

```
type triangle = (point, point, point)
```

We shall make the assumption that the three points that define the triangle have already been sorted according to the y-axis. Thus, we can assume that the first point is the top point, the third point is the lowest point, and the second point is the middle point (according to the y-axis).

The first function we need to pass to the `expand`

function is a
function that determines the number of horizontal lines in the triangle:

```
def lines_in_triangle ((p,_,r):triangle) : i32 =
r.1 - p.1 + 1
```

The second function we need to pass to the `expand`

function is
somewhat more involved. We first define a function `dxdy`

, which
computes the inverse slope of a line between two points:

```
def dxdy (a:point) (b:point) : f32 =
let dx = b.0 - a.0
let dy = b.1 - a.1
in if dy == 0 then f32.i32 0
else f32.i32 dx f32./ f32.i32 dy
```

We can now define the function that, given a triangle and the horizontal line number in the triangle (counted from the top), returns the corresponding line:

```
def get_line_in_triangle ((p,q,r):triangle) (i:i32) =
let y = p.1 + i
in if i <= q.1 - p.1 then -- upper half
let sl1 = dxdy p q
let sl2 = dxdy p r
let x1 = p.0 + i32.f32(f32.round(sl1 * f32.i32 i))
let x2 = p.0 + i32.f32(f32.round(sl2 * f32.i32 i))
in ((x1,y),(x2,y))
else -- lower half
let sl1 = dxdy r p
let sl2 = dxdy r q
let dy = (r.1 - p.1) - i
let x1 = r.0 - i32.f32(f32.round(sl1 * f32.i32 dy))
let x2 = r.0 - i32.f32(f32.round(sl2 * f32.i32 dy))
in ((x1,y),(x2,y))
```

The function distinguishes between whether the line to compute resides in the upper or the lower subtriangle. Finally, we can define a parallel, work-efficient function that converts a number of triangles into lines:

```
def lines_of_triangles (xs:[]triangle) : []line =
expand lines_in_triangle get_line_in_triangle
(map normalize xs)
def draw (height:i32) (width:i32) : [][]i32 =
```

To see the code in action, here is a function that draws three triangles on a grid of height 30 and width 62:

```
def draw (height:i32) (width:i32) : [][]i32 =
let grid : *[][]i32 = replicate height (replicate width 0)
let triangles = [((5,10),(2,28),(18,20)),
((42,6),(58,10),(25,22)),
((8,3),(15,15),(35,7))]
let lines = lines_of_triangles triangles
in drawlines grid lines
```

The function makes use of both the `lines_of_triangles`

function
that we have defined here and the work efficient `drawlines`

function defined previously. Here is a plot of the result:

## 9.4. Primes by Expansion¶

We saw earlier in Section 5.7 how we could implement a
parallel algorithm for finding the number of primes below a given
number. We also found, however, that the algorithm presented was not
work-efficient. It is possible to implement a work-efficient algorithm
using the `expand`

function. We will leave the task as an exercise
for the reader.

## 9.5. Complex Flattening¶

Unfortunately, the flattening-by-expansion technique does not suit all
irregular problems. We shall now investigate how we can flatten a
highly irregular algorithm such as quick-sort. The Quick-sort
algorithm can be presented very elegantly in a functional
language. The function `qsort`

that we will define has the following
type:

```
val qsort 't [n] : (t -> t -> bool) -> [n]t -> [n]t
```

Given a comparison function (`<=`

) and an array of elements `xs`

,
`qsort (<=) xs`

returns an array with the elements in `xs`

sorted
according to `<=`

. Consider the following pseudo-code, which,
unfortunately, is not immediately Futhark code:

```
def qsort (<=) xs =
if length xs < 2 then xs
else let (left,middle,right) = partition (<=) xs[length xs / 2] xs
in qsort (<=) left ++ middle ++ qsort (<=) right
```

Here the function `partition`

returns three arrays with the first
array containing elements smaller than the *pivot* element ```
xs[length xs
/ 2]
```

, the second array containing elements equal to the pivot
element, and the third array containing elements that are greater than
the pivot element. There are multiple problems with this code. First,
the code makes use of recursion, which is not supported by
Futhark. Second, the kind of recursion used is not tail-recursion,
which means that it is not directly obvious how to eliminate the
recursion. Third, it is not clear how the code can avoid using an
excessive amount of memory instead of making use of inplace-updates
for the sorting. Finally, it seems that the code is inherently
task-parallel in nature and not particularly data-parallel.

The solution is to solve a slightly more general problem. More
precisely, we shall set out to sort a number of segments,
simultaneously, where each segment comprises a part of the
array. Notice that we are interested in supporting a notion of
*partial segmentation*, for which the segments of interest are
disjoint but do not necessarily together span the entire array. In
particular, the algorithm does not need to sort segments containing
previously chosen pivot values. Such segments are already located in
the correct positions, which means that they need not be moved around
by the segmented quick sort implementation.

We first define a type `sgm`

that specifies a segment of an
underlying one-dimensional array of values:

```
type sgm = {start:i32,sz:i32} -- segment descriptor
```

At top-level, the function `qsort`

is defined as follows, assuming a
function `step`

of type ```
(t -> t -> bool) -> *[n]t -> []sgm ->
(*[n]t,[]sgm)
```

:

```
def qsort [n] 't ((<=): t -> t -> bool) (xs:[n]t) : [n]t =
if n < 2 then xs
else (loop (xs,mms) = (copy xs,[{start=0,sz=n}])
while length mms > 0 do
step (<=) xs mms).0
```

The `step`

function is called initially with the array to be sorted
as argument together with a singleton array containing a segment
denoting the entire array to be sorted. The `step`

function is
called iteratively until the returned array of segments is empty. The
job of the `step`

function is to divide each segment into three new
segments based on pivot values found for each segment. After the step
function has reordered the values in the segments, the middle segment
(containing values equal to a pivot) need not be dealt with again in
the further process. A new array of segment descriptors is then
defined and after removing empty segment descriptors, the resulting
array of non-empty segment descriptors is returned by the `step`

function together with the reordered value array.

Before we can define the `step`

function, we first define a few
helper functions. Using the functions `segmented_iota`

and
`segmented_replicate`

, defined earlier, we can define a function for
finding all the indexes represented by an array of segments:

```
def idxs_values (sgms:[]sgm) : []i32 =
let sgms_szs : []i32 = map (\sgm -> sgm.sz) sgms
let is1 = segmented_replicate sgms_szs (map (\x -> x.start) sgms)
let fs = map2 (!=) is1 (rotate (i32.negate 1) is1)
let is2 = segmented_iota fs
in map2 (+) is1 is2
```

We also define a function `info`

that, given an ordering function
and two elements, returns `-1`

if the first element is less than the
second element, `0`

if the elements are identical, and `1`

if the
first element is greater than the second element:

```
def info 't ((<=): t -> t -> bool) (x:t) (y:t) : i32 =
if x <= y then if y <= x then 0 else -1
else 1
```

The following two functions `tripit`

and `tripadd`

are used for
converting the classification of elements into subsegments:

```
def tripit x = if x < 0 then (1,0,0)
else if x > 0 then (0,0,1) else (0,1,0)
def tripadd (a1:i32,e1:i32,b1:i32) (a2,e2,b2) =
(a1+a2,e1+e2,b1+b2)
```

We can now define the function `step`

that, besides from an ordering
function, takes as arguments (1) the array containing values and (2) an
array of segments to be sorted. The function returns a pair of a
reordered array of values and a new array of segments to be
sorted:

```
def step [n][m] 't ((<=): t -> t -> bool) (xs:*[n]t) (sgms:[m]sgm)
: (*[n]t,[]sgm) =
-- find a pivot for each segment
let pivots : []t = map (\sgm -> xs[sgm.start + sgm.sz/2]) sgms
-- find index into the segment that a value belongs to
let k = i32.sum (map (.sz) sgms)
let idxs = replicated_iota (map (\sgm -> sgm.sz) sgms) :> [k]i32
let is = idxs_values sgms :> [k]i32
-- for each value, how does it compare to the pivot associated
-- with the segment?
let infos =
map2 (\idx i -> info (<=) xs[i] pivots[idx]) idxs is
let orders : [](i32,i32,i32) = map tripit infos
-- compute segment descriptor
let flags =
[true] ++ (map2 (!=) idxs (rotate (i32.negate 1) idxs))[1:] :> [k]bool
-- compute partition sizes for each segment
let pszs =
segmented_reduce tripadd (0,0,0) flags orders :> [m](i32,i32,i32)
-- compute the new segments
let sgms' =
map2 (\(sgm:sgm) (a,e,b) -> [{start=sgm.start,sz=a},
{start=sgm.start+a+e,sz=b}]) sgms pszs
|> flatten
|> filter (\sgm -> sgm.sz > 1)
-- compute the new positions of the values in the present segments
let newpos : []i32 =
let where : [](i32,i32,i32) =
segmented_scan tripadd (0,0,0) flags orders
in map3 (\i (a,e,b) info ->
let (x,y,_) = pszs[i]
let s = sgms[i].start
in if info < 0 then s+a-1
else if info > 0 then s+b-1+x+y
else s+e-1+x) idxs where infos
let vs = map (\i -> xs[i]) is
let xs' = scatter xs newpos vs
in (xs',sgms')
```

The algorithm has best case work complexity \(O(n)\) (when all elements are identical), worst case work complexity \(O(n^2)\), and an average case work complexity of \(O(n \log n)\). It has best depth complexity \(O(1)\), worst depth complexity \(O(n)\) and average depth complexity \(O(\log n)\).