10. Irregular Flattening

In this chapter, we investigate a number of challenging irregular algorithms, which cannot be dealt with directly using Futhark’s moderate flattening technique discussed in Section 8.6.

10.1. Flattening by Expansion

For dealing with large non-regular problems, we need ways to regularise the problems so that they become tractable with the regular parallel techniques that we have seen demonstrated previously. One way to regularise a problem is by padding data such that the data fits a regular parallel schema. However, by doing so, we run the risk that the program will use too many parallel resources for computations on the padding data. This problem will arise, in particular, if the data is very irregular. As a simple, and also visualisable, example, consider the task of determining the points that make up a number of line segments given by sets of two points in a 2D grid. Whereas we may easily devise an algorithm for determining the grid points that make up a single line segment, it is not immediately obvious how we can efficiently regularise the problem of drawing multiple line segments, as each line segment will end up being represented by a different number of points. If we choose to implement a padding regularisation scheme by introducing a notion of ‘’an empty point’’, each line can be represented as the same number of points, which will allow us to map over an array of such line points for processing the lines using regular parallelism. However, the cost we pay is that even the smallest line will be represented as the same number of points as the longest line.

Another strategy for regularisation is to flatten the irregular parallelism into regular parallelism and use segmented operations to process each particular object. It turns out that there, in many cases, is a simple approach to implement such flattening, using, as we shall see, a technique called expansion, which will take care of all the knitty gritty details of the flattening. The expansion approach is centered around a function that we shall call expand, which, as the name suggests, expands a source array into a longer target array, by expanding each individual source element into multiple target elements, which can then be processed in parallel.

For implementing the expand function using only parallel operations, we shall make use of the segmented helper functions defined in Section 8. In particular, we shall make use of the functions replicated_iota, segmented_replicate, and segmented_iota.

Here is the generic type of the expand function:

val expand 'a 'b : (sz: a -> i32) -> (get: a -> i32 -> b) -> []a -> []b

The function expands a source array into a target array given (1) a function that determines, for each source element, how many target elements it expands to and (2) a function that computes a particular target element based on a source element and the target element number associated with the source. As an example, the expression expand (\x->x) (*) [2,3,1] returns the array [0,2,0,3,6,0]. The function is defined as follows:

def expand 'a 'b (sz: a -> i64) (get: a -> i64 -> b) (arr:[]a) : []b =
  let szs = map sz arr
  let idxs = replicated_iota szs
  let iotas = segmented_iota (map2 (!=) idxs (rotate (i64.neg 1) idxs))
  in map2 (\i j -> get arr[i] j) idxs iotas

10.2. Drawing Lines

In this section we demonstrate how to apply the flattening-by-expansion technique for obtaining a work efficient line drawing routine that draws lines fully in parallel. The technique resembles the development by Blelloch [Ble90] with the difference that it makes use of the expand function defined in the previous section. Given a number of line segments, each defined by its end points \((x_1,y_1)\) and \((x_2,y_2)\), the algorithm will find the set of all points constituting all the line segments.

We first present an algorithm that will find all points that constitutes a single line segment. For computing this set, observe that the number of points that make up the constituting set is the maximum of \(|x_2-x_1|\) and \(|y_2-y_1|\), the absolute values of the difference in \(x\)-coordinates and \(y\)-coordinates, respectively. Using this observation, the algorithm can independently compute the constituting set by first calculating the proper direction and slope of a line, relative to a particular starting point.

The simple line drawing routine is given as follows:

-- Finding points on a line
type point = (i64,i64)
type line = (point,point)
type points [n] = [n]point

def compare (v1:i64) (v2:i64) : i64 =
  if v2 > v1 then 1 else if v1 > v2 then -1 else 0

def slope ((x1,y1):point) ((x2,y2):point) : f32 =
  if x2==x1 then if y2>y1 then 1f32 else -1f32
                 else f32.i64(y2-y1) / f32.abs(f32.i64(x2-x1))

def linepoints ((x1,y1):point, (x2,y2):point) : points [] =
  let len = 1 + i64.max (i64.abs(x2-x1)) (i64.abs(y2-y1))
  let xmax = i64.abs(x2-x1) > i64.abs(y2-y1)
  let (dir,sl) =
    if xmax then (compare x1 x2, slope (x1,y1) (x2,y2))
    else (compare y1 y2, slope (y1,x1) (y2,x2))
  in map (\i -> if xmax
                then (x1+i*dir,
                else (x1+i64.f32(f32.round(sl*f32.i64(i))),
                      y1+i*dir)) (iota len)

Futhark code that uses the linepoints function for drawing concrete lines is shown below:

-- Write to grid
def update [h] [w] [n] (grid: [h][w]i64)(xs:[n]i64)(ys:[n]i64): [h][w]i64 =
  let is = map2 (\x y -> w*y+x) xs ys
  let flatgrid = flatten grid
  let ones = map (\ _ -> 1) is
  in unflatten (scatter (copy flatgrid) is ones)

-- Sequential algorithm for drawing multiple lines
def drawlines [h] [w] [n] (grid: *[h][w]i64) (lines:[n]line) : [h][w]i64 =
  loop (grid) for i < n do -- find points for line i
    let (xs,ys) = unzip (linepoints (lines[i]))
    in update grid xs ys

-- Draw lines on a 70 by 30 grid
def main : [][]i64 =
  let height:i64 = 30
  let width:i64 = 70
  let grid : *[][]i64 = replicate height (replicate width 0)
  let lines = [((58,20),(2,3)),((27,3),(2,28)),((5,20),(20,20)),
  in drawlines grid lines

The function main sets up a grid and calls the function drawlines, which takes care of sequentially updating the grid with constituting points for each line, computed using the linepoints function. The resulting points look like this:


An unfortunate problem with the line drawing routine shown above is that it draws the lines sequentially, one by one, and therefore makes only very limited use of a GPU’s parallel cores. There are various ways one may mitigate this problem. One way could be to use map to draw lines in parallel. However, such an approach will require some kind of padding to ensure that the map function will compute data of the same length, no matter the length of the line. A more resource aware approach will apply a flattening technique for computing all points defined by all lines simultaneously. Using the expand function defined in the previous section, all we need to do to implement this approach is to provide (1) a function that determines for a given line, the number of points that make up the line and (2) a function that determines the n’th point of a particular line, given the index n. The code for such an approach looks as follows:

-- Parallel flattened algorithm for turning lines into
-- points, using expansion.

def points_in_line ((x1,y1),(x2,y2)) =
  i64.(1 + max (abs(x2-x1)) (abs(y2-y1)))

def get_point_in_line ((p1,p2):line) (i:i64) =
  if i64.abs(p1.0-p2.0) > i64.abs(p1.1-p2.1)
  then let dir = compare (p1.0) (p2.0)
       let sl = slope p1 p2
       in (p1.0+dir*i,
           p1.1+i64.f32(f32.round(sl*f32.i64 i)))
    else let dir = compare (p1.1) (p2.1)
         let sl = slope (p1.1,p1.0) (p2.1,p2.0)
         in (p1.0+i64.f32(f32.round(sl*f32.i64 i)),

def drawlines [h][w][n] (grid:*[h][w]i64)
                        (lines:[n]line) :[h][w]i64 =
  let (xs,ys) = expand points_in_line get_point_in_line lines
              |> unzip
  in update grid xs ys

Notice that the function get_point_in_line distinguishes between whether the number of points in the line is counted by the x-axis or the y-axis. Notice also that the flattening technique can be applied only because all lines have the same color. Otherwise, when two lines intersect, the result would be undefined, due to the fact that scatter results in undefined behaviour when multiple values are written into the same location of an array.

10.3. Drawing Triangles

Another example of an algorithm worthy of flattening is an algorithm for drawing triangles. The algorithm that we present here is based on the assumption that we already have a function for drawing multiple horizontal lines in parallel. Luckily, we have such a function! The algorithm is based on the property that any triangle can be split into an upper triangle with a horizontal baseline and a lower triangle with a horizontal ceiling. Just as the algorithm for drawing lines makes use of the expand function defined earlier, so will the flattened algorithm for drawing triangles. A triangle is defined by the three points representing the corners of the triangle:

type triangle = (point, point, point)

We shall make the assumption that the three points that define the triangle have already been sorted according to the y-axis. Thus, we can assume that the first point is the top point, the third point is the lowest point, and the second point is the middle point (according to the y-axis).

The first function we need to pass to the expand function is a function that determines the number of horizontal lines in the triangle:

def lines_in_triangle ((p,_,r):triangle) : i64 =
  i64.i32 (r.1 - p.1 + 1)

The second function we need to pass to the expand function is somewhat more involved. We first define a function dxdy, which computes the inverse slope of a line between two points:

def dxdy (a:point) (b:point) : f32 =
  let dx = b.0 - a.0
  let dy = b.1 - a.1
  in if dy == 0 then f32.i32 0
     else f32.i32 dx f32./ f32.i32 dy

We can now define the function that, given a triangle and the horizontal line number in the triangle (counted from the top), returns the corresponding line:

def get_line_in_triangle ((p,q,r):triangle) (i:i64) =
  let y = p.1 + i32.i64 i
  in if i32.i64 i <= q.1 - p.1 then     -- upper half
       let sl1 = dxdy p q
       let sl2 = dxdy p r
       let x1 = p.0 + i32.f32(f32.round(sl1 * f32.i64 i))
       let x2 = p.0 + i32.f32(f32.round(sl2 * f32.i64 i))
       in ((x1,y),(x2,y))
     else                       -- lower half
       let sl1 = dxdy r p
       let sl2 = dxdy r q
       let dy = (r.1 - p.1) - i32.i64 i
       let x1 = r.0 - i32.f32(f32.round(sl1 * f32.i32 dy))
       let x2 = r.0 - i32.f32(f32.round(sl2 * f32.i32 dy))
       in ((x1,y),(x2,y))

The function distinguishes between whether the line to compute resides in the upper or the lower subtriangle. Finally, we can define a parallel, work-efficient function that converts a number of triangles into lines:

def lines_of_triangles (xs:[]triangle) : []line =
  expand lines_in_triangle get_line_in_triangle
         (map normalize xs)

def draw (height:i64) (width:i64) : [][]i32 =

To see the code in action, here is a function that draws three triangles on a grid of height 30 and width 62:

def draw (height:i64) (width:i64) : [][]i32 =
  let grid : *[][]i32 = replicate height (replicate width 0)
  let triangles = [((5,10),(2,28),(18,20)),
  let lines = lines_of_triangles triangles
  in drawlines grid lines

The function makes use of both the lines_of_triangles function that we have defined here and the work efficient drawlines function defined previously. Here is a plot of the result:


10.4. Primes by Expansion

We saw earlier in Section 6.7 how we could implement a parallel algorithm for finding the number of primes below a given number. We also found, however, that the algorithm presented was not work-efficient. It is possible to implement a work-efficient algorithm using the expand function. We will leave the task as an exercise for the reader.

10.5. Complex Flattening

Unfortunately, the flattening-by-expansion technique does not suit all irregular problems. We shall now investigate how we can flatten a highly irregular algorithm such as quick-sort. The Quick-sort algorithm can be presented very elegantly in a functional language. The function qsort that we will define has the following type:

val qsort 't [n] : (t -> t -> bool) -> [n]t -> [n]t

Given a comparison function (<=) and an array of elements xs, qsort (<=) xs returns an array with the elements in xs sorted according to <=. Consider the following pseudo-code, which, unfortunately, is not immediately Futhark code:

def qsort (<=) xs =
  if length xs < 2 then xs
  else let (left,middle,right) = partition (<=) xs[length xs / 2] xs
       in qsort (<=) left ++ middle ++ qsort (<=) right

Here the function partition returns three arrays with the first array containing elements smaller than the pivot element xs[length xs / 2], the second array containing elements equal to the pivot element, and the third array containing elements that are greater than the pivot element. There are multiple problems with this code. First, the code makes use of recursion, which is not supported by Futhark. Second, the kind of recursion used is not tail-recursion, which means that it is not directly obvious how to eliminate the recursion. Third, it is not clear how the code can avoid using an excessive amount of memory instead of making use of inplace-updates for the sorting. Finally, it seems that the code is inherently task-parallel in nature and not particularly data-parallel.

The solution is to solve a slightly more general problem. More precisely, we shall set out to sort a number of segments, simultaneously, where each segment comprises a part of the array. Notice that we are interested in supporting a notion of partial segmentation, for which the segments of interest are disjoint but do not necessarily together span the entire array. In particular, the algorithm does not need to sort segments containing previously chosen pivot values. Such segments are already located in the correct positions, which means that they need not be moved around by the segmented quick sort implementation.

We first define a type sgm that specifies a segment of an underlying one-dimensional array of values:

At top-level, the function qsort is defined as follows, assuming a function step of type (t -> t -> bool) -> *[n]t -> []sgm -> (*[n]t,[]sgm):

def qsort [n] 't ((<=): t -> t -> bool) (xs:[n]t) : [n]t =
  if n < 2 then xs
  else (loop (xs,mms) = (copy xs,[{start=0,sz=n}])
        while length mms > 0 do

The step function is called initially with the array to be sorted as argument together with a singleton array containing a segment denoting the entire array to be sorted. The step function is called iteratively until the returned array of segments is empty. The job of the step function is to divide each segment into three new segments based on pivot values found for each segment. After the step function has reordered the values in the segments, the middle segment (containing values equal to a pivot) need not be dealt with again in the further process. A new array of segment descriptors is then defined and after removing empty segment descriptors, the resulting array of non-empty segment descriptors is returned by the step function together with the reordered value array.

Before we can define the step function, we first define a few helper functions. Using the functions segmented_iota and segmented_replicate, defined earlier, we can define a function for finding all the indexes represented by an array of segments:

-- find the indexes into values in segments
def idxs_values (sgms:[]sgm) : []i64 =
  let sgms_szs : []i64 = map (\sgm -> sgm.sz) sgms
  let is1 = segmented_replicate sgms_szs (map (\x -> x.start) sgms)
  let fs = map2 (!=) is1 (rotate (-1) is1)
  let is2 = segmented_iota fs

We also define a function info that, given an ordering function and two elements, returns -1 if the first element is less than the second element, 0 if the elements are identical, and 1 if the first element is greater than the second element:

def info 't ((<=): t -> t -> bool) (x:t) (y:t) : i64 =
  if x <= y then if y <= x then 0 else -1
  else 1

The following two functions tripit and tripadd are used for converting the classification of elements into subsegments:

def tripit (x: i64) : (i64,i64,i64) =
  if x < 0 then (1,0,0)
  else if x > 0 then (0,0,1) else (0,1,0)

def tripadd (a1:i64,e1:i64,b1:i64) (a2,e2,b2) =

We can now define the function step that, besides from an ordering function, takes as arguments (1) the array containing values and (2) an array of segments to be sorted. The function returns a pair of a reordered array of values and a new array of segments to be sorted:

def step [n][m] 't ((<=): t -> t -> bool) (xs:*[n]t) (sgms:[m]sgm)
  : (*[n]t,[]sgm) =
  -- find a pivot for each segment
  let pivots : []t = map (\sgm -> xs[sgm.start + sgm.sz/2]) sgms

  -- find index into the segment that a value belongs to
  let k = i64.sum (map (.sz) sgms)
  let idxs = replicated_iota (map (.sz) sgms) :> [k]i64
  let is = idxs_values sgms :> [k]i64

  -- for each value, how does it compare to the pivot associated
  -- with the segment?
  let infos =
    map2 (\idx i -> info (<=) xs[i] pivots[idx]) idxs is
  let orders : [](i64,i64,i64) = map tripit infos

  -- compute segment descriptor
  let flags =
    [true] ++ (map2 (!=) idxs (rotate (-1) idxs))[1:] :> [k]bool

  -- compute partition sizes for each segment
  let pszs =
    segmented_reduce tripadd (0,0,0) flags orders :> [m](i64,i64,i64)

  -- compute the new segments
  let sgms' =
    map2 (\(sgm:sgm) (a,e,b) -> [{start=sgm.start,sz=a},
                                 {start=sgm.start+a+e,sz=b}]) sgms pszs
    |> flatten
    |> filter (\sgm -> sgm.sz > 1)

  -- compute the new positions of the values in the present segments
  let newpos : []i64 =
    let where : [](i64,i64,i64) =
      segmented_scan tripadd (0,0,0) flags orders
    in map3 (\i (a,e,b) info ->
             let (x,y,_) = pszs[i]
             let s = sgms[i].start
             in if info < 0 then s+a-1
                else if info > 0 then s+b-1+x+y
                else s+e-1+x) idxs where infos

  let vs = map (\i -> xs[i]) is
  let xs' = scatter xs newpos vs

The algorithm has best case work complexity \(O(n)\) (when all elements are identical), worst case work complexity \(O(n^2)\), and an average case work complexity of \(O(n \log n)\). It has best depth complexity \(O(1)\), worst depth complexity \(O(n)\) and average depth complexity \(O(\log n)\).