3. Practical Matters

The previous chapter introduced the Futhark language, the notion of parallel programming, and the most fundamental builtin functions. However, more knowledge is needed to write real high-quality Futhark programs. This chapter discusses various practicalities around Futhark programming: how to test and debug your code (Section 3.1), how to benchmark it once it works (Section 3.2), how to use the Futhark package manager to access library code (Section 3.3), and finally how to work around compiler limitations.

3.1. Testing and Debugging

This section discusses techniques for checking the correctness of Futhark programs via unit tests, as well as the debugging facilities provided by futharki.

The testing experience for Futhark is still rather raw. There are no advanced unit testing frameworks, no test generators or doc-testing, and certainly no property-based testing. Instead, we have futhark-test, which tests entry point functions against input/output example pairs. However, it is better than nothing, and quite simple to use. futhark-test will test the program with both a compiler (futhark-c by default, but this can be changed with --compiler) and futharki.

3.1.1. Testing with futhark-test

A Futhark program may contain a test block, which is a sequence of line comments in which one of the lines contains the divider -- ==. The lines preceding the divider are ignored, while the lines after are taken as a description of a test to perform. When futhark-test is passed one or more .fut files, it will look for test blocks and perform the tests they describe.

As an example, let us consider how to test a function for matrix multiplication. The function itself is defined as thus:

entry matmul [n][m][p] (x: [n][m]i32) (y: [m][p]i32): [n][p]i32 =
  map (\xr -> map (\yc -> reduce (+) 0 (map2 (*) xr yc))
                  (transpose y))

Note that we use entry instead of let in order for the function to be callable from the outside.

We then add a test block:

-- Matrix multiplication.
-- ==
-- entry: matmul

The first line is a human-readable description, the second is the divider, and the third specifies the entry point that we wish to test. If the entry point is main, this part can be elided.

We now come to the input/output sets, which are written as follows:

-- input { [[1, 2]] [[3], [4]] }
-- output { [[11]] }
-- input { [[1, 2], [3, 4]] [[5, 6], [7, 8]] }
-- output { [[19, 22], [43, 50]] }
-- input { [[1, 2]] [[3]] }

The values are enclosed in curly braces, and multiple whitespace-separated values can be given. Only a limited subset of the Futhark value syntax is supported: Primitive values and multidimensional arrays of primitive values. In particular, no records or tuples are permitted. This subset is exactly that which is supported by compiled Futhark executables. If you have a need for testing functions that take more sophisticated input types, you will need to encode them using primitive types, and then construct them in the test function itself.

It is also possible to write negative tests, where we assert that the program must fail for a given input. In our case, when the shape of the matrices don’t match up:

-- input { [[1, 2]] [[3]] }
-- error: matmul.fut:15

We provide a regular expression matching the expected error. In this case, we just assert that the correct line number is provided.

Type inference on the input/output values is not performed, so the types must be unambiguous. This means that the usual [] notation for an empty array will not work. Instead, a special empty(t) notation is used to represent an array of row type t. For example, we can test for empty arrays as such:

-- input { empty([]i32) empty([]i32) }
-- output { empty([]i32) }

Note also that since plain integer literals are assumed to be of type i32, and plain decimal literals to be of type f64, you will need to use type suffixes (Section 2.1) to write values of other types.

As a convenience, futhark-test considers functions returning n-tuples to really be functions returning n values. This means we can put multiple values in an output stanza, just as we do with input.

Finally, it is also possible to specify test data stored in a separate file. This is useful when testing with very large datasets, in particular when they use the binary data format. This is done with the notation @ file:

-- compiled input @ big_matrices.in
-- output @ big_matrices.out

This also shows another feature of futhark-test: if we precede input with the word compiled, that test is not run with futharki. This is useful for large tests that would take too long to run interpreted. There are more ways to filter which tests and programs should be skipped for a given invocation of futhark-test; see the manual for more information. Testing a Futhark Library

A Futhark library typically comprises a number of .fut files means to be include-ed by Futhark programs. Libraries typically do not define entry points of the form required by futhark-test. Indeed, it is not unusual for Futhark libraries to consist entirely of parametric modules and higher-order functions! These are not directly accessible to futhark-test.

The recommended solution is that, for every library file foo.fut, we define a corresponding foo_tests.fut that imports foo.fut and defines a number of entry points.

For example, suppose we have sum.fut that contains the sum module from Section 2.8.3:

module type monoid = {
  type t
  val add : t -> t -> t
  val zero : t

module sum (M: monoid) = {
  let sum (a: []M.t): M.t =
    reduce M.add M.zero a

This cannot be tested directly with futhark-test, but we can define a sum_tests.fut that can:

import "sum"

-- ==
-- entry: test_sum_add_i32
-- input { [1, 2, 3, 4] }
-- output { 10 }

module sum_add_i32 = sum { type t = i32
                           let add = (i32.+)
                           let zero = 0i32

entry test_sum_add_i32 = sum_add_i32.sum

-- ==
-- entry: test_sum_prod_f32
-- input { [1f32, 2f32, 3f32, 4f32] }
-- output { 24f32 }

module sum_prod_f32 = sum { type t = f32
                            let add = (f32.*)
                            let zero = 1f32

entry test_sum_prod_f32 = sum_prod_f32.sum

You will have to use your own judgment when deciding which specific instantiations of a generic library you feel are worth testing.

3.1.2. Traces and Breakpoints

Testing is useful for determining the correctness of code, but does not in itself pinpoint the source of bugs. While you can go far simply by structuring your code as small functions that can be tested in isolation, it is sometimes necessary to inspect internal state and behaviour.

Compiled Futhark code does not possess much in the way of debugging facilities, but futharki has a couple of useful tools. Since futharki is very slow when compared to compiled code, this does mean that we can only debug on cut-down smaller testing sets, not on realistic workloads.

Specifically, we use the two functions trace and break. The trace function has the following type:

trace 't : t -> t

Semantically, trace just returns its argument unchanged, and when compiling your Futhark code, this is indeed all that will happen. However, futharki treats trace specially, and will print the argument to the screen. This is useful for seeing the value of internal variables. For example, suppose we have the program trace.fut:

let main (xs: []i32) = map (\x -> trace x + 2) xs

We can then run it with futharki to get the following output:

$ echo [1,2,3] | futharki trace.fut
Trace at trace.fut:1:24-1:49: 1i32
Trace at trace.fut:1:24-1:49: 2i32
Trace at trace.fut:1:24-1:49: 3i32
[3i32, 4i32, 5i32]

Similarly, the break function is semantically also the identity function:

break 't : t -> t

When futharki encounters break, it suspends execution and lets us inspect the variables in scope. At the moment, this works only when running an expression within the futharki REPL, not when running directly from the command line. Suppose break.fut is:

let main (xs: []i32) = map (\x -> break x + 2) xs

Then we can load and run it from futharki:

[1]> main [1,2,3]
Breaking at [1]> :1:1-1:12 -> break.fut:1:24-1:49 -> /futlib/soacs.fut:35:3-35:24 -> break.fut:1:35-1:41.
<Enter> to continue.
> x
Breaking at [1]> :1:1-1:12 -> break.fut:1:24-1:49 -> /futlib/soacs.fut:35:3-35:24 -> break.fut:1:35-1:41.
<Enter> to continue.
Breaking at [1]> :1:1-1:12 -> break.fut:1:24-1:49 -> /futlib/soacs.fut:35:3-35:24 -> break.fut:1:35-1:41.
<Enter> to continue.
[3i32, 4i32, 5i32]

Whenever we are stopped at a break point, we can enter arbitrary Futhark expressions to inspect the state of the environment. This is useful when operating on complex values.

3.2. Benchmarking

Consider an implementation of the dot product of two integer vectors:

let main (x: []i32) (y: []i32): i32 =
  reduce (+) 0 (map2 (*) x y)

We previously mentioned that, for small data sets, sequential execution is likely to be much faster than parallel execution. But how much faster? To answer this question, we need to measure the run time of the program on some data sets. This task is called benchmarking. There are many properties one can benchmark: memory usage, size of compiled executable, robustness to errors, and so forth. In this section, we are only concerned with run time. Specifically, we wish to measure wall time, which is how much time elapses in the real world from the time the computation starts, to the time it ends.

There is still some wiggle room in how we benchmark. For example, should we measure the time it takes to load the input data from disk? Or time it takes to initialise various devices and drivers? Should we perform a clean shutdown? How many times should we run the program, and should we report maximum, minimum, or average run time? We will not try to answer all of these questions, but instead merely describe the benchmarking tools provided by Futhark.

3.2.1. Simple Measurements

First, let us compile dotprod.fut to two different executables, one for each compiler:

$ futhark-c dotprod.fut -o dotprod-c
$ futhark-opencl dotprod.fut -o dotprod-opencl

One way to time execution is to use the standard time(1) tool:

$ echo [2,2,3] [4,5,6] | time ./dotprod-c
0.00user 0.00system 0:00.00elapsed ...
$ echo [2,2,3] [4,5,6] | time ./dotprod-opencl
0.20user 0.07system 0:00.29elapsed ...

It seems that dotprod-c executes in less than 10 milliseconds, while dotprod-opencl takes about 290 milliseconds. However, this comparison is not useful, as it also measures time taken to read the input (for both executables), as well as time taken to initialise the OpenCL driver (for dotprod-opencl). Recall that in a real application, the Futhark program would be compiled as a library, and the startup cost paid just once, while the program may be invoked multiple times. A more precise run-time measurement, where parsing, initialisation, and printing of results is not included, can be performed using the -t command line option, which specifies a file where the run-time (measured in microseconds) should be put:

$ echo [2,2,3] [4,5,6] | ./dotprod-c -t /dev/stderr > /dev/null

In this case, we ask for the runtime to be printed to the screen, and for the normal evaluation result to be thrown away. Apparently it takes less than one microsecond to compute the dot product of two three-element vectors on a CPU (this is not very surprising). On an AMD Vega 64 GPU:

$ echo [2,2,3] [4,5,6] | ./dotprod-opencl -t /dev/stderr > /dev/null

Over 100 microseconds! Most GPUs have fairly high launch invocation latencies, and so are not suited for small problems. We can use the futhark-dataset(1) tool to generate random test data of a desired size:

$ futhark-dataset -g [10000000]i32 -g [10000000]i32 > input

Two ten million element vectors should be enough work to amortise the GPU startup cost:

$ cat input | ./dotprod-opencl -t /dev/stderr > /dev/null
$ cat input | ./dotprod-c -t /dev/stderr > /dev/null

That’s more like it! Parallel execution is now more than ten times faster than sequential execution. This program is entirely memory-bound; on a compute-bound program we can expect much larger speedups.

You may have notice that these programs take significantly longer to run than indicated by these performance measurements. While GPU initialisation does take some time, most of the actual run-time in the example above is spent reading the data file from disk. By default, futhark-dataset produces output in a data format that is human-readable, but very slow for programs to process. We can use the -b option to make futhark-dataset generate data in an efficient binary format (which takes up less space on disk as well):

$ futhark-dataset -b -g [10000000]i32 -g [10000000]i32 > input

Reading binary data files is often orders of magnitude faster than reading textual input files. Compiled Futhark programs also support binary output via a -b option. The futhark-dataset tool can perform conversion between the binary and human-readable formats; see the manual page for more information.

3.2.2. Multiple Measurements

The technique presented in the previous section still has some problems. In particular, it is impractical if you want several measurements on the same dataset, which is in general preferable to even out noise. While you can just repeat execution the desired number of times, this method has two problems:

  1. The input file will be read multiple times, which can be slow for large data sets.
  2. It prevents the device from “warming up”, as every run re-initialises the GPU and re-uploads code.

The second point is more important than it may seem. Certain OpenCL operations (such as memory allocation) are relatively costly, and Futhark uses various caches and buffers to minimise the number of expensive OpenCL operations. However, these caches will all be cold the first time the program runs. Hence we wish to perform more than one run per program instance, so that we can take advantage of the warm caches. This method is also a more plausible proxy for real-world usage of Futhark, as Futhark is typically compiled to a library, where the same functions are called repeatedly by some client code.

Compiled Futhark executables support an -r N option that asks the program to perform N runs internally, and report runtime for each. Additionally, a non-measured warm-up run is performed initially. We can use it like this:

$ cat input | ./dotprod-opencl -t /dev/stderr -r 10 > /dev/null

Our runtimes are now much better. And importantly, there are more of them, so we can perform analyses like, such as determining the variance, to figure out how predictable the performance is.

However, we can do better still. Futhark comes with a tool for performing automated benchmark runs of programs, called futhark-bench. This tool relies on a specially formatted header comment that contains input/output pairs, just like futhark-test (see Section 3.1). The Futhark User’s Guide contains a full description, but here is a simple example. First, we introduce a new program, sumsquares.fut, with smaller data sets for convenience:

-- Given N, compute the sum of squares of the first N integers.
-- ==
-- compiled input {       1000 } output {   332833500 }
-- compiled input {    1000000 } output {   584144992 }
-- compiled input { 1000000000 } output { -2087553280 }

let main (n: i32): i32 =
  reduce (+) 0 (map (**2) (iota n))

The line containing == is used to separate the human-readable benchmark description from input-output pairs. It is also possible to keep the data set is located in an external file (see the manual page for more information.).

We can use futhark-bench to measure the performance of sumsquares.fut as follows:
$ futhark-bench sumsquares.fut
Compiling src/sumsquares.fut...
Results for src/sumsquares.fut:
dataset #0 ("1000i32"):             0.20us (avg. of 10 runs; RSD: 2.00)
dataset #1 ("1000000i32"):        290.00us (avg. of 10 runs; RSD: 0.03)
dataset #2 ("1000000000i32"):  270154.20us (avg. of 10 runs; RSD: 0.01)

These are measurements using the default compiler, which is futhark-c. If we want to see how our program performs when compiled with futhark-opencl, we can invoke futhark-bench:

$ futhark-bench --compiler=futhark-opencl sumsquares.fut
Compiling src/sumsquares.fut...
Results for src/sumsquares.fut:
dataset #0 ("1000i32"):            49.70us (avg. of 10 runs; RSD: 0.18)
dataset #1 ("1000000i32"):         44.40us (avg. of 10 runs; RSD: 0.02)
dataset #2 ("1000000000i32"):    1693.80us (avg. of 10 runs; RSD: 0.04)

We can now compare the performance of CPU execution with GPU execution. The tool takes care of the mechanics of run-time measurements, and even computes the relative standard deviation (“RSD”) of the measurements for us. The correctness of the output is also automatically checked. By default, futhark-bench performs ten runs for every data set, but this number can be changed with the --runs command line option. Unless you can articulate a good reason not to, always use futhark-bench for benchmarking.

3.3. Package Management

A Futhark package is a downloadable collection of .fut files and little more. There is a (not necessarily comprehensive) list of known packages. The following discusses only how to use packages. For authoring your own, please see the corresponding section in the User’s Guide.

3.3.1. Basic Concepts

A package is uniquely identified with a package path, which is similar to a URL, except without a protocol. At the moment, package paths are always links to Git repositories hosted on GitHub or GitLab. As an example, a package path may be github.com/athas/fut-foo.

Packages are versioned with semantic version numbers of the form X.Y.Z. Whenever versions are indicated, all three digits must always be given (that is, 1.0 is not a valid shorthand for 1.0.0).

Most futhark-pkg operations involve reading and writing a package manifest, which is always stored in a file called futhark.pkg. The futhark.pkg file is human-editable, but is in day-to-day use mainly modified by futhark-pkg automatically. You will normally have one futhark.pkg file for each of your Futhark projects. Packages are installed in a location relative to the location of futhark.pkg.

3.3.2. Installing Packages

Required packages can be added by using futhark-pkg add, for example:

$ futhark-pkg add github.com/athas/fut-foo 0.1.0

This will create a new file futhark.pkg with the following contents:

require {
  github.com/athas/fut-foo 0.1.0 #d285563c25c5152b1ae80fc64de64ff2775fa733

This lists one required package, with its package path, minimum version, and the expected commit hash. The latter is used for verification, to ensure that the contents of a package version cannot silently change.

futhark-pkg will perform network requests to determine whether a package of the given name and with the given version exists and fail otherwise (but it will not check whether the package is otherwise well-formed). The version number can be elided, in which case futhark-pkg will use the newest available version. If the package is already present in futhark.pkg, it will simply have its version requirement changed to the one specified in the command. Any dependencies of the package will not be added to futhark.pkg, but will still be downloaded by futhark-pkg sync (see below).

Adding a package with futhark-pkg add modifies futhark.pkg, but does not download the package files. This is done with futhark-pkg sync (without further options). The contents of each required dependency and any transitive dependencies will be stored in a subdirectory of lib/ corresponding to their package path. Following the earlier example:

$ futhark-pkg sync
$ tree lib
└── github.com
    └── athas
        └── fut-foo
            └── foo.fut

3 directories, 1 file

Warning: futhark-sync will remove any unrecognized files or local modifications to files in lib/. Unless you are creating your own package, you should not add anything to the lib/ directory - it is fully controlled by futhark-pkg.

Packages can be removed from futhark.pkg with:

$ futhark-pkg remove pkgpath

You will need to run futhark-sync to actually remove the files in lib/.

The intended usage is that futhark.pkg is added to version control, but lib/ is not, as the contents of lib/ can always be reproduced from futhark.pkg. However, adding lib/ works just fine as well.

3.3.3. Importing Files from Dependencies

futhark-pkg sync will populate the lib/ directory, but does not interact with the compiler in any way. The downloaded files can be imported using the import mechanism (see Section 2.8.4). For example, assuming the package contains a file foo.fut, the following top-level declaration brings all names declared in the file into scope:

import "lib/github.com/athas/fut-foo/foo"

Ultimately, everything boils down to ordinary file system semantics. This has the downside of relatively long and clumsy import paths, but the upside of predictability.

3.3.4. Upgrading Dependencies

The futhark-pkg upgrade command will update every version requirement in futhark.pkg to be the most recent available version. You still need to run futhark-pkg sync to actually retrieve the new versions. Be careful - while upgrades are safe if semantic versioning is followed correctly, this is not yet properly machine-checked, so human mistakes may occur.

As an example:

$ cat futhark.pkg
require {
  github.com/athas/fut-foo 0.1.0 #d285563c25c5152b1ae80fc64de64ff2775fa733
$ futhark-pkg upgrade
Upgraded github.com/athas/fut-foo 0.1.0 => 0.2.1.
$ cat futhark.pkg
require {
  github.com/athas/fut-foo 0.2.1 #3ddc9fc93c1d8ce560a3961e55547e5c78bd0f3e
$ futhark-pkg sync
$ tree lib
└── github.com
    └── athas
        ├── fut-bar
        │   └── bar.fut
        └── fut-foo
            └── foo.fut

4 directories, 2 files

Note that fut-foo 0.2.1 depends on github.com/athas/fut-bar, so it was fetched by futhark-pkg sync.

futhark-pkg upgrade will never upgrade across a major version number. Due to the principle of Semantic Import Versioning, a new major version is a completely different package from the point of view of the package manager. Thus, to upgrade to a new major version, you will need to use futhark-pkg add to add the new version and futhark-pkg remove to remove the old version. Or you can keep it around - it is perfectly acceptable to depend on multiple major versions of the same package, because they are really different packages.

3.4. When Things Go Wrong

Futhark is a young language and an on-going research project, and you should not expect the same predictability and quality of error messages that you may be used to from more mature languages. Further, not all Futhark compilers are guaranteed to be able to compile all Futhark programs. In general, the limitations you will encounter will tend to fall in two categories:

limitations touch upon fundamental restrictions in the target platform(s) for the Futhark compiler. For example, GPUs do not permit dynamic memory allocation inside GPU code. All memory must be pre-allocated before GPU programs are launched. This means that the Futhark compiler must be able to pre-compute the size of all intermediate arrays (symbolically), or compilation will fail.
limitations are weaknesses in the Futhark compiler that could reasonably be solved. Many implementation limitations, such as the inability to pre-compute some array sizes, or eliminate bounds checks inside parallel sections, will manifest themselves as essential limitations that could be worked around by a smarter compiler.

For example, consider this program:

let main (n: i32): [][]i32 =
  map (\i ->
         let a = (0..<i)
         let b = (0..<n-i)
         in concat a b)

At the time of this writing, the futhark-opencl compiler will fail with the not particularly illuminating error message Cannot allocate memory in kernel. The reason is that the compiler is trying to compile the map to parallel code, which involves pre-allocating memory for the a and b array. It is unable to do this, as the sizes of these two arrays depend on values that are only known inside the map, which is too late. There are various techniques the Futhark compiler could use to estimate how much memory would be needed, but these have not yet been implemented.

It is usually possible, sometimes with some pain, to come up with a workaround. We could rewrite the program as:

let main(n: i32): [][]i32 =
  let scratch = (0..<n)
  in map (\i ->
            let res = (0..<n)
            let res[i:n] = scratch[0:n-i]
            in res)

This exploits the fact that the compiler does not generate allocations for array slices or in-place updates. The only allocation is of the initial scratch, the size of which can be computed before entering the map.