Dataflow – An Effective Model for High Performance Computing

Let’s look into a programming and execution model that has been around for a long time, but has not been thoroughly explored as a way to program high performance computing (HPC) applications: dataflow. It is quite intuitive and has many advantages for HPC.

Dataflow is a simple paradigm that describes applications or algorithms as a directed graph of interconnected processing blocks. Each block processes the data from its input ports to compute the data on its output ports. This is in fact the way data processing algorithms are often drawn on a whiteboard – blocks with arrows. In the literature, the processing blocks in dataflow graphs are often called actors. Here is an example of a simple dataflow graph that computes the sum of two random numbers and writes the result to memory:

Dataflow Graph
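To make this concrete, here is a minimal sketch of how such a graph might look in plain C++, with each actor written as an ordinary function and each edge as a buffer handed from one actor to the next. The names (random_source, add, write) are illustrative only and are not taken from any particular dataflow framework:

```cpp
// Sketch of the example graph: two random-number sources feed an adder,
// which feeds a writer actor.
#include <cstdlib>
#include <iostream>
#include <vector>

// Source actor: produces a block of random numbers on its output edge.
std::vector<double> random_source(std::size_t n) {
    std::vector<double> out(n);
    for (auto& x : out) x = static_cast<double>(std::rand()) / RAND_MAX;
    return out;
}

// Processing actor: consumes two input edges, produces their element-wise sum.
std::vector<double> add(const std::vector<double>& a, const std::vector<double>& b) {
    std::vector<double> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] + b[i];
    return out;
}

// Sink actor: writes the results (here simply printed to stdout).
void write(const std::vector<double>& data) {
    for (double x : data) std::cout << x << '\n';
}

int main() {
    auto a = random_source(8);   // edge: source 1 -> adder
    auto b = random_source(8);   // edge: source 2 -> adder
    write(add(a, b));            // edges: adder -> writer
}
```

In an actual dataflow framework the connections would be declared explicitly and the scheduling handled by the runtime; the point here is only the shape of the graph.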

Different dataflow actors share no data (‘shared-nothing’ semantics); the only communication is through the edges they are connected to. They can have any level of granularity: some actors may be as simple as an addition operation, while others perform highly complex algorithms.

A good example of a practical algorithm is a Monte-Carlo simulation – used widely across different domains and often very compute-intensive. Here, a large number of simulation paths is generated, each using different random numbers, and the result is the mean (or some other statistical measure) of all the per-path values. For example, in computational finance, Monte-Carlo simulations are used to price financial derivatives or for risk assessment/management. A typical Monte-Carlo simulation can be drawn as a dataflow graph as follows:

Monte-Carlo Method Dataflow

The path calculation actor takes samples from the random number generator and computes the per-path results for all paths. The reduction actor computes the mean of all paths, and the writer actor writes the final result – for example, into a file.
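As a rough illustration, the sketch below maps these actors onto plain C++ functions: a random-number source, a per-path calculation (here a trivial random walk, standing in for a real model), a mean reduction, and a writer. All names and the toy model are assumptions made for illustration only:

```cpp
#include <cstdio>
#include <random>
#include <vector>

// Source actor: one vector of normally distributed samples per path.
std::vector<double> rng_actor(std::size_t steps, std::mt19937& gen) {
    std::normal_distribution<double> dist(0.0, 1.0);
    std::vector<double> z(steps);
    for (auto& x : z) x = dist(gen);
    return z;
}

// Path-calculation actor: end value of a simple random walk (toy model).
double path_actor(const std::vector<double>& z) {
    double value = 0.0;
    for (double dz : z) value += dz;
    return value;
}

// Reduction actor: mean over all per-path results.
double reduce_actor(const std::vector<double>& per_path) {
    double sum = 0.0;
    for (double p : per_path) sum += p;
    return sum / per_path.size();
}

int main() {
    std::mt19937 gen(42);
    std::vector<double> per_path(10000);
    for (auto& p : per_path) p = path_actor(rng_actor(64, gen));
    std::printf("mean = %f\n", reduce_actor(per_path));  // writer actor
}
```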

So why is the dataflow model useful? Let’s look at some of the advantages of the dataflow programming model, especially for HPC applications…

Intuitive for Data Processing Algorithms

When asked to describe a data processing algorithm, domain specialists such as engineers, researchers, or mathematicians often walk to the whiteboard, draw boxes for the different processing stages, and connect them with arrows. This is effectively dataflow – and it shows that this way of thinking is natural in many problem domains.

Parallel Execution

The dataflow programming model, with its ‘shared-nothing’ semantics and explicitly expressed data dependencies, provides pipeline parallelism (a form of task parallelism) by its very nature: all actors can execute concurrently, each working on a different section of the data.

If an actor does not update any internal state, it can be applied to different sections of the input and output data concurrently without affecting the overall result – a form of data parallelism. In some cases the data flowing along the edges can also be vectorised (e.g. with SIMD instructions), so that a single instruction operates on multiple data elements at once, reducing the execution time further.
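The sketch below illustrates this with a hypothetical stateless actor applied to independent partitions of its input using std::async. Because the actor shares no state, no locks are needed, and its inner loop is the kind of code a compiler can vectorise:

```cpp
#include <future>
#include <iostream>
#include <vector>

// A stateless actor: squares each element of its input block.
std::vector<double> square_actor(std::vector<double> block) {
    for (auto& x : block) x *= x;   // this inner loop is also a SIMD candidate
    return block;
}

int main() {
    // The input edge split into independent sections (data parallelism).
    std::vector<std::vector<double>> partitions = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};

    // One concurrent task per partition; shared-nothing semantics mean no locks.
    std::vector<std::future<std::vector<double>>> tasks;
    for (const auto& part : partitions)
        tasks.push_back(std::async(std::launch::async, square_actor, part));

    // Gather the per-partition outputs.
    for (auto& t : tasks)
        for (double x : t.get()) std::cout << x << ' ';
    std::cout << '\n';
}
```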

The combination of these forms of parallelism can scale very well, and provides a good opportunity to exploit multicore CPUs, accelerator processors (such as GPUs, FPGAs or DSPs), and multiple nodes within a cluster/grid environment. The figure below illustrates the pipeline and data parallelism exposed by the dataflow model:

Task Parallelism and Data Parallelism


Deadlock-free and Race-free Code

Readers familiar with parallel programming will agree that deadlocks and race conditions are among the most difficult problems to debug. Worse yet, they might go undetected until the software is in production, when – depending on the phase of the moon and the user’s machine configuration – a race condition suddenly appears. The dataflow model eliminates these hazards by design – it is impossible for programmers to introduce deadlocks or races.

Memory Locality

As a general rule, the closer the memory is to the processor, the faster it is. Register accesses have virtually no latency, the various levels of cache are fast core- or processor-specific memories, and global external DRAM is the slowest of all. Actors in a dataflow graph share no data except their inputs and outputs, and all other memory they use is local to the actor itself. It can therefore be placed into faster, closer types of memory by compilers and runtime libraries, potentially reducing memory access latencies significantly. Below is a diagram of the memory hierarchy of a typical multicore CPU and a GPU:

Memory Hierarchy on CPUs and GPUs


Managing Large Data Volumes

Another advantage that is often neglected is the ability to process volumes of data that do not fit into memory. Due to the streaming nature of dataflow graph execution, a source actor can read parts of the input data from disk and pass them along the graph for processing. The sink actor can already write partial results back to disk while the source actor is reading the next inputs. This can all happen concurrently, allowing a graph to easily process more data than fits into memory, without changing anything in the program architecture, file handling, data partitioning, etc.
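A minimal, deliberately sequential sketch of this streaming idea is shown below. The file names, chunk size, and the trivial “processing” step are placeholders, and in an actual dataflow runtime the read, compute, and write stages would overlap rather than run one after the other:

```cpp
#include <cstddef>
#include <fstream>
#include <vector>

int main() {
    std::ifstream in("input.bin", std::ios::binary);    // source actor
    std::ofstream out("output.bin", std::ios::binary);  // sink actor

    std::vector<double> chunk(1 << 20);                 // one chunk at a time
    const std::streamsize bytes =
        static_cast<std::streamsize>(chunk.size() * sizeof(double));

    // Stream chunks through the graph; only one chunk is in memory at a time.
    while (in.read(reinterpret_cast<char*>(chunk.data()), bytes) || in.gcount() > 0) {
        const std::size_t n = static_cast<std::size_t>(in.gcount()) / sizeof(double);
        for (std::size_t i = 0; i < n; ++i) chunk[i] *= 2.0;  // processing actor (placeholder)
        out.write(reinterpret_cast<const char*>(chunk.data()),
                  static_cast<std::streamsize>(n * sizeof(double)));
    }
}
```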

Hardware Abstraction

All of the above advantages require no special annotations by the programmer – they are simply properties of the dataflow programming model. It is therefore possible for a software tool to generate the different levels of parallelism automatically and to optimise the use of the different memory levels depending on the target execution hardware. Such a tool can abstract away the hardware details and free the programmer from worrying about them. This not only makes programming easier, it also allows portable code to be written across different hardware platforms at a higher level.

Summary

I think it’s clear from the advantages listed here that the dataflow model can be hugely beneficial for high performance computing. It allows programmers to write algorithms using a simple paradigm – without worrying about hardware or parallelism at all – while still allowing a high-performance executable to be generated. What is required is a development tool that can automate this process – stay tuned, this will be covered in the next post.
