There is a large number of high performance processors available these days, each with its own characteristics, and the landscape is changing quickly as new processors are released. There are CPUs, GPUs, FPGAs, Xeon Phi, DSPs, to name just a few. How should one decide which of these processors to use for a particular task? Should a combination of these processors be used jointly to get the best performance? And how can the complexity of handling these devices be managed? In the following, we’ll attempt to answer these questions, in particular for users and applications in the financial services domain.
The most common hardware processor is the CPU. These are very general purpose processors that can solve a wide range of computing problems. Today’s CPUs are multi-core (typically 2-12 cores), include several layers of cache, provide vector processing units, and include several instruction pipelines. This makes it possible to run a number of tasks in parallel and to optimise code for parallelism, cache utilisation, and vectorisation. Each of the CPU cores is relatively big and complex compared to the cores of the other processors that follow.
In recent years, graphics processing units (GPUs) have become popular for general purpose computing. They include a large number of small processing cores (100-2000) in an architecture optimised for highly parallel workloads, typically paired with dedicated high performance memory. They are co-processors, used from a general purpose CPU, that can deliver very high performance for a subset of algorithms.
Earlier this year, Intel released its Xeon Phi processor (see our blogpost). This is a many-core accelerator processor based on the standard x86 CPU architecture, using many – but small – cores. It achieves its performance through high levels of parallelism, by parallel execution across cores, very wide vector units, and high memory bandwidth. It is a co-processor specifically designed for highly parallel workloads, paired with a general purpose CPU.
Traditionally, high performance tasks with real-time requirements have been handled by platforms such as field-programmable gate arrays (FPGAs) or digital signal processors (DSPs). These devices are capable of achieving real-time performance for a subset of applications, consume little power, and offer high levels of parallelism. They are closer to integrated hardware and provide dedicated silicon for common tasks in the target application domain, most notably signal processing (e.g. multiply-add units, wide vector units, long pipelines, lookup tables, on-chip memory, or shift registers). FPGAs and DSPs are specialised platforms aimed at specific application domains.
Each of these platforms has its own characteristics as shown in the table below:
|Raw Compute Power||medium||high||high||medium||high|
Does One Fit All?
In most real-world applications, different parts of the application have different characteristics. Some parts can only run sequentially, some can be computed in parallel, others use large amounts of data – all within the same application. And worse yet, it is generally not known a priori just how well a particular application can work on a particular processor, as specific optimisations and restructuring might be required for each.
The following table summarizes how each discussed processor type is suited to a particular type of workload:
In computational finance, many compute-intensive algorithms are Monte-Carlo simulations. These are embarrassingly parallel by nature, but loading the data from files is inherently sequential. Further, especially in risk management algorithms, large amounts of data need to be handled, which makes the application memory-intensive. For options with American exercise features, an American Monte-Carlo technique has to be used, which includes a sequential regression step. Some option pricing algorithms use finite difference methods for solving partial differential equations (PDEs), which are computed partly iteratively. Lattice-based methods, also used in option pricing, have iterative parts as well. None of these applications fits into a single category mentioned in the table above.
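To make the mix of parallel and sequential work concrete, here is a minimal C++ sketch of a Monte-Carlo pricer for a plain European call under Black-Scholes dynamics. It is an illustration only, not code from any application discussed here: the function names `simulateBlock` and `priceEuropeanCall`, the seeding scheme, and the use of `std::thread` are all assumptions made for this example. The path blocks share no state and can run in parallel, while the final reduction over the partial sums is sequential.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <thread>
#include <vector>

// Sum of discounted call payoffs over one independent block of paths.
// Each block owns its random number generator, so blocks share no state.
double simulateBlock(std::uint64_t seed, std::size_t numPaths,
                     double spot, double strike, double rate,
                     double vol, double maturity) {
    std::mt19937_64 rng(seed);
    std::normal_distribution<double> gauss(0.0, 1.0);
    const double drift = (rate - 0.5 * vol * vol) * maturity;
    const double diffusion = vol * std::sqrt(maturity);
    double sum = 0.0;
    for (std::size_t i = 0; i < numPaths; ++i) {
        const double terminal = spot * std::exp(drift + diffusion * gauss(rng));
        sum += std::max(terminal - strike, 0.0);
    }
    return sum * std::exp(-rate * maturity);
}

// Parallel phase: one thread per block of paths.
// Sequential phase: the reduction over the partial sums.
// Paths left over when numPaths is not divisible by numThreads are dropped.
double priceEuropeanCall(std::size_t numPaths, unsigned numThreads,
                         double spot, double strike, double rate,
                         double vol, double maturity) {
    assert(numThreads > 0 && numPaths >= numThreads);
    const std::size_t perThread = numPaths / numThreads;
    std::vector<double> partial(numThreads, 0.0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t) {
        // Assumption for this sketch: a simple per-thread seed offset.
        workers.emplace_back([&partial, t, perThread, spot, strike, rate, vol, maturity] {
            partial[t] = simulateBlock(1000 + t, perThread, spot, strike, rate, vol, maturity);
        });
    }
    for (auto& w : workers) w.join();
    double total = 0.0;
    for (double p : partial) total += p;
    return total / static_cast<double>(perThread * numThreads);
}
```

Calling `priceEuropeanCall(400000, 4, 100.0, 100.0, 0.05, 0.2, 1.0)` should land close to the Black-Scholes value of about 10.45 for these inputs. In a real risk application the sequential share is larger than this sketch suggests, since data loading and any regression step sit outside the parallel loop.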
Take for example a Counterparty Credit Risk / Credit Valuation Adjustment (CVA) application that a tier-one investment bank has implemented (see this blogpost). It values all financial instruments within each counterparty for a large set of market scenarios and time steps in a Monte-Carlo simulation. The amount of data handled relative to the compute complexity varies significantly with the number of instruments within a counterparty. Measurements showed that for counterparties with fewer than 4 instruments, the multi-core CPU performs better than the GPU, while GPU acceleration pays off for all other counterparties. Similarly, in our Intel Xeon Phi vs. Sandy Bridge blogpost we saw that for low numbers of paths the CPU performed better than the Phi, and the other way around for higher path numbers.
So clearly, one does not fit all – even within the same application, different platforms should be used for different parts of the application to achieve the best possible overall performance.
Portability is Key
The traditional approach would be to maintain a code base for each specific hardware platform, each tuned for performance using techniques such as threading, vectorisation, cache optimisations, etc. The application can then be tested on each platform to choose the best combination. With this, the code becomes platform-specific and separate maintenance for each platform is required. In a practical setup, algorithms and applications evolve with time and the hardware evolves as well. So if platform X performs slower than platform Y for application Z today, this might not hold in the future. Multiple code bases have to be maintained to be able to react to these changes, which is clearly not a practical solution.
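Whichever approach is taken, the platform comparison only stays valid if it is re-measured as code and hardware change. A minimal C++ sketch of such a measurement step is shown below. The helper names `medianSeconds` and `fastestPlatform` and the platform labels are hypothetical, chosen for this example; the work to be timed is passed in as any callable.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Median wall-clock time in seconds over `reps` runs of `work`.
// For an even number of runs this returns the upper of the two middle samples.
// Returns 0.0 if reps is not positive.
template <typename Work>
double medianSeconds(Work&& work, int reps) {
    if (reps <= 0) return 0.0;
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(reps));
    for (int i = 0; i < reps; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double>(stop - start).count());
    }
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}

using Timing = std::pair<std::string, double>;  // platform label, seconds

// Label of the platform with the smallest measured time; empty if none given.
std::string fastestPlatform(const std::vector<Timing>& timings) {
    if (timings.empty()) return "";
    const auto best = std::min_element(
        timings.begin(), timings.end(),
        [](const Timing& a, const Timing& b) { return a.second < b.second; });
    return best->first;
}
```

The median is used rather than the mean so that a single slow run, for example a cold cache or a one-off device initialisation, does not dominate the comparison.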
Therefore, compromises are often made: typically easy maintenance is favoured and performance is sacrificed. That is, the code is developed for a standard CPU and not optimised for any particular platform, as maintaining code bases for different accelerator processors is difficult and the benefit is either not known beforehand or does not justify the effort. The best solution, however, would be a single code base that is easy to maintain, written in such a way that it can run on a wide variety of hardware platforms, for example using the Xcelerit SDK. This makes it possible to exploit hybrid hardware configurations to the best advantage and keeps the code portable to future platforms.
In the CVA computation example discussed above, this simple switch in the user code achieved the best possible performance:
    if (numInstruments < 4)
        instVal.executeOn(XCL_CPU);
    else
        instVal.executeOn(XCL_GPU);
This combination of CPU and GPU outperformed the sequential CPU execution by a factor of 224 for the full application.
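The same decision rule can be written down without any SDK to show how a whole book of counterparties would be split between devices. The C++ sketch below is a hypothetical stand-in and not the Xcelerit API: the `Device` enum, the `Counterparty` and `Schedule` types, and the functions `chooseDevice` and `partition` are assumptions made for illustration. The default threshold of 4 instruments is taken from the measurement quoted above.

```cpp
#include <cstddef>
#include <vector>

enum class Device { Cpu, Gpu };  // hypothetical stand-in for XCL_CPU / XCL_GPU

struct Counterparty {
    int id;
    std::size_t numInstruments;
};

// Small counterparties stay on the CPU; larger ones go to the GPU.
Device chooseDevice(std::size_t numInstruments, std::size_t gpuThreshold = 4) {
    return numInstruments < gpuThreshold ? Device::Cpu : Device::Gpu;
}

struct Schedule {
    std::vector<int> cpuIds;  // counterparties to value on the CPU
    std::vector<int> gpuIds;  // counterparties to value on the GPU
};

// Split a book of counterparties into per-device work lists.
Schedule partition(const std::vector<Counterparty>& book,
                   std::size_t gpuThreshold = 4) {
    Schedule schedule;
    for (const Counterparty& c : book) {
        if (chooseDevice(c.numInstruments, gpuThreshold) == Device::Cpu) {
            schedule.cpuIds.push_back(c.id);
        } else {
            schedule.gpuIds.push_back(c.id);
        }
    }
    return schedule;
}
```

Keeping the threshold as a parameter rather than a hard-coded constant means the crossover point can be re-tuned when the hardware or the instrument mix changes, without touching the valuation code.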
The Winning Combination is Hybrid
So which hardware infrastructure should be chosen? It depends: the best performance can be achieved using a combination of different processors. The Future Is Hybrid! Each firm will have its own winning combination of processors, and each application will use a different combination. And this is subject to change: the software needs to be able to follow the hardware evolution and at the same time deliver the performance.