Accelerators battle for compute-intensive analytics in Finance
At Xcelerit, people often ask us: “Which is better, Intel Xeon Phi or NVIDIA Kepler?” The general answer has to be “it depends,” as this is heavily application-dependent. But what if we zoom in on real-world problems in computational finance, the kinds of problems that quants in investment banks and the wider financial industry deal with every day? Let’s analyse two different financial applications and see how they perform on each platform. To cover different types of algorithms often found in finance, we chose an embarrassingly parallel Monte-Carlo algorithm (with fully independent paths) for the first test application, and a Monte-Carlo algorithm with cross-path dependencies and iterative time-stepping for the second.
[Update 1-Oct-2013: (American Monte-Carlo application only)]
 Algorithm update and avoiding temporary storage: affects GPU and Xeon Phi heavily, updated the performance numbers
 Updated performance figures for Ivy Bridge CPU (Xeon E5-2697 v2) and Xeon Phi Processor (Xeon Phi 7120P)
 Replaced absolute times with speedups vs. sequential for better readability
Note: We chose to compare just one instrument pricing for each algorithm, as these are at the core of many performance-critical applications used in banks. For example, large batches of many thousands of instruments are priced at once, instruments are valued under many different risk scenarios, or prices are updated in real time. As each pricing is independent, the performance typically scales linearly. In any case, reducing the execution time of a single pricing as much as possible is paramount. All source code is available on request.
Hardware Specifications
We’ll compare Intel’s Xeon Phi 5110P (Xeon Phi) and NVIDIA’s Tesla K20X (Tesla GPU) processors for our test applications. For reference, we also run on Intel’s Sandy Bridge Xeon E5-2670 server processor (Sandy Bridge). Their hardware specifications are:
|  | Xeon E5-2670 | Xeon Phi 5110P | Tesla K20X |
| --- | --- | --- | --- |
| Cores | 8 | 60 | 14 SMX |
| Logical Cores | 16 (HT) | 240 (HT) | 2,688 CUDA cores |
| Frequency | 2.60GHz | 1.053GHz | 735MHz |
| GFLOPs (double) | 333 | 1,010 | 1,317 |
| SIMD width | 256 Bits | 512 Bits | N/A |
| Memory | ~16-128GB | 8GB | 6GB |
| Memory B/W | 51.2GB/s | 320GB/s | 250GB/s |
| Threading | software | software | hardware |
As can be seen, the features and architecture of the processors are very different, and from the theoretical GFLOPs and memory bandwidth, both the Xeon Phi and Tesla GPU should be similar in performance. However, as we’ll see later, theory and practice often diverge and for a given application the performance depends on many other factors.
For example, to fully utilize both memory and processor, and in the absence of cache, the Sandy Bridge CPU would need to perform 52 floating point operations for each 8-byte memory operation. Anything less would make the processor wait for data from memory, i.e., the application would be memory-bound. On the Xeon Phi, 25.3 operations per memory operation need to be performed, and the Tesla GPU needs 42.2 operations per memory access. If there are more operations than that, the application becomes compute-bound. However, with the presence of on-chip caches, this picture changes completely as data kept in cache can be accessed much faster than external memory. Performance also becomes less predictable, and it cannot be determined theoretically which processor is best for a given application and whether it is compute- or memory-bound.
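The break-even figures quoted above follow from the specification table: divide the peak double-precision throughput by the number of 8-byte operands that the memory system can deliver per second. Below is a minimal sketch of that arithmetic using the rounded table values; the function and dictionary names are illustrative only and not part of any benchmark code.

```python
# Break-even arithmetic intensity: floating point operations needed per
# 8-byte memory operation to keep both the arithmetic units and the
# memory bus busy. Caches are ignored, as in the text.

BYTES_PER_DOUBLE = 8

# (peak double-precision GFLOPs, memory bandwidth in GB/s), rounded table values
PLATFORMS = {
    "Xeon E5-2670": (333.0, 51.2),
    "Xeon Phi 5110P": (1010.0, 320.0),
    "Tesla K20X": (1317.0, 250.0),
}


def break_even_flops_per_access(peak_gflops, bandwidth_gb_s):
    """Return the FLOPs per 8-byte access at which compute and memory time match."""
    # Billions of 8-byte operands the memory system can deliver per second.
    giga_accesses_per_second = bandwidth_gb_s / BYTES_PER_DOUBLE
    return peak_gflops / giga_accesses_per_second


if __name__ == "__main__":
    for name, (gflops, bandwidth) in PLATFORMS.items():
        ratio = break_even_flops_per_access(gflops, bandwidth)
        print(f"{name}: {ratio:.1f} FLOPs per 8-byte access")
```

With the rounded peak figure of 1,317 GFLOPs, the GPU value comes out marginally below the 42.2 quoted in the text, which was presumably computed from a less heavily rounded peak.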
Further, applications are typically composed of sequential parts and parallel parts, and the overall application performance is heavily influenced by the fraction of the sequential part. If only half the application can be parallelised, the maximum achievable speedup by parallelisation is 2x, though it will be less than that in practice (Amdahl’s Law).
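Amdahl’s Law can be written as S = 1 / ((1 - p) + p / n), where p is the fraction of the runtime that can be parallelised and n is the number of parallel workers. The short sketch below evaluates that bound; the function name is illustrative, and the worker counts in the usage example are simply the logical core counts from the specification table, used here as an assumption for illustration.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when `parallel_fraction` of the runtime scales over `workers`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is unaffected; only the parallel part is divided among workers.
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Half of the application parallelisable: the bound approaches, but never reaches, 2x.
    for workers in (16, 240, 2688):
        print(workers, round(amdahl_speedup(0.5, workers), 3))
```

However many workers are added, a 50% parallel fraction keeps the result below 2x, which is why the size of the sequential part matters so much on processors with hundreds or thousands of cores.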
So let’s put these processors to the test for real-world applications.
Test Setup
To allow for a fair comparison of the processors, we’ve used low-level programming tools and techniques to tune the application performance to the maximum for each platform, such as OpenMP threading, vectorization pragmas and attributes, hand-tuned CUDA kernels, and native libraries (CUBLAS, MKL, etc.), and went through multiple iterations of profiler-assisted performance tuning. Thus there are three different code bases for each application, all hand-tuned. For the Xeon Phi, the applications have been executed natively on the coprocessor (not using offload mode).
The test systems were configured as follows:

Application 1: Monte-Carlo LIBOR Swaption Portfolio Pricer
Algorithm
Details of this algorithm have been previously described. For convenience, we briefly summarise it here:
A Monte-Carlo simulation is used to price a portfolio of LIBOR swaptions. Thousands of possible future development paths for the LIBOR interest rate are simulated using normally-distributed random numbers. For each of these Monte-Carlo paths, the value of the swaption portfolio is computed by applying a portfolio payoff function. The equations for computing the LIBOR rates and payoff are given in Prof. Mike Giles’ notes. Furthermore, the sensitivity of the portfolio value with respect to changes in the underlying interest rate is computed using an adjoint method. This sensitivity is a Greek, called λ, and its computation is detailed in the paper Monte Carlo evaluation of sensitivities in computational finance. Both the final portfolio value and the λ value are obtained by computing the mean over all per-path values. The algorithm is illustrated in the following graph:
Performance
The algorithm has been executed on all three platforms, in double precision, for varying numbers of paths. The portfolio consisted of 15 swaptions, simulated for 40 time steps. For reference, we’ve also included a straightforward sequential C++ implementation, running on a single core of the Sandy Bridge CPU. The results are listed in the table below:
| Paths | Sequential | Sandy Bridge CPU [1,2] | Xeon Phi [1,2] | Tesla GPU [2] |
| --- | --- | --- | --- | --- |
| 128K | 13,062ms | 694ms | 603ms | 146ms |
| 256K | 26,106ms | 1,399ms | 795ms | 280ms |
| 512K | 52,223ms | 2,771ms | 1,200ms | 543ms |

[1] The Sandy Bridge and Phi implementations make use of SIMD vector intrinsics.
[2] The MRG32K3a random generator from the cuRAND library (GPU) and MKL library (Sandy Bridge/Phi) were used.
For this application it can be clearly seen that NVIDIA’s Tesla GPU outperforms both other platforms significantly, being 5.1x faster than the multi-core dual Sandy Bridge CPU and 2.2x faster than the Xeon Phi (512K paths). The Xeon Phi is 2.3x faster than the Sandy Bridge. Moreover, compared to the sequential implementation, the optimized Sandy Bridge is 19x as fast, the Phi is 43.5x as fast, and the Kepler GPU is 96x as fast.
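The ratios quoted above are simple quotients of the measured runtimes. As a small sketch, they can be reproduced from the table as follows (the dictionary layout and function name are illustrative only):

```python
# Runtimes in milliseconds from the results table, keyed by number of paths.
TIMES_MS = {
    "128K": {"Sequential": 13062, "Sandy Bridge": 694, "Xeon Phi": 603, "Tesla GPU": 146},
    "256K": {"Sequential": 26106, "Sandy Bridge": 1399, "Xeon Phi": 795, "Tesla GPU": 280},
    "512K": {"Sequential": 52223, "Sandy Bridge": 2771, "Xeon Phi": 1200, "Tesla GPU": 543},
}


def speedup(paths, baseline, platform):
    """How many times faster `platform` is than `baseline` for a given path count."""
    row = TIMES_MS[paths]
    return row[baseline] / row[platform]


if __name__ == "__main__":
    for platform in ("Sandy Bridge", "Xeon Phi", "Tesla GPU"):
        print(platform, "vs. sequential:", round(speedup("512K", "Sequential", platform), 1))
    print("GPU vs. Sandy Bridge:", round(speedup("512K", "Sandy Bridge", "Tesla GPU"), 1))
    print("GPU vs. Xeon Phi:", round(speedup("512K", "Xeon Phi", "Tesla GPU"), 1))
```

Evaluating the same function for the smaller runs shows how the picture shifts with problem size: at 128K paths the Xeon Phi is only about 1.15x faster than the Sandy Bridge CPU, compared with 2.3x at 512K paths.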
This algorithm is embarrassingly parallel with completely independent Monte-Carlo paths, which suits parallel accelerator processors very well. The full application is parallelisable with no sequential parts and very little synchronization is required. In addition, the algorithm is clearly compute-bound, with a substantial amount of math and relatively little memory access. This is ideal for GPUs and a similar performance can be expected for other Monte-Carlo simulations with similar characteristics.
Application 2: Monte-Carlo Pricing of American Options
Algorithm
This application prices a vanilla American put option using a Monte-Carlo simulation. In contrast to European options, which can only be exercised at maturity, American options can be exercised at any time. This poses an additional complexity for Monte-Carlo pricers, as the option’s value for early exercise at each time step needs to be evaluated and compared to the expected value when not exercising it at this step. This is typically solved with the Longstaff-Schwartz algorithm, which involves computing regression coefficients across all paths to go from one time step to the previous one. Thus, the algorithm walks iteratively backwards in time, starting from the final time step, and involves a regression across paths. The Monte-Carlo paths are not independent and all time steps need to be solved iteratively, with parallelisation opportunities only within each step and for the initial asset price generation. The final price is the average of all paths at time step zero. The algorithm is illustrated in the following graph:
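To make the structure of the backward walk concrete, here is a minimal, unoptimised sketch of a Longstaff-Schwartz pricer in plain Python. It is not the benchmarked implementation. Several choices are assumptions made for illustration: geometric Brownian motion for the asset price, a quadratic polynomial as the three-coefficient regression basis, regression over in-the-money paths only, and all function names and parameter values.

```python
import math
import random


def solve_3x3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    scale = max(abs(v) for row in a for v in row)
    if scale == 0.0:
        return None
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12 * scale:
            return None  # (near-)singular system: caller skips the regression
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, 3):
            factor = m[row][col] / m[col][col]
            for k in range(col, 4):
                m[row][k] -= factor * m[col][k]
    x = [0.0, 0.0, 0.0]
    for row in (2, 1, 0):  # back substitution
        tail = sum(m[row][k] * x[k] for k in range(row + 1, 3))
        x[row] = (m[row][3] - tail) / m[row][row]
    return x


def fit_quadratic(xs, ys):
    """Least-squares fit of y ~ c0 + c1*x + c2*x^2 via the normal equations."""
    s = [sum(x ** k for x in xs) for k in range(5)]  # power sums of x
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    normal = [[s[i + j] for j in range(3)] for i in range(3)]
    return solve_3x3(normal, t)


def american_put_lsm(s0, strike, rate, sigma, maturity, steps, paths, seed=1):
    """Estimate an American put price with a basic Longstaff-Schwartz scheme."""
    rng = random.Random(seed)
    dt = maturity / steps
    drift = (rate - 0.5 * sigma * sigma) * dt
    vol = sigma * math.sqrt(dt)
    discount = math.exp(-rate * dt)

    # Forward pass: every path evolves independently of the others.
    prices = [[s0] * paths]
    for _ in range(steps):
        prev = prices[-1]
        prices.append([s * math.exp(drift + vol * rng.gauss(0.0, 1.0)) for s in prev])

    # Backward pass: one regression across paths per time step, strictly in sequence.
    cash = [max(strike - s, 0.0) for s in prices[steps]]
    for t in range(steps - 1, 0, -1):
        cash = [c * discount for c in cash]  # bring cash flows back to step t
        itm = [p for p in range(paths) if prices[t][p] < strike]
        if len(itm) < 3:
            continue
        xs = [prices[t][p] / strike for p in itm]  # scaled for better conditioning
        coeffs = fit_quadratic(xs, [cash[p] for p in itm])
        if coeffs is None:
            continue
        for p, x in zip(itm, xs):
            continuation = coeffs[0] + coeffs[1] * x + coeffs[2] * x * x
            exercise = strike - prices[t][p]
            if exercise > continuation:
                cash[p] = exercise  # exercise early on this path
    return discount * sum(cash) / paths  # average over all paths at time step zero


if __name__ == "__main__":
    print(american_put_lsm(100.0, 100.0, 0.05, 0.2, 1.0, steps=25, paths=4000))
```

The sketch makes the dependency pattern visible: the forward pass and the per-path exercise decision can be spread over many cores, but each `fit_quadratic` call needs data from all paths and must finish before the previous time step can be processed.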
Performance
The performance has been measured on all platforms for different numbers of Monte-Carlo paths, using 256 time steps each and 3 regression coefficients. We’re giving speedups vs. a sequential reference implementation running on a single core of the Ivy Bridge processor. The results are:
| Paths | Ivy Bridge CPU [1,2] | Xeon Phi [1,2] | Tesla GPU [2,3] |
| --- | --- | --- | --- |
| 128K | 41.8x | 20.9x | 52.9x |
| 256K | 48.5x | 31.0x | 71.8x |
| 512K | 45.2x | 39.7x | 86.6x |

[1] The Ivy Bridge and Phi implementations make use of OpenMP and vectorization pragmas/attributes as much as possible.
[2] The MRG32K3a random generator from the cuRAND library (GPU) and MKL library (Ivy Bridge/Phi) were used.
[3] Many small CUDA kernels need to be executed on the GPU, as parallelisation can only be done within each time step.
The Tesla GPU is roughly 2.2x to 2.5x faster than the Xeon Phi, and between about 1.3x and 1.9x faster than the CPU. The gap between CPU and GPU performance is significantly smaller than for the LIBOR swaption pricer above, and for 128K paths the results are comparable.
This outcome can be explained by the iterative nature of the algorithm and by the heavy memory operations involved. The CPU is optimised for general-purpose workloads, has larger caches, and can solve iterative problems very well.
Note: There is an approximate version of a Monte-Carlo pricer for American options that is better suited to parallel architectures and is expected to give better performance on both Xeon Phi and Tesla GPU (Glasserman, 2003: “Monte Carlo Methods in Financial Engineering”). It computes the regression coefficients in an initial step, using a much smaller number of paths, and applies these in a full simulation over all paths later. This full simulation does not require the regression step, i.e., each Monte-Carlo path is fully independent, and can therefore be fully parallelised across all paths. However, this method is controversial among quantitative analysts and gives slightly different results. Thus, we did not include it in these benchmarks.
Conclusions
We’ve seen that there is one processor that needs to be added to the picture: the commodity multi-core CPU. This is already a part of many server configurations, and for some applications, e.g., Monte-Carlo pricing of American options, it can give performance comparable to or better than an accelerator processor when optimized correctly. Between NVIDIA’s Kepler GPUs and Xeon Phi, the GPU wins for both of our test applications.
However, the results are close and we can expect this picture to change for other applications. Further, the Xeon Phi is brand-new (released 2013), while NVIDIA’s Tesla range has been around since 2007; the GPU is thus a more mature accelerator platform.
Hand-tuning the code for all three platforms for the highest performance requires significant expertise, time, and a deep knowledge of the target hardware. One way to sidestep this effort is to use the Xcelerit SDK. With just minor modifications to existing code, performance close to manually optimized code can be achieved without any hand-tuning. What’s more, a single code base can then run on multi-core, GPU, or any supported hybrid configuration.
[Update 18-Sept-2013: Follow-on interview with Xcelerit CTO Jörg Lotze on HPCWire]
Listen or read about it here
Jörg Lotze
Excellent analysis! However, to make it fair – you might want to compare $/perf too.
Xeon E5-2670 is $2.6K while Xeon Phi 5110P is $4.1K.
Good suggestion Alex, that might be the topic for another post!