NVIDIA recently released its new flagship Tesla GPU, the Tesla K40. This GPU features more memory, higher clock rates, and more CUDA cores than the previous top-end card, the K20X. But what performance improvements can we expect for financial applications? We’ve put the new card to the test and compared it to the K20X using a Monte-Carlo LIBOR swaption portfolio pricer, a real-world financial algorithm that we’ve already used in other benchmarks.
Hardware Comparison
The table below shows the key hardware differences between the two cards.
Specification | Tesla K20X | Tesla K40 |
SMX | 14 | 15 |
CUDA Cores | 2,688 | 2,880 |
Memory | 6 GB | 12 GB |
Core Frequency | 732 MHz | 745 MHz |
Max. Frequency | 784 MHz | 875 MHz |
Memory Bandwidth | 250 GB/s | 288 GB/s |
Apart from the higher base clock and the additional cores, the most notable difference is that the core frequency can be boosted significantly with the K40. That is, it can be set 17% higher (from 745 MHz to 875 MHz), as long as the device does not exceed its thermal limits (in which case the clock is throttled automatically). With the K20X, only a small clock boost of 7% is possible (from 732 MHz to 784 MHz). Further, the 12 GB of memory is good news for data-intensive applications.
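The boost percentages quoted above follow directly from the hardware table. The short Python sketch below recomputes them and also derives the K40/K20X ratios for cores times clock and for memory bandwidth. The "cores times clock" product is our own crude proxy for peak arithmetic throughput, not a figure from NVIDIA, and the dictionary and function names are purely illustrative.

```python
# Figures copied from the hardware table above.
K20X = {"cores": 2688, "base_mhz": 732, "max_mhz": 784, "bandwidth_gbs": 250}
K40 = {"cores": 2880, "base_mhz": 745, "max_mhz": 875, "bandwidth_gbs": 288}


def boost_headroom(card):
    """Relative clock increase when going from base to maximum frequency."""
    return card["max_mhz"] / card["base_mhz"] - 1.0


def core_clock_product(card, clock_key):
    """Cores times clock (MHz): a rough proxy for peak arithmetic throughput."""
    return card["cores"] * card[clock_key]


# Ratios of the K40 over the K20X.
base_ratio = core_clock_product(K40, "base_mhz") / core_clock_product(K20X, "base_mhz")
max_ratio = core_clock_product(K40, "max_mhz") / core_clock_product(K20X, "max_mhz")
bandwidth_ratio = K40["bandwidth_gbs"] / K20X["bandwidth_gbs"]

for name, card in (("K20X", K20X), ("K40", K40)):
    print(f"{name}: clock boost headroom {boost_headroom(card):.1%}")
print(f"cores x clock, base clock: {base_ratio:.2f}x")
print(f"cores x clock, max clock:  {max_ratio:.2f}x")
print(f"memory bandwidth:          {bandwidth_ratio:.2f}x")
```

These ratios are hardware upper-level estimates only; how much of them an application sees depends on whether it is limited by arithmetic or by memory traffic.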
Monte-Carlo LIBOR Swaption Portfolio Pricing
Details of this algorithm have been described here. For convenience, we briefly summarize it below.
A Monte-Carlo simulation is used to price a portfolio of LIBOR swaptions. Thousands of possible future development paths for the LIBOR interest rate are simulated using normally-distributed random numbers. For each of these Monte-Carlo paths, the value of the swaption portfolio is computed by applying a portfolio payoff function. The equations for computing the LIBOR rates and payoff are given here. Furthermore, the sensitivity of the portfolio value with respect to changes in the underlying interest rate is computed using an adjoint method. This sensitivity is a Greek, called λ, as detailed here. Both the final portfolio value and the λ value are obtained by computing the mean of all per-path values.
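To make the structure of such a pricer concrete, here is a deliberately simplified Python sketch. It is not the LIBOR market model used in the benchmark: it assumes a single lognormal rate per path, an undiscounted call-style payoff per strike, and a hand-written pathwise derivative instead of a general adjoint method. All names and parameter values (`price_toy_portfolio`, `rate0`, `vol`, `strikes`, and so on) are hypothetical. What it shares with the real algorithm is the overall shape: normal random numbers drive each path, a payoff function is applied per path, and both the value and the sensitivity are means over all paths.

```python
import math
import random


def price_toy_portfolio(n_paths, n_steps=80, rate0=0.05, vol=0.2, dt=0.25,
                        strikes=(0.04, 0.05, 0.06), seed=1234):
    """Return (mean payoff, mean sensitivity of the payoff to rate0).

    Toy model only: one lognormal rate per path, no discounting.
    """
    rng = random.Random(seed)
    drift = -0.5 * vol * vol * dt        # keeps the expected rate at rate0
    diffusion = vol * math.sqrt(dt)
    value_sum = 0.0
    sens_sum = 0.0
    for _ in range(n_paths):
        # Evolve one path with normally distributed random numbers.
        log_growth = 0.0
        for _ in range(n_steps):
            log_growth += drift + diffusion * rng.gauss(0.0, 1.0)
        growth = math.exp(log_growth)
        rate = rate0 * growth
        # Portfolio payoff: sum of call-style payoffs over all strikes.
        for strike in strikes:
            if rate > strike:
                value_sum += rate - strike
                # Pathwise derivative: d(rate - strike)/d(rate0) = growth.
                sens_sum += growth
    # Both results are means of the per-path values.
    return value_sum / n_paths, sens_sum / n_paths


value, sensitivity = price_toy_portfolio(n_paths=4096)
print(f"portfolio value: {value:.5f}, sensitivity: {sensitivity:.5f}")
```

Each path is independent of all others, which is why this kind of computation maps well onto thousands of GPU cores; only the final averaging requires a reduction across paths.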
The figure below illustrates the algorithm:
Note that for high numbers of paths, this algorithm is compute-bound on the GPUs used. That is, many arithmetic operations are performed for each memory transaction, so the arithmetic units stay busy and rarely have to wait for memory operations to complete. The added cores and higher clock speeds should therefore have a large impact on performance.
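One common way to reason about compute-bound versus memory-bound behaviour is a simple roofline estimate. The Python sketch below is an illustration under stated assumptions, not a measurement: it takes the cores, base clocks, and bandwidths from the hardware table, assumes two single-precision floating-point operations per core per cycle (one fused multiply-add), and uses made-up arithmetic intensities. The function names are our own.

```python
def peak_gflops(cores, clock_mhz, flops_per_core_per_cycle=2):
    """Peak single-precision throughput in GFLOP/s (assumes one FMA per cycle)."""
    return cores * clock_mhz * flops_per_core_per_cycle / 1000.0


def attainable_gflops(intensity, peak, bandwidth_gbs):
    """Roofline bound: limited by either arithmetic peak or memory traffic.

    intensity is the number of floating-point operations per byte moved.
    """
    return min(peak, intensity * bandwidth_gbs)


CARDS = {
    # name: (cores, base clock in MHz, memory bandwidth in GB/s)
    "K20X": (2688, 732, 250),
    "K40": (2880, 745, 288),
}

for name, (cores, clock, bandwidth) in CARDS.items():
    peak = peak_gflops(cores, clock)
    ridge = peak / bandwidth  # intensity above which a kernel is compute-bound
    print(f"{name}: peak {peak:.0f} GFLOP/s, ridge point {ridge:.1f} FLOP/byte")
    for intensity in (1.0, 10.0, 100.0):  # hypothetical kernel intensities
        bound = attainable_gflops(intensity, peak, bandwidth)
        print(f"  intensity {intensity:5.1f} FLOP/byte -> at most {bound:.0f} GFLOP/s")
```

A kernel whose intensity lies above the ridge point is bounded by arithmetic throughput, which is the regime described above for high path counts; below it, memory bandwidth sets the limit.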
Benchmark Setup
We compare the same application, implemented using the Xcelerit SDK 2.2.2, on two different systems. Their configuration is given in the following table:
Component | K20X System | K40 System |
CPU | 2 Intel Xeon E5-2677 (2.9GHz) | 2 Intel Xeon E5-2670 (2.6GHz) |
GPU | NVIDIA Tesla K20Xm | NVIDIA Tesla K40m |
OS | RHEL 6.2 (64bit) | RHEL 6.2 (64bit) |
RAM | 128GB | 128GB |
GPU driver | 319.72 | 319.58 |
CUDA Toolkit | 5.5 | 5.5 |
Host Compiler | GCC 4.4 | GCC 4.4 |
Note that we are only comparing the GPU performance, so the difference in the used CPUs has no significant effect on the outcome.
Performance
We measured the computation times for the Monte-Carlo LIBOR swaption portfolio pricer on one GPU of each system, pricing a portfolio of 15 swaptions over 80 time steps and using varying numbers of Monte-Carlo paths. The run time of the full algorithm (including random number generation, data transfers, core computation, and reduction) is compared for single and double precision in the graph below. All of these computation steps run on the GPU, so the difference in CPUs does not affect the benchmark results. Note that using the Xcelerit SDK, no code changes were required to efficiently exploit both GPUs.
We’ve also tested the same application with each card set to the maximum frequency it supports (raising the clock using the nvidia-smi tool). Here the improvement is larger, as can be seen in the graph below.
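For readers who want to reproduce the clock setting, the Python sketch below builds the corresponding nvidia-smi command lines without executing them. The `-ac` option of nvidia-smi sets application clocks as a memory,graphics pair in MHz, and `-q -d SUPPORTED_CLOCKS` lists the pairs a device accepts. The GPU index 0 and the memory clock of 3004 MHz are assumptions for a K40 and should be checked against the supported-clocks output on the actual machine; the helper names are our own.

```python
def supported_clocks_command(gpu_index):
    """Arguments that list the clock pairs a GPU supports."""
    return ["nvidia-smi", "-i", str(gpu_index), "-q", "-d", "SUPPORTED_CLOCKS"]


def application_clock_command(gpu_index, memory_mhz, graphics_mhz):
    """Arguments that set application clocks (memory,graphics in MHz)."""
    return ["nvidia-smi", "-i", str(gpu_index), "-ac",
            f"{memory_mhz},{graphics_mhz}"]


# Assumed values: GPU 0 is a K40, with 875 MHz as the maximum graphics
# clock from the hardware table and 3004 MHz as its memory clock.
query_cmd = supported_clocks_command(0)
set_cmd = application_clock_command(0, memory_mhz=3004, graphics_mhz=875)

print(" ".join(query_cmd))
print(" ".join(set_cmd))
```

The lists can be passed to `subprocess.run` on a machine with the NVIDIA driver installed; changing application clocks typically requires administrator rights, and `nvidia-smi -rac` resets them to the defaults.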
For better comparison, the following table shows the speedup factors of the K40 vs. the K20X GPU for different numbers of paths:
Paths | Speedup (def. clock, single) | Speedup (def. clock, double) | Speedup (max. clock, single) | Speedup (max. clock, double) |
16K | 1.15x | 1.17x | 1.21x | 1.21x |
256K | 1.15x | 1.17x | 1.21x | 1.26x |
1024K | 1.15x | 1.18x | 1.22x | 1.28x |
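The speedup factors in the table are run-time ratios: the K20X run time divided by the K40 run time for the same workload. The Python sketch below restates that definition and summarizes the table values per configuration, which makes it easy to see the range and how much each column varies with the number of paths. The data is copied from the table; the function and variable names are illustrative.

```python
# Speedup factors of the K40 over the K20X, copied from the table above.
PATHS = ["16K", "256K", "1024K"]
SPEEDUPS = {
    "def. clock, single": [1.15, 1.15, 1.15],
    "def. clock, double": [1.17, 1.17, 1.18],
    "max. clock, single": [1.21, 1.21, 1.22],
    "max. clock, double": [1.21, 1.26, 1.28],
}


def speedup(time_reference, time_new):
    """Speedup of the new device: reference run time over new run time."""
    return time_reference / time_new


def summarize(values):
    """Return (minimum, maximum, spread) of a list of speedup factors."""
    lowest, highest = min(values), max(values)
    return lowest, highest, highest - lowest


for config, values in SPEEDUPS.items():
    lowest, highest, spread = summarize(values)
    print(f"{config}: {lowest:.2f}x to {highest:.2f}x (spread {spread:.2f})")

best = max(max(values) for values in SPEEDUPS.values())
print(f"largest speedup in the table: {best:.2f}x")
```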
It is apparent that NVIDIA’s new Tesla K40 GPU gives a significant performance improvement for real-world applications: up to 1.28x in this example. The higher clock boost available with the K40 in particular makes it an attractive alternative to the K20X. Further, the speedup is roughly constant across different numbers of paths in this application, so even smaller loads benefit from the new GPU. Together with the doubled memory capacity, this makes a strong case for the Tesla K40 GPU.
Jörg Lotze