Benchmarks: NVIDIA Tesla K40 vs. K20X GPU

NVIDIA recently released its new flagship Tesla GPU, the Tesla K40. This GPU features more memory, higher clock rates, and more CUDA cores than the previous top-end card, the K20X. But what performance improvements can we expect for financial applications? We’ve put the new card to the test and compared it to the K20X using a Monte-Carlo LIBOR swaption portfolio pricer, a real-world financial algorithm that we’ve already used in other benchmarks.

[Figure: NVIDIA Tesla K40 GPU Accelerator]

Hardware Comparison

The table below shows the key hardware differences between the two cards.

                    Tesla K20X   Tesla K40
SMX                 14           15
CUDA Cores          2,688        2,880
Memory              6 GB         12 GB
Core Frequency      732 MHz      745 MHz
Max. Frequency      784 MHz      875 MHz
Memory Bandwidth    250 GB/s     288 GB/s

Apart from the higher clock speeds and the additional cores, the most notable difference is that the core frequency can be boosted significantly on the K40. It can be set 17% higher (from 745 MHz to 875 MHz), as long as the device does not exceed its thermal limits (in which case the clock is automatically throttled). With the K20X, only a small clock boost of 7% is possible (from 732 MHz to 784 MHz). Further, the 12 GB of memory is good news for data-intensive applications.
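The boost percentages quoted above follow directly from the frequencies in the hardware table. As a quick check, the short Python sketch below recomputes them; the dictionary layout and the function name are our own choices for illustration only.

```python
# Base and maximum core clocks in MHz, copied from the hardware table above.
CLOCKS_MHZ = {
    "Tesla K20X": {"base": 732, "max": 784},
    "Tesla K40": {"base": 745, "max": 875},
}


def boost_headroom_percent(base_mhz, max_mhz):
    """Return how far the maximum clock lies above the base clock, in percent."""
    return (max_mhz / base_mhz - 1.0) * 100.0


if __name__ == "__main__":
    for card, clocks in CLOCKS_MHZ.items():
        headroom = boost_headroom_percent(clocks["base"], clocks["max"])
        print(f"{card}: {headroom:.1f}% boost headroom")
```

This gives roughly 7% for the K20X and 17% for the K40, matching the figures in the text.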

Monte-Carlo LIBOR Swaption Portfolio Pricing

Details of this algorithm have been described here. For convenience, we briefly summarize it below.

A Monte-Carlo simulation is used to price a portfolio of LIBOR swaptions. Thousands of possible future development paths for the LIBOR interest rate are simulated using normally-distributed random numbers. For each of these Monte-Carlo paths, the value of the swaption portfolio is computed by applying a portfolio payoff function. The equations for computing the LIBOR rates and payoff are given here. Furthermore, the sensitivity of the portfolio value with respect to changes in the underlying interest rate is computed using Adjoint Algorithmic Differentiation (AD). This sensitivity is a Greek, called λ, as detailed here. Both the final portfolio value and the λ value are obtained by computing the mean of all per-path values.
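To make the structure of such a simulation concrete, here is a deliberately simplified Python sketch. It is not the LIBOR market model or the swaption portfolio payoff used in the benchmark: the log-normal dynamics of a single rate, the call-style payoff, and all parameter values (other than the 80 time steps) are assumptions chosen only to show the pattern of simulating paths from normal random numbers, applying a payoff per path, and averaging. Discounting and the adjoint sensitivity computation are omitted.

```python
import math
import random


def simulate_final_rate(rng, rate0, drift, vol, n_steps, dt):
    """Evolve one rate along one path using log-normal steps (toy dynamics)."""
    rate = rate0
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)  # normally distributed random number
        rate *= math.exp((drift - 0.5 * vol * vol) * dt + vol * math.sqrt(dt) * z)
    return rate


def toy_payoff(final_rate, strike):
    """Stand-in payoff; the real pricer applies a swaption portfolio payoff."""
    return max(final_rate - strike, 0.0)


def monte_carlo_mean(n_paths, seed=42, rate0=0.05, drift=0.0, vol=0.2,
                     n_steps=80, dt=0.25, strike=0.05):
    """Average the per-path payoff over all simulated paths."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        final_rate = simulate_final_rate(rng, rate0, drift, vol, n_steps, dt)
        total += toy_payoff(final_rate, strike)
    return total / n_paths


if __name__ == "__main__":
    print(monte_carlo_mean(n_paths=10_000))
```

Each path is independent of all others until the final averaging step, which is why this class of algorithm maps well onto the many cores of a GPU.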

The figure below illustrates the algorithm:

[Figure: LIBOR Swaption Portfolio Pricer algorithm]

Note that for high numbers of paths, this algorithm is compute-bound on the GPUs used. That is, a large number of arithmetic operations is performed for each memory transaction. This keeps the arithmetic units busy, so they do not need to wait for memory operations to complete. The added cores and higher clock speeds should therefore have a high impact on performance.
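For a compute-bound workload, a rough first-order expectation is that throughput scales with the number of cores multiplied by the clock frequency. The Python sketch below applies that simplification to the values from the hardware table. It is only a back-of-the-envelope estimate under that assumption: it ignores memory bandwidth, scheduling, and every other architectural effect, and the names used are illustrative.

```python
# CUDA core counts and core clocks (MHz) from the hardware comparison table.
K20X = {"cores": 2688, "base": 732, "max": 784}
K40 = {"cores": 2880, "base": 745, "max": 875}


def compute_ratio(new, old, clock):
    """Ratio of (cores x clock) between two cards at the given clock setting."""
    return (new["cores"] * new[clock]) / (old["cores"] * old[clock])


if __name__ == "__main__":
    print(f"default clock: {compute_ratio(K40, K20X, 'base'):.2f}x")
    print(f"maximum clock: {compute_ratio(K40, K20X, 'max'):.2f}x")
```

Under this simple model the K40 comes out about 1.09x ahead at default clocks and about 1.20x ahead at maximum clocks.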

Benchmark Setup

We compare the same application, implemented using the Xcelerit SDK 2.2.2, on two different systems. Their configuration is given in the following table:

                 K20X System                       K40 System
CPU              2x Intel Xeon E5-2677 (2.9 GHz)   2x Intel Xeon E5-2670 (2.6 GHz)
GPU              NVIDIA Tesla K20Xm                NVIDIA Tesla K40m
OS               RHEL 6.2 (64-bit)                 RHEL 6.2 (64-bit)
RAM              128 GB                            128 GB
GPU driver       319.72                            319.58
CUDA Toolkit     5.5                               5.5
Host Compiler    GCC 4.4                           GCC 4.4

Note that we are only comparing GPU performance, so the difference between the CPUs has no significant effect on the outcome.


We measured the computation times for the Monte-Carlo LIBOR swaption portfolio pricer on one GPU of each system, pricing a portfolio of 15 swaptions over 80 time steps and using varying numbers of Monte-Carlo paths. The run time of the full algorithm – including random number generation, data transfers, core computation, and reduction – is compared for single and double precision in the graph below. All of these computation steps run on the GPU, so the different CPUs do not affect the benchmark results. Note that using the Xcelerit SDK, no code changes were required to efficiently exploit both GPUs.

[Figure: K40 vs. K20X (default clock)]

We’ve also tested the same application with the maximum frequency that each card supports (raising the clock using the nvidia-smi tool). Here the improvement is higher, as can be seen in the graph below.
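The article does not list the exact commands that were used, so the following Python sketch only illustrates how application clocks are typically set with nvidia-smi. It builds the command lines without executing them. The `-i`, `-ac` (memory clock, graphics clock) and `-rac` (reset) options are standard nvidia-smi flags; the 3004 MHz memory clock shown for the K40 is an assumption and should be checked against the output of `nvidia-smi -q -d SUPPORTED_CLOCKS` on the target machine. Changing clocks usually requires administrator privileges.

```python
def application_clocks_command(memory_mhz, graphics_mhz, gpu_index=0):
    """Build the nvidia-smi argument list that sets application clocks."""
    return ["nvidia-smi", "-i", str(gpu_index),
            "-ac", f"{memory_mhz},{graphics_mhz}"]


def reset_clocks_command(gpu_index=0):
    """Build the nvidia-smi argument list that restores default clocks."""
    return ["nvidia-smi", "-i", str(gpu_index), "-rac"]


if __name__ == "__main__":
    # 875 MHz is the K40 maximum from the hardware table; the 3004 MHz
    # memory clock is an assumed value, so verify it with SUPPORTED_CLOCKS.
    print(" ".join(application_clocks_command(3004, 875)))
    print(" ".join(reset_clocks_command()))
```

The resulting lists can be passed to `subprocess.run` on a machine with the NVIDIA driver installed.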

[Figure: K40 vs. K20X (boosted clock)]

For better comparison, the following table shows the speedup factors of the K40 vs. the K20X GPU for different numbers of paths:

Speedup of the K40 over the K20X

Paths    Def. clock,   Def. clock,   Max. clock,   Max. clock,
         single        double        single        double
16K      1.15x         1.17x         1.21x         1.21x
256K     1.15x         1.17x         1.21x         1.26x
1024K    1.15x         1.18x         1.22x         1.28x
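To summarize the table at a glance, the Python sketch below computes the range of speedups for each clock and precision setting. The numbers are copied from the table above; the variable and function names are our own.

```python
# Speedup of the K40 over the K20X, copied from the table above.
COLUMNS = ("def. clock, single", "def. clock, double",
           "max. clock, single", "max. clock, double")
SPEEDUPS = {
    "16K": (1.15, 1.17, 1.21, 1.21),
    "256K": (1.15, 1.17, 1.21, 1.26),
    "1024K": (1.15, 1.18, 1.22, 1.28),
}


def column_range(column_index):
    """Return (min, max) speedup over all path counts for one configuration."""
    values = [row[column_index] for row in SPEEDUPS.values()]
    return min(values), max(values)


def best_speedup():
    """Return the largest speedup found anywhere in the table."""
    return max(max(row) for row in SPEEDUPS.values())


if __name__ == "__main__":
    for index, name in enumerate(COLUMNS):
        low, high = column_range(index)
        print(f"{name}: {low:.2f}x to {high:.2f}x")
    print(f"best overall: {best_speedup():.2f}x")
```

The spread within each configuration is small, with the largest variation (1.21x to 1.28x) in double precision at maximum clock.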

NVIDIA’s new Tesla K40 GPU gives a significant performance improvement for this real-world application: up to 1.28x over the K20X. In particular, the higher clock boost available with the K40 makes it an attractive alternative to the K20X. Further, the speedup is roughly constant across different numbers of paths in this application, so even smaller loads benefit from the new GPU. Together with the doubled memory capacity, this makes a strong case for the Tesla K40 GPU.

Jörg Lotze

Technical Lead and Co-Founder at Xcelerit