Large financial services companies have vast compute resources available, organised into computing grids (i.e., federations of compute resources to reach a common goal). But, as Murex’ Pierre Spatz explains, they don’t use them as supercomputers. “They use them as grids of small computers. It means that most of their code is in fact mono-threaded.” In contrast, the world’s top supercomputing sites often use clusters of machines in a very different – and more efficient – way. In the following, we will explain and demonstrate why – and illustrate how financial services firms can improve the efficiency of their existing compute grids. We only look at compute-intensive workloads, for example found in risk management and derivatives pricing.
Grid Usage in Finance
Most of the code used today in financial services is mono-threaded. This is for historical reasons, i.e., relying on legacy code from the days of single-core processors, and for simplicity of developing and maintaining the code. For large parallel computation tasks, for example when computing enterprise-wide Credit Valuation Adjustments (CVA), each processor core is given a subset of the bank’s portfolio to be valued under a set of the market scenarios. The individual results from each core are aggregated later, using files or network communication for gathering the individual results.
The supercomputers, used for example in physics or chemistry, are viewed as single machines running distributed applications. The applications themselves are multi-threaded and/or multi-process, using message passing, thread synchronisation, or other techniques to communicate between parallel tasks.
But, what difference does this make to performance and efficiency for financial applications?
Existing Grids Inefficiencies
Running individual mono-threaded processes on each processor core means that each process has its own memory space and all common data is replicated in each process. For example, considering a CVA computation, the instrument data and scenario data used by the processes is largely the same. However, each process stores its own copy. This results in a higher memory usage than necessary, which can be substantial in risk computations. It can also lead to congestion on the PCI bus, as lots of independent memory queries have to be served by the system.
As the individual processes tend to operate on larger chunks (i.e., parallelism with a coarse granularity), load-balancing between processes is difficult to achieve. One process might work on a smaller counterparty than another, causing it to finish computation earlier and leaving it’s processor core idle. This results in under-utilized compute resources.
For aggregation, the individual results need to be stored and gathered. Depending on the application, this might be done through saving intermediate data into files or through network communication. This can pose a significant overhead compared to multi-threaded implementations where the intermediate data can be shared in memory.
As a case study to illustrate the points above, we’ve chosen an application from the field of counterparty credit risk (CCR). It computes the expected exposure of a bank to a set of counterparties using a Monte-Carlo simulation. This is the basis for many CCR measures, such as CVA. The input set is composed of nearly 30,000 counterparties, containing from 1 to more than 1,000 instruments each. The counterparties are priced under 5,000 market scenarios for 50 time steps, in two independent Monte-Carlo simulations. The first uses regular market scenarios, and the second simulates under stressed scenarios assuming a major financial crisis.
To measure the performance of a typical grid implementation, individual processes are launched, each processing a subset of the counterparties. This is compared to a multi-threaded implementation of the same code, where all pricings are performed jointly within the same application. We’ve measured the overall runtime (aggregated over all nodes), the memory requirements per node, the processor core utilization, and the overhead of saving individual results into files for aggregation (the amount of Disk I/O needed in both cases). The test machine setup is as follows:
- 2 Intel Xeon E2-2670 (8 cores each, hyperthreading disabled)
- 64GB RAM
- RedHat Enterprise Linux 6.2 (64bit)
- GCC 4.4
And here are the results:
|Memory usage (MB)||19,268||2,372|
|Disk I/O (MB)||15,163||1,824|
It becomes clear that a multi-threaded parallel application is far superior to the grid approach traditionally used in banks. The multi-threaded application is 2.2x times faster than the same application running in individual processes. These performance enhancements are achieved with existing hardware but require a major redesign of the software if not employing the right tools. Even greater speedups can be achieved with hardware accelerators such as GPUs.
More details can be found in our whitepaper Accelerating CVA Computations the Easy Way.
Latest posts by Jörg Lotze (see all)
- White Paper: xVA – Coping with the Tsunami of Compute Load - March 4, 2014
- Benchmarks: NVIDIA Tesla K40 vs. K20X GPU - November 20, 2013
- Benchmarks: Intel Xeon Phi vs. NVIDIA Tesla GPU - September 4, 2013