Although in the TOP500 Supercomputing Sites list (November 2012) 5 out of the top 10 systems are based on IBM POWER processors, this platform is rarely used in the financial industry for compute-intensive analytics. We believe that with increasing computation demands and more and more data to be processed, e.g. in risk computations, the POWER platform might gain popularity in financial applications.
IBM’s POWER7+ Processor
The POWER7+ processor is a superscalar symmetric multiprocessor, i.e., a processor composed of 8 identical cores, each with 12 execution units capable of running multiple scalar (or vector) instructions in parallel. Let’s start with an overview of some of its hardware specifications:
|Max execution threads per core:|4|
|L2 cache per core:|256KB|
|Execution units per core:|12|
|Memory bandwidth per chip:|100GB/s sustained|
|Max POWER7+ sockets per system:|32|
This means that a single processor provides 32 hardware threads (8 cores with 4 threads each). Each core can execute up to 4 threads simultaneously in order to keep its 12 execution units highly utilised. These units are:
- 2 fixed point units,
- 2 load/store units,
- 4 double precision floating point units,
- 1 vector unit (Altivec/VSX – 128bit SIMD instruction set),
- 1 branch unit,
- 1 condition register unit, and
- 1 decimal floating point unit.
The processor employs aggressive out-of-order instruction execution, reordering instructions at run time on top of any reordering already done by the compiler. This allows all execution units to be used efficiently in parallel and helps to hide latencies (e.g. if one thread is waiting for data, another thread can run on the execution units).
Further, note that up to 32 sockets are possible within the same machine, resulting in a maximum of 1024 hardware threads. Several terabytes of RAM are supported too (e.g., up to 4TB on a Power 770 Server, or 16TB on a Power 795 Server). This shows that these systems are aimed at highly parallel applications running within a single machine. This avoids much of the expensive network communication typically required for applications with a high degree of parallelism, as fewer nodes in a grid are required. This is good news for tackling highly complex financial algorithms: POWER7+ based systems are designed to handle both large amounts of data and many parallel tasks.
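The thread counts quoted above follow from simple multiplication. As a quick sanity check, here is a small Python sketch; the constant and function names are ours, chosen for illustration only.

```python
# Figures taken from the hardware overview above.
CORES_PER_PROCESSOR = 8
THREADS_PER_CORE = 4
MAX_SOCKETS = 32


def hardware_threads(sockets):
    """Total hardware threads for a system with the given number of sockets."""
    return sockets * CORES_PER_PROCESSOR * THREADS_PER_CORE


print("Single processor:", hardware_threads(1))
print("Fully populated system:", hardware_threads(MAX_SOCKETS))
```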
The officially supported operating systems are IBM’s proprietary UNIX system, AIX, and Linux (Red Hat or SUSE). Both operating systems provide the familiar UNIX interface with the GNU tools and commonly used graphical interfaces. Supported compilers are the GNU Compiler Collection (GCC) and IBM’s own compilers, on both operating systems. In addition, most job schedulers and grid middleware products found in finance are available for the POWER platform. The Windows operating system is not supported.
Test Case: Monte-Carlo LIBOR Swaption Portfolio Pricing
As a test case, we chose a financial Monte-Carlo simulation. Details of this algorithm have been previously described. For convenience, we briefly summarise it below:
A Monte-Carlo simulation is used to price a portfolio of LIBOR swaptions. Thousands of possible future development paths for the LIBOR interest rate are simulated using normally-distributed random numbers. For each of these Monte-Carlo paths, the value of the swaption portfolio is computed by applying a portfolio payoff function. The equations for computing the LIBOR rates and payoff are given in Prof. Mike Giles’ notes. Furthermore, the sensitivity of the portfolio value with respect to changes in the underlying interest rate is computed using an adjoint method. This sensitivity is a Greek, called λ, and its computation is detailed in the paper Monte Carlo evaluation of sensitivities in computational finance. Both the final portfolio value and the λ value are obtained by computing the mean of all per-path values.
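To make the structure of such a simulation concrete, here is a minimal Python sketch. It is not the LIBOR model from Prof. Giles’ notes and not the Xcelerit implementation: it assumes a toy model with a single lognormal rate and a caplet-style payoff, and it uses a simple forward pathwise derivative in place of the adjoint method. All names and parameter values (`simulate_path`, `price`, `l0`, `sigma`, `strike`, and so on) are illustrative assumptions. What it shares with the real algorithm is the overall pattern: draw normally distributed random numbers, compute a per-path value and a per-path sensitivity, then average over all paths.

```python
import math
import random


def simulate_path(l0, sigma, maturity, strike, rng):
    """Return (payoff, d payoff / d l0) for one simulated path of the toy model."""
    z = rng.gauss(0.0, 1.0)  # normally distributed random number
    # Assumed toy dynamics: the terminal rate is lognormal around the initial rate l0.
    drift = -0.5 * sigma * sigma * maturity
    rate = l0 * math.exp(drift + sigma * math.sqrt(maturity) * z)
    payoff = max(rate - strike, 0.0)
    # Pathwise derivative: rate is proportional to l0, so d rate / d l0 = rate / l0.
    sensitivity = rate / l0 if rate > strike else 0.0
    return payoff, sensitivity


def price(num_paths, l0=0.05, sigma=0.2, maturity=1.0, strike=0.05, seed=42):
    """Estimate value and sensitivity as the means of the per-path results."""
    rng = random.Random(seed)  # fixed seed so that runs are reproducible
    total_payoff = 0.0
    total_sensitivity = 0.0
    for _ in range(num_paths):
        payoff, sensitivity = simulate_path(l0, sigma, maturity, strike, rng)
        total_payoff += payoff
        total_sensitivity += sensitivity
    return total_payoff / num_paths, total_sensitivity / num_paths


value, sensitivity = price(20000)
print(f"value = {value:.6f}, sensitivity = {sensitivity:.4f}")
```

Because every path is independent of the others, the loop body can be distributed across hardware threads, and only the final averaging requires combining results.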
The figure below illustrates the algorithm:
(Note that this algorithm has been implemented using a prototype version of the Xcelerit SDK for POWER systems.)
In the figure below, we’ve measured the speedup of the full application implemented using the Xcelerit SDK vs. a sequential reference running on one processor core. The test system had one POWER7+ processor (8 cores), with 4 threads per core, running Red Hat Enterprise Linux 6.2. The application was compiled with GCC 4.4.
As can be seen, the system achieves large performance gains even with a relatively low number of paths, starting from nearly 12x at 4K paths and improving to around 17x with 1M paths. Since the machine has only 8 cores, speedups of this size show that running 4 threads per core pays off here. Further, double precision scales better than single precision, which is good news for financial applications, where accuracy is key.
Benefit of Running Multiple Threads On Each Core
Looking at the results above, it is interesting to measure the effect of running multiple threads per core. We’ve run the same tests as above with different threads/core settings. The benefit is approximately constant over the number of Monte-Carlo paths chosen, so we give averages in the table below:
Clearly, 4 threads per core keep the processor highly utilised, making the best use of the 12 execution units and hiding latencies efficiently. As a result, the above application achieves on 8 cores a speedup that would normally be expected from twice as many cores.
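One way to read these results is as speedup per core and per hardware thread. The short Python sketch below does this arithmetic using the rounded speedups quoted earlier (about 12x at 4K paths and about 17x at 1M paths); the helper name `speedup_per_worker` is our own and the inputs are approximations, not exact measurements.

```python
CORES = 8              # cores in the test system's single POWER7+ processor
HARDWARE_THREADS = 32  # 8 cores with 4 threads each


def speedup_per_worker(speedup, workers):
    """Speedup over the sequential reference divided by the number of workers."""
    return speedup / workers


# Rounded speedups quoted in the text above.
for label, speedup in (("4K paths", 12.0), ("1M paths", 17.0)):
    per_core = speedup_per_worker(speedup, CORES)
    per_thread = speedup_per_worker(speedup, HARDWARE_THREADS)
    print(f"{label}: {per_core:.2f}x per core, {per_thread:.2f}x per hardware thread")
```

A value above 1.0 per core means that the 8 cores deliver more than 8 times the throughput of the single-threaded reference, which is only possible because each core runs several threads at once.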
The POWER7+ processor specifications and the performance achieved with a financial pricing algorithm suggest that POWER systems are well suited to financial applications. Their scalability is especially promising: up to 1024 hardware threads can be supported within a single system, and the large amounts of memory make this platform attractive for the highly demanding financial algorithms that are commonplace today.