MatLogica | Blog Post - CPU vs GPU

GPU vs CPU for Quantitative Finance: The AADC Advantage

Run existing CUDA code on CPU with comparable performance plus automatic adjoint differentiation. Escape GPU vendor lock-in while achieving top performance on modern CPUs.

Executive Summary: Rethinking GPU vs CPU for Quant Finance

GPUs have long been positioned as the silver bullet for financial organizations pursuing top computational performance. Chasing proclaimed 100-1000x GPU performance gains versus CPU, many organizations made costly transitions to CUDA—often nearly a decade ago.

The Reality Today

CPU-based systems have made huge leaps in parallel-compute capacity and are now comparable to, and sometimes more performant than, GPU systems when total cost of ownership is accounted for.

This article presents a solution that allows organizations with existing CUDA projects to assess the performance impact of transitioning from GPU to modern CPU systems. Using real-life CUDA examples, we demonstrate how existing GPU-only code can be adapted to run on both CPU and GPU from the same code base.

Key Benefits of CPU + AADC Approach:

  • Comparable performance to top-tier GPUs
  • Automatic Adjoint Differentiation (AAD) - unavailable on GPU
  • Larger memory capacity (64-512GB+ vs 8-48GB GPU)
  • Easier development - standard C++ vs CUDA
  • Lower total cost of ownership
  • Escape vendor lock-in

We've made available an open-source equity pricing model benchmark implemented for both CPU and GPU, helping practitioners extract top performance from both platforms for unbiased comparison.

The GPU Promise vs Reality

What Was Promised

When GPUs initially came to market, Mike Giles, a renowned figure in quantitative finance and an early GPU advocate, commented: "If there is a big enough market, someone will develop the product".

What Actually Happened

After a decade of technological development, the anticipated revolution for quantitative finance has not transpired. More recently, when Mike Giles was asked "Will GPUs have an impact in finance?" he replied:

"I think IT groups are keen, but quants are concerned about the effort involved... quants have enough to do without having to think about CUDA programming" [MG, pp.22,37]

Technical limitations emerged: Strict memory volume constraints (8-48GB), difficulty with complex control flow, inability to support AAD efficiently, and high development/maintenance costs.

The Real Performance Story

For years, the crucial factor favoring GPU was the ability to generate kernels for safe parallel processing. With proclaimed performance gains of 1000x, CFOs were persuaded to switch despite significant CUDA transition investment and higher software support costs.

However, according to the most trustworthy and impartial benchmark (STAC-A2), when hardware manufacturers put maximum software development effort into extracting top performance from their offerings, CPU and GPU actually run neck-and-neck.

True Cost Comparison: GPU vs CPU

Below we provide an approximate comparison of performance and operational costs for modern CPUs vs GPUs, using cloud costs as a proxy for ownership costs.

Figure: Cost-performance comparison, NVIDIA V100 GPU vs Intel Xeon Platinum 9282 CPU (cloud pricing as a proxy for ownership cost)

The Bottom Line

The average cost of a CPU TFLOP is roughly 30% higher than that of a GPU TFLOP.

Therefore, the maximum theoretical saving available to a CFO is about 30%, not 1000x!

Total Cost of Ownership Considerations

Cost Factor             | GPU/CUDA                   | CPU/AADC
------------------------|----------------------------|----------------------------
Hardware Cost           | Lower per TFLOP            | ~30% higher per TFLOP
Development Effort      | High (CUDA expertise)      | Low (standard C++)
Maintenance Cost        | High (specialized code)    | Lower (standard tooling)
Developer Availability  | Scarce (CUDA specialists)  | Abundant (C++ developers)
Vendor Lock-in          | High (NVIDIA/CUDA)         | Low (any x86 CPU)
Hardware Obsolescence   | Faster (GPU generations)   | Slower (longer CPU cycles)
Memory Capacity         | Limited (8-48GB)           | Abundant (64-512GB+)
AAD Support             | Difficult/Impossible       | Native (AADC)

Total Cost of Ownership often favors CPU when all factors are considered.

The Hidden Cost of Mindset Change

Performance gained in transitioning from CPU to GPU cannot be attributed solely to the hardware change. It also involves a costly, and easy-to-dismiss, change of mindset: the move from object-oriented code to a matrix-vector programming paradigm. That restructuring alone would yield performance improvements regardless of the chip architecture.

The GPU Vendor Lock-In Problem

Many large banks made long-term commitments to CUDA/GPU years ago. Some have realized this decision created a raft of new liabilities:

GPU/CUDA Liabilities

  • High maintenance costs
  • Scarcity of CUDA expertise
  • Difficulty recruiting talent
  • Hardware reaching end-of-service
  • Equipment becoming obsolete
  • Hitting technical limits for new business needs
  • Cannot support AAD efficiently

CPU/AADC Benefits

  • Standard C++ development
  • Abundant developer talent
  • Easy recruitment
  • Longer hardware cycles
  • More memory for larger problems
  • Native AAD support
  • No vendor lock-in

There is a way out of the vendor lock-in imposed by CUDA/NVIDIA migration.

The Solution: AADC for CPU

Until now, no CPU technology offered what CUDA provides on GPU: automatic generation of kernels that are safe to execute in parallel. MatLogica has developed AADC to fill this gap.

AADC vs CUDA Comparison

Capability          | CUDA (GPU)               | AADC (CPU)
--------------------|--------------------------|-------------------------------------
Source Code         | Requires CUDA code       | Uses existing C++ OO code
Kernel Generation   | Manual CUDA programming  | Automatic with minimal effort
Parallelization     | GPU threads              | Multi-threading + AVX vectorization
Memory              | 8-48GB                   | 64-512GB+
AAD Support         | Difficult/Limited        | Native (adjoint factor <1)
Development Effort  | High (CUDA expertise)    | Minimal (C++ developers)

Key Innovation: AADC can reuse existing CUDA analytics implemented for GPU and run them on scalable CPUs instead. With minimal changes, existing CUDA code adapts for AADC and executes using multi-threading and vectorization on CPU to get top performance.

How It Works: Using AADC with Existing CUDA Code

The Approach

CUDA mainly uses C++ syntax with extensions for parallel programming and GPU management. The AADC approach records scalable CPU kernels by executing original user code for one data sample (e.g., one Monte Carlo path).

Process:

  1. Get CUDA analytics to run with AADC for one data sample on CPU
  2. AADC records the full valuation graph
  3. Compiles scalable CPU kernels supporting safe multithreaded execution
  4. Takes advantage of AVX2/AVX512 native CPU vector arithmetic

Complexity: More complex problems (American Monte Carlo pricing, XVA, PDEs) are handled with a similar approach, with only modest increases in code complexity.

Implementation: Going Back to Host

To run existing CUDA code with AADC on CPU, we disable the CUDA extensions so that the code is compatible with a standard C++ compiler and ready for AADC kernel compilation.

New Compilation Unit for AADC on CPU

With the CUDA extensions disabled, kernel.cu compiles as normal C++.

Two-Step Process

  1. Start kernel compilation and execute analytics from kernel.cu. Explicitly identify model inputs and outputs.
  2. Use compiled CPU kernel instead of original function for subsequent Monte Carlo iterations, running simulation across multiple CPU cores with AVX2/AVX512 parallelization.

Example: Equity Derivative Pricing

We use a model for pricing single-asset Equity Linked Security options developed for CUDA/GPU to examine changes required for CPU execution using MatLogica's AADC.

Original Code: Taken from GitHub and inspired by QuantStart

Code Changes Required

  • Vanilla options: No changes to kernel.cu required
  • Path-dependent ELS option: Minimal changes needed
  • GPU/CPU compatibility: Maintained throughout

Availability: Source code available on request. Builds on Linux and Windows. Located in "CUDA_Example/AADC_Enabled/one-asset ELS/code" with user manual as "Manual.pdf".

Performance Benchmark Results

Let's compare performance of pricing a one-asset Equity Linked Security option using CUDA/GPU and the AADC-enabled version of CUDA code on CPU.

Test Case Details

  • Option Type: Path-dependent Equity Linked Security
  • Time Steps: 1,080
  • Monte Carlo Simulations: 100,000
  • Measurement: Process simulation and pricing logic only (random number generation excluded)

Figure: Execution performance comparison, NVIDIA V100 GPU vs Intel Xeon Platinum 9282 CPU with AADC

*Results are preliminary and being validated by hardware vendors.

Key Finding

We get comparable performance between top-of-the-line GPU/CUDA and AADC-adapted CUDA code on CPU.

The changes required to the CUDA code are minimal. Apart from integrating MatLogica's AADC, no additional optimizations were performed.

Additional CPU Benefits Not Shown in Benchmark

  • AAD Support: Automatic Greeks calculation (impossible on GPU)
  • Memory: Can handle much larger problems (512GB+ vs 48GB)
  • Complex Control Flow: Better support for sophisticated models
  • Development: Standard C++ tooling and debugging

Open Source Validation

This example is open source—anyone can run it themselves and recommend improvements for both CPU and GPU. We will update results as we receive feedback from hardware manufacturers and developers.

Conclusions and Recommendations

CUDA Is Not a One-Way Street

With minimal changes, it's possible to run CUDA code on scalable multi-core CPUs and gain AAD as an additional benefit.

We've shown it's reasonably simple to support existing CUDA projects for dual CPU and GPU builds. This allows organizations to make informed decisions about hardware options and choose the best option depending on business needs.

Beyond Embarrassingly Parallel Problems

In this post, we used an example of an embarrassingly parallel pricing method. MatLogica has solutions for a wide range of more complex models typical in quantitative finance:

  • Longstaff-Schwartz pricing of callable products
  • XVA calculations (CVA, DVA, FVA)
  • PDE solvers for option pricing
  • American/Bermudan options with early exercise
  • Model calibration with implicit functions

Decision Framework

Choose GPU If...

  • Simple matrix operations only
  • No need for AAD/Greeks
  • Memory requirements <8GB
  • Have CUDA expertise
  • GPU infrastructure exists

Choose CPU + AADC If...

  • Need AAD for Greeks
  • Large memory requirements
  • Complex control flow
  • Standard C++ development preferred
  • Want to escape vendor lock-in
  • Comparable performance needed

Migration Path

For organizations with existing CUDA investments:

  1. Test AADC adaptation with minimal code changes
  2. Benchmark performance on your actual models
  3. Support dual builds (CPU and GPU simultaneously)
  4. Gradually transition as benefits become clear
  5. Leverage AAD benefits impossible on GPU

Test Your CUDA Code on CPU

Get a performance benchmark of your CUDA code running on modern CPUs with AADC

Request Benchmark

info@matlogica.com
