MatLogica | Blog Post - CPU vs GPU

GPU vs CPU for Quantitative Finance: The AADC Advantage

Run existing CUDA code on CPU with comparable performance plus automatic adjoint differentiation. Escape GPU vendor lock-in while achieving top performance on modern CPUs.

Executive Summary: Rethinking GPU vs CPU for Quant Finance

GPUs have long been positioned as the silver bullet for financial organizations pursuing top computational performance. Chasing proclaimed 100-1000x GPU performance gains versus CPU, many organizations made costly transitions to CUDA—often nearly a decade ago.

The Reality Today

CPU-based systems have made huge leaps in parallel-compute capacity and are now comparable to, and sometimes more performant than, GPU systems when total cost of ownership is accounted for.

This article presents a solution that allows organizations with existing CUDA projects to assess the performance impact of transitioning from GPU to modern CPU systems. Using real-life CUDA examples, we demonstrate how existing GPU-only code can be adapted to run on both CPU and GPU from the same code base.

Key Benefits of CPU + AADC Approach:

  • Comparable performance to top-tier GPUs
  • Automatic Adjoint Differentiation (AAD) - unavailable on GPU
  • Larger memory capacity (64-512GB+ vs 8-48GB GPU)
  • Easier development - standard C++ vs CUDA
  • Lower total cost of ownership
  • Escape vendor lock-in

We've made available an open-source equity pricing model benchmark implemented for both CPU and GPU, helping practitioners extract top performance from both platforms for unbiased comparison.

The GPU Promise vs Reality

What Was Promised

When GPUs initially came to market, Mike Giles, a renowned figure in quantitative finance and an early GPU advocate, commented: "If there is a big enough market, someone will develop the product".

What Actually Happened

After a decade of technological development, the anticipated revolution for quantitative finance has not transpired. More recently, when Mike Giles was asked "Will GPUs have an impact in finance?" he replied:

"I think IT groups are keen, but quants are concerned about the effort involved... quants have enough to do without having to think about CUDA programming" [MG, pp.22,37]

Technical limitations emerged: Strict memory volume constraints (8-48GB), difficulty with complex control flow, inability to support AAD efficiently, and high development/maintenance costs.

The Real Performance Story

For years, the crucial factor favoring GPU was the ability to generate kernels for safe parallel processing. With proclaimed performance gains of 1000x, CFOs were persuaded to switch despite significant CUDA transition investment and higher software support costs.

However, according to the most trustworthy and impartial benchmark (STAC-A2), when hardware manufacturers put maximum software development effort into extracting top performance from their offerings, CPU and GPU actually run neck-and-neck.

True Cost Comparison: GPU vs CPU

Below we provide an approximate comparison of performance and operational costs for modern CPUs vs GPUs, using cloud costs as a proxy for ownership costs.

Figure: Cost-performance comparison, NVIDIA V100 GPU vs Intel Xeon Platinum 9282 CPU (cloud pricing as a proxy for ownership cost)

The Bottom Line

The average cost of a CPU TFLOP is roughly 30% higher than that of a GPU TFLOP.

Therefore, the maximum theoretical saving available to a CFO is about 30%, not 1000x!

Total Cost of Ownership Considerations

Cost Factor             | GPU/CUDA                   | CPU/AADC
------------------------|----------------------------|----------------------------
Hardware Cost           | Lower per TFLOP            | ~30% higher per TFLOP
Development Effort      | High (CUDA expertise)      | Low (standard C++)
Maintenance Cost        | High (specialized code)    | Lower (standard tooling)
Developer Availability  | Scarce (CUDA specialists)  | Abundant (C++ developers)
Vendor Lock-in          | High (NVIDIA/CUDA)         | Low (any x86 CPU)
Hardware Obsolescence   | Faster (GPU generations)   | Slower (longer CPU cycles)
Memory Capacity         | Limited (8-48GB)           | Abundant (64-512GB+)
AAD Support             | Difficult/Impossible       | Native (AADC)

Total Cost of Ownership often favors CPU when all factors are considered.

The Hidden Cost of Mindset Change

Performance gained in transitioning from CPU to GPU cannot be attributed solely to the hardware change. It also involves a costly, and easy-to-dismiss, change of mindset: the move from object-oriented code to a matrix-vector programming paradigm. That restructuring alone would yield performance improvements regardless of the chip architecture.

The GPU Vendor Lock-In Problem

Many large banks made long-term commitments to CUDA/GPU years ago. Some have realized this decision created a raft of new liabilities:

GPU/CUDA Liabilities

  • High maintenance costs
  • Scarcity of CUDA expertise
  • Difficulty recruiting talent
  • Hardware reaching end-of-service
  • Equipment becoming obsolete
  • Hitting technical limits for new business needs
  • Cannot support AAD efficiently

CPU/AADC Benefits

  • Standard C++ development
  • Abundant developer talent
  • Easy recruitment
  • Longer hardware cycles
  • More memory for larger problems
  • Native AAD support
  • No vendor lock-in

There is a way out of the vendor lock-in imposed by CUDA/NVIDIA migration.

The Solution: AADC for CPU

Until now, no CPU technology offered what CUDA provides on GPU: automatic generation of kernels that are safe to execute in parallel. MatLogica has developed AADC to fill this gap.

AADC vs CUDA Comparison

Capability          | CUDA (GPU)               | AADC (CPU)
--------------------|--------------------------|-------------------------------------
Source Code         | Requires CUDA code       | Uses existing C++ OO code
Kernel Generation   | Manual CUDA programming  | Automatic with minimal effort
Parallelization     | GPU threads              | Multi-threading + AVX vectorization
Memory              | 8-48GB                   | 64-512GB+
AAD Support         | Difficult/Limited        | Native (adjoint factor <1)
Development Effort  | High (CUDA expertise)    | Minimal (C++ developers)

Key Innovation: AADC can reuse existing CUDA analytics implemented for GPU and run them on scalable CPUs instead. With minimal changes, existing CUDA code adapts for AADC and executes using multi-threading and vectorization on CPU to get top performance.

How It Works: Using AADC with Existing CUDA Code

The Approach

CUDA mainly uses C++ syntax with extensions for parallel programming and GPU management. The AADC approach records scalable CPU kernels by executing original user code for one data sample (e.g., one Monte Carlo path).

Process:

  1. Get CUDA analytics to run with AADC for one data sample on CPU
  2. AADC records the full valuation graph
  3. Compiles scalable CPU kernels supporting safe multithreaded execution
  4. Takes advantage of AVX2/AVX512 native CPU vector arithmetic

Complexity: More complex problems (American Monte Carlo pricing, XVA, PDEs) are handled with a similar approach, with only modest increases in code complexity.

Implementation: Going Back to Host

To run existing CUDA code with AADC on CPU, we disable the CUDA extensions so that the code is compatible with a standard C++ compiler and ready for AADC kernel compilation.

New Compilation Unit for AADC on CPU

With the CUDA extensions disabled, kernel.cu compiles as normal C++.

Two-Step Process

  1. Start kernel compilation and execute analytics from kernel.cu. Explicitly identify model inputs and outputs.
  2. Use compiled CPU kernel instead of original function for subsequent Monte Carlo iterations, running simulation across multiple CPU cores with AVX2/AVX512 parallelization.

Example: Equity Derivative Pricing

We use a model for pricing single-asset Equity Linked Security options developed for CUDA/GPU to examine changes required for CPU execution using MatLogica's AADC.

Original Code: Taken from GitHub and inspired by QuantStart

Code Changes Required

  • Vanilla options: No changes to kernel.cu required
  • Path-dependent ELS option: Minimal changes needed
  • GPU/CPU compatibility: Maintained throughout

Availability: Source code available on request. Builds on Linux and Windows. Located in "CUDA_Example/AADC_Enabled/one-asset ELS/code" with user manual as "Manual.pdf".

Performance Benchmark Results

Let's compare performance of pricing a one-asset Equity Linked Security option using CUDA/GPU and the AADC-enabled version of CUDA code on CPU.

Test Case Details

  • Option Type: Path-dependent Equity Linked Security
  • Time Steps: 1,080
  • Monte Carlo Simulations: 100,000
  • Measurement: Process simulation and pricing logic only (random number generation excluded)

Figure: Execution performance comparison, NVIDIA V100 GPU vs Intel Xeon Platinum 9282 CPU with AADC

*Results are preliminary and being validated by hardware vendors.

Key Finding

We get comparable performance between top-of-the-line GPU/CUDA and AADC-adapted CUDA code on CPU.

The changes required to the CUDA code are minimal. Apart from integrating MatLogica's AADC, no additional optimizations were performed.

Additional CPU Benefits Not Shown in Benchmark

  • AAD Support: Automatic Greeks calculation (impossible on GPU)
  • Memory: Can handle much larger problems (512GB+ vs 48GB)
  • Complex Control Flow: Better support for sophisticated models
  • Development: Standard C++ tooling and debugging

Open Source Validation

This example is open source—anyone can run it themselves and recommend improvements for both CPU and GPU. We will update results as we receive feedback from hardware manufacturers and developers.

Conclusions and Recommendations

CUDA Is Not a One-Way Street

With minimal changes, it's possible to run CUDA code on scalable multi-core CPUs and gain AAD as an additional benefit.

We've shown it's reasonably simple to support existing CUDA projects for dual CPU and GPU builds. This allows organizations to make informed decisions about hardware options and choose the best option depending on business needs.

Beyond Embarrassingly Parallel Problems

In this post, we used an example of an embarrassingly parallel pricing method. MatLogica has solutions for a wide range of more complex models typical in quantitative finance:

  • Longstaff-Schwartz pricing of callable products
  • XVA calculations (CVA, DVA, FVA)
  • PDE solvers for option pricing
  • American/Bermudan options with early exercise
  • Model calibration with implicit functions

Decision Framework

Choose GPU If...

  • Simple matrix operations only
  • No need for AAD/Greeks
  • Memory requirements <8GB
  • Have CUDA expertise
  • GPU infrastructure exists

Choose CPU + AADC If...

  • Need AAD for Greeks
  • Large memory requirements
  • Complex control flow
  • Standard C++ development preferred
  • Want to escape vendor lock-in
  • Comparable performance needed

Migration Path

For organizations with existing CUDA investments:

  1. Test AADC adaptation with minimal code changes
  2. Benchmark performance on your actual models
  3. Support dual builds (CPU and GPU simultaneously)
  4. Gradually transition as benefits become clear
  5. Leverage AAD benefits impossible on GPU

Test Your CUDA Code on CPU

Get a performance benchmark of your CUDA code running on modern CPUs with AADC

Request Benchmark

info@matlogica.com
