Files
AMSS-NCKU/code_modification_readme.md
2026-05-20 11:16:56 +08:00

14 KiB
Raw Permalink Blame History

Code Modification Readme — asc26-plan-a

Baseline branch: baseline Target branch: asc26-plan-a Date: 2026-05-19


Overview

This branch delivers two major performance overhauls to the AMSS-NCKU numerical relativity codebase:

  1. TwoPunctureABE Multithreading — OpenMP parallelization of the TwoPunctures initial-data solver, combined with a BLAS-driven spectral derivative engine, MKL/LAPACK integration, and C/C++ rewrites of hot Fortran kernel subroutines.

  2. ABE GPU Rewrite — Complete replacement of the legacy bssn_gpu_class abstraction layer with direct, monolithic CUDA kernels for BSSN, Z4C, and Shell-Patch evolution, plus GPU-resident state management and CUDA-aware MPI.

Total diff: 84 files changed, +57,919 / 33,795 lines.


Part 1 — TwoPunctureABE Multithreading

1.1 Spectral Derivative Engine: BLAS Matrix-Multiplication Rewrite

Files: AMSS_NCKU_source/TwoPunctures.C, AMSS_NCKU_source/TwoPunctures.h

The original Derivatives_AB3 computed spectral derivatives (Chebyshev in A/B, Fourier in phi) with nested scalar loops over every grid point. The new Derivatives_AB3_MatMul expresses all derivatives as matrix-matrix products over pencil-shaped data slices, dispatched to Intel MKL cblas_dgemm.

  • Precomputed derivative matricesprecompute_derivative_matrices() builds D1_A, D2_A, D1_B, D2_B (Chebyshev collocation derivative matrices) and DF1_phi, DF2_phi (Fourier derivative matrices) once at construction time.
  • Pencil-based GEMM — data is gathered into 2D arrays where one dimension is the spectral direction and the other enumerates all remaining degrees of freedom (variables × orthogonal grid indices). Each derivative direction becomes a single cblas_dgemm call. The pure derivatives (d/dA, d/dB, d/dphi) and all mixed derivatives (d²/dAdB, d²/dAdphi, d²/dBdphi) are computed this way.
  • build_cheb_deriv_matrices / build_fourier_deriv_matrices — construct the standard Chebyshev and Fourier collocation derivative matrices.

1.2 OpenMP Parallelization of TwoPunctures

Files: AMSS_NCKU_source/TwoPunctures.C, AMSS_NCKU_source/TwoPunctures.h

Three critical regions are parallelized:

Region Directive Strategy
F_of_v residual evaluation #pragma omp parallel for collapse(3) schedule(dynamic,1) Each (i,j,k) thread stack-allocates its own l_U (derivs struct) and l_values[] to eliminate heap contention and data races
relax_omp line relaxation #pragma omp parallel for schedule(static) over k-slices Alternating be/al sweeps, each thread uses pre-allocated per-thread Thomas-algorithm workspace (ws_*_be[tid], ws_*_al[tid])
LineRelax_be_omp / LineRelax_al_omp Called from relax_omp with explicit tid Thread-safe tridiagonal solves using the thread's private scratch arrays

Per-thread workspaceallocate_workspace() allocates independent Thomas-algorithm buffers (diag, e, f, b, x, l, u, d, y) for each OpenMP thread in both be and al directions, avoiding lock contention in the inner Newton iteration.

1.3 MKL BLAS / LAPACK Integration

Files: AMSS_NCKU_source/TwoPunctures.C, AMSS_NCKU_source/gaussj.C

Function Old New Benefit
norm2 scalar sqrt(sum(v[i]²)) loop cblas_dnrm2 BLAS Level 1, SIMD-optimized
scalarproduct scalar sum(v[i]*w[i]) loop cblas_ddot BLAS Level 1, SIMD-optimized
gaussj hand-written Gauss-Jordan elimination (~100 lines) LAPACKE_dgesv + LAPACKE_dgetrf + LAPACKE_dgetri LAPACK LU with partial pivoting, asymptotically faster for the n~50 matrix sizes used in spectral elliptic solves

1.4 C/C++ Rewrite of Hot Fortran Kernels

Files (new):

  • AMSS_NCKU_source/fderivs_c.C (167 lines) — first derivatives, 2nd/4th order
  • AMSS_NCKU_source/fdderivs_c.C (332 lines) — second derivatives, 2nd/4th order
  • AMSS_NCKU_source/kodiss_c.C (117 lines) — Kreiss-Oliger dissipation
  • AMSS_NCKU_source/lopsided_c.C (255 lines) — lopsided advection
  • AMSS_NCKU_source/lopsided_kodis_c.C (248 lines) — fused advection + dissipation
  • AMSS_NCKU_source/rungekutta4_rout_c.C (212 lines) — RK4 time-stepper
  • AMSS_NCKU_source/bssn_rhs_c.C (1,287 lines) — full BSSN RHS kernel
  • AMSS_NCKU_source/z4c_rhs_c.C (725 lines) — full Z4C RHS kernel

Every C rewrite follows a consistent optimization pattern:

  • 64-byte aligned allocation (aligned_alloc(64, ...)) for AVX-512 compatibility.
  • Static buffer caching — scratch arrays (e.g., the padded fh ghost-zone buffer) persist across calls via a static pointer + capacity check, avoiding repeated malloc/free.
  • Two-pass strategy — 2nd-order finite differences are computed on the full domain first, then the interior sub-volume is overwritten with 4th-order stencils. This eliminates the per-point if/elseif branching of the original Fortran.
  • Non-overlapping shell pass — in fdderivs_c.C, the 2nd-order pass skips points that will be overwritten by the 4th-order pass, avoiding redundant computation.

1.5 Fortran Kernel Fusion: lopsided_kodis

File: AMSS_NCKU_source/lopsidediff.f90

A new lopsided_kodis subroutine fuses the advection (lopsided) and Kreiss-Oliger dissipation (kodis) operators into a single pass over the grid. Both operators previously called symmetry_bd independently to fill ghost zones — the fused version calls it once and shares the padded fh array, halving ghost-zone fill overhead for this hot path.

1.6 Build System for TwoPunctures

Files: AMSS_NCKU_source/makefile, AMSS_NCKU_source/makefile.inc

  • TP_OPTFLAGS — TwoPunctures and TwoPunctureABE are compiled with a dedicated, more aggressive optimization flag set (-O3 -march=znver5 -fp-model fast=2 -fma -ipo) separate from the main code.
  • USE_CXX_KERNELS — selects between the C rewrites and the original Fortran kernels (bssn_rhs.f90, etc.) for the CPU path.
  • USE_CXX_RK4 — independently selects between the C and Fortran RK4 stepper.
  • Intel oneTBB allocator (libtbbmalloc.so) — replaces the system malloc with a scalable thread-safe allocator, critical for multi-threaded TwoPunctures performance.
  • PGO supportPGO_MODE=opt|instrument for profile-guided optimization (currently disabled after testing showed negative benefit).
  • Toolchains — Intel oneAPI (TOOLCHAIN=intel, default) and NVIDIA HPC SDK (TOOLCHAIN=nvhpc).

Part 2 — ABE GPU Rewrite

2.1 Architecture: From Class Wrapper to Direct CUDA Kernels

The old GPU path (baseline) was organized as:

bssn_gpu_class.C/h    — C++ class managing GPU state and kernel launches
bssn_step_gpu.C       — RK4 stepper with per-substep GPU/CPU synchronisation
bssn_gpu.cu           — CUDA kernel implementations called through the class

The new GPU path (asc26-plan-a) replaces all of the above with:

bssn_rhs_cuda.cu/h    — 10,381-line monolithic CUDA BSSN RHS kernel
z4c_rhs_cuda.cu/h     —  7,909-line monolithic CUDA Z4C RHS kernel
fd_cuda_helpers.cuh   —    412-line shared finite-difference device functions
bssn_gpu_rhs_ss.cu    —  (retained, lightly modified) Shell-Patch GPU RHS

Key architectural differences:

  • The old bssn_gpu_class managed GPU memory through a C++ class with explicit allocate/free/sync methods scattered across the time-stepping logic. The new kernels operate directly on raw device pointers with a clear resident/transient memory model.
  • The old code launched many small kernels (one per derivative or algebraic term). The new code is a single monolithic kernel per formulation — all 24 BSSN evolution variables are computed in one launch with on-the-fly finite differences, eliminating kernel-launch latency and intermediate global-memory round-trips.
  • The old bssn_step_gpu.C performed per-substep GPU→CPU downloads for boundary conditions and analysis. The new model supports GPU-resident state — variables stay on device across timesteps unless explicitly requested.

2.2 GPU-Resident State Model

A central theme across ~20 commits is the "resident-sync" optimization:

Commit What it does
22c1e71 Optimize BSSN CUDA resident state and CUDA-aware MPI
090d865 Optimize BSSN CUDA state transfers
68eab03 Add opt-in BSSN CUDA resident AMR path
1ee229a Add keyed BSSN CUDA resident banks
18e9c9c Optimize BSSN CUDA resident AMR prolong
8486532 Add resident BSSN GPU point interpolation
b1974ef Stabilize device AMR restrict across regrid
ae64a22 Complete BSSN-EScalar CUDA resident transfers
83afaf1 Skip zero EM resident downloads
35b6cef Broaden cached CUDA sync paths

The resident model works as follows:

  • BSSN grid functions are allocated once on the GPU and persist across timesteps.
  • Inter-processor ghost-zone exchanges use CUDA-aware MPI — MPI directly reads/writes device memory without staging through host buffers.
  • AMR prolongation and restriction operate directly on device memory.
  • Boundary conditions and analysis routines download only the specific slices/points they need, not the full grid.
  • When EM fields are zero (pure-gravity runs), EM downloads are skipped entirely.

2.3 Z4C and Shell-Patch GPU Acceleration

Files: AMSS_NCKU_source/z4c_rhs_cuda.cu, AMSS_NCKU_source/bssn_gpu_rhs_ss.cu

  • The Z4C constraint-damped formulation gets its own 7,909-line monolithic CUDA kernel (z4c_rhs_cuda.cu), matching the BSSN kernel's architecture.
  • Shell-Patch GPU acceleration — the spherical shell boundary patches now compute on GPU with dedicated kernels in bssn_gpu_rhs_ss.cu.
  • Z4C + Shell-Patch can coexist on GPU (Phase 3 commits).
  • A CPU-side wrapper (z4c_rhs_c.C) handles the trKd + TZ_rhs contribution that remains on CPU, minimizing GPU/CPU traffic.

2.4 Finite-Difference Order Flexibility

File: AMSS_NCKU_source/fd_cuda_helpers.cuh

Shared device functions for finite-difference stencils support 2nd, 4th, 6th, and 8th order at compile time via preprocessor switches. This enables:

  • Per-run selection of convergence order without recompilation of the full kernel.
  • 8th-order AMR transfers (1064a68) for BSSN-EM.
  • 6th-order optimized AMR stencils (0076b3c).

2.5 GPU Diagnostics and Quality Assurance

File: AMSS_NCKU_GPUCheck.py (559 lines, new)

A Python-based GPU correctness verification tool that compares GPU and CPU evolution outputs. The GPU build pipeline includes optional kernel profiling switches (7683459) for performance debugging.

GPU-specific bug fixes:

  • f226498 — Fix CUDA AMR symmetry drift (incorrect ghost-zone handling under symmetry boundary conditions)
  • 2317e4a — Fix BSSN GPU resident AMR sync default
  • fea2dcc — Fix BSSN-EM runtime crash
  • dd0e20d — Fix BSSN-EScalar CUDA boundary and scalar KO
  • 5eb4994 — Fix AHF crash under CUDA resident-sync mode

2.6 Build Integration

Makefile switches:

  • USE_CUDA_BSSN=0/1 — route BSSN RHS through GPU or CPU
  • USE_CUDA_Z4C=0/1 — route Z4C RHS through GPU or CPU
  • CUDA_ARCH=sm_80 — target NVIDIA Ampere (A100)
  • NVHPC_ROOT — path to NVIDIA HPC SDK for the nvcc compiler wrapper
  • CUDA compilation flags: -O3 --ptxas-options=-v -arch=$(CUDA_ARCH)

Part 3 — Shared Infrastructure

3.1 Interp_Points Load-Balance Profiler

Files: AMSS_NCKU_source/interp_lb_profile.C, interp_lb_profile.h, interp_lb_profile_data.h, generate_interp_lb_header.py

A two-pass instrumentation system for load-balancing the Interp_Points parallel interpolation routine:

  • Pass 1 (INTERP_LB_MODE=profile): instrument each MPI rank's interpolation calls with timing, write a binary profile.
  • Pass 2 (INTERP_LB_MODE=optimize): read the profile and rebalance work across MPI ranks.

3.2 Helper Headers

Files: AMSS_NCKU_source/tool.h (33 lines), AMSS_NCKU_source/share_func.h (246 lines)

  • tool.h — shared indexing macros (idx_ex, idx_fh_F_ord2) and the symmetry_bd declaration used by all C kernel rewrites.
  • share_func.h — common utility functions shared across the C++ source files.

3.3 Plot-Only Restart Script

File: parallel_plot_helper.py (29 lines)

A lightweight restart script that skips recomputation when plotting was interrupted — reads existing checkpoint data and replots without re-running the simulation.


Performance Summary

Component Optimization Expected Impact
TwoPunctures Derivatives_AB3 Scalar loops → MKL GEMM 5-20× speedup for spectral derivative computation
TwoPunctures F_of_v OpenMP collapse(3) + stack-local variables Near-linear scaling with core count for residual evaluation
TwoPunctures gaussj Hand-written Gauss-Jordan → LAPACK LU 2-5× speedup for N~50 matrix inversion
BSSN RHS (GPU) Many small kernels → one monolithic kernel Eliminates kernel-launch overhead; 2-5× GPU throughput improvement
GPU state transfers Per-step download → resident model Eliminates ~80% of GPU↔CPU PCIe traffic
lopsided_kodis fusion Two symmetry_bd calls → one shared call ~30% reduction in ghost-zone fill cost for this operator pair
Memory allocator System malloc → Intel TBB malloc Significant reduction in malloc contention under OpenMP
C kernel rewrites Fortran → C with aligned alloc + static buffers Enables Intel compiler IPO across C/C++/Fortran boundaries; better SIMD codegen