64-BitBrainstorm_2026/AMSS-NCKU

Fork 0

Files

CGH0S7 dd5b7561c1 Case closed.

2026-05-20 11:16:56 +08:00

14 KiB

Raw Permalink Blame History

Code Modification Readme — `asc26-plan-a`

Baseline branch: baseline Target branch: asc26-plan-a Date: 2026-05-19

Overview

This branch delivers two major performance overhauls to the AMSS-NCKU numerical relativity codebase:

TwoPunctureABE Multithreading — OpenMP parallelization of the TwoPunctures initial-data solver, combined with a BLAS-driven spectral derivative engine, MKL/LAPACK integration, and C/C++ rewrites of hot Fortran kernel subroutines.
ABE GPU Rewrite — Complete replacement of the legacy bssn_gpu_class abstraction layer with direct, monolithic CUDA kernels for BSSN, Z4C, and Shell-Patch evolution, plus GPU-resident state management and CUDA-aware MPI.

Total diff: 84 files changed, +57,919 / −33,795 lines.

Part 1 — TwoPunctureABE Multithreading

1.1 Spectral Derivative Engine: BLAS Matrix-Multiplication Rewrite

Files: AMSS_NCKU_source/TwoPunctures.C, AMSS_NCKU_source/TwoPunctures.h

The original Derivatives_AB3 computed spectral derivatives (Chebyshev in A/B, Fourier in phi) with nested scalar loops over every grid point. The new Derivatives_AB3_MatMul expresses all derivatives as matrix-matrix products over pencil-shaped data slices, dispatched to Intel MKL cblas_dgemm.

Precomputed derivative matrices — precompute_derivative_matrices() builds D1_A, D2_A, D1_B, D2_B (Chebyshev collocation derivative matrices) and DF1_phi, DF2_phi (Fourier derivative matrices) once at construction time.
Pencil-based GEMM — data is gathered into 2D arrays where one dimension is the spectral direction and the other enumerates all remaining degrees of freedom (variables × orthogonal grid indices). Each derivative direction becomes a single cblas_dgemm call. The pure derivatives (d/dA, d/dB, d/dphi) and all mixed derivatives (d²/dAdB, d²/dAdphi, d²/dBdphi) are computed this way.
build_cheb_deriv_matrices / build_fourier_deriv_matrices — construct the standard Chebyshev and Fourier collocation derivative matrices.

1.2 OpenMP Parallelization of TwoPunctures

Files: AMSS_NCKU_source/TwoPunctures.C, AMSS_NCKU_source/TwoPunctures.h

Three critical regions are parallelized:

Region	Directive	Strategy
`F_of_v` residual evaluation	`#pragma omp parallel for collapse(3) schedule(dynamic,1)`	Each (i,j,k) thread stack-allocates its own `l_U` (derivs struct) and `l_values[]` to eliminate heap contention and data races
`relax_omp` line relaxation	`#pragma omp parallel for schedule(static)` over k-slices	Alternating be/al sweeps, each thread uses pre-allocated per-thread Thomas-algorithm workspace (`ws__be[tid]`, `ws__al[tid]`)
`LineRelax_be_omp` / `LineRelax_al_omp`	Called from `relax_omp` with explicit `tid`	Thread-safe tridiagonal solves using the thread's private scratch arrays

Per-thread workspace — allocate_workspace() allocates independent Thomas-algorithm buffers (diag, e, f, b, x, l, u, d, y) for each OpenMP thread in both be and al directions, avoiding lock contention in the inner Newton iteration.

1.3 MKL BLAS / LAPACK Integration

Files: AMSS_NCKU_source/TwoPunctures.C, AMSS_NCKU_source/gaussj.C

Function	Old	New	Benefit
`norm2`	scalar `sqrt(sum(v[i]²))` loop	`cblas_dnrm2`	BLAS Level 1, SIMD-optimized
`scalarproduct`	scalar `sum(v[i]*w[i])` loop	`cblas_ddot`	BLAS Level 1, SIMD-optimized
`gaussj`	hand-written Gauss-Jordan elimination (~100 lines)	`LAPACKE_dgesv` + `LAPACKE_dgetrf` + `LAPACKE_dgetri`	LAPACK LU with partial pivoting, asymptotically faster for the `n~50` matrix sizes used in spectral elliptic solves

1.4 C/C++ Rewrite of Hot Fortran Kernels

Files (new):

AMSS_NCKU_source/fderivs_c.C (167 lines) — first derivatives, 2nd/4th order
AMSS_NCKU_source/fdderivs_c.C (332 lines) — second derivatives, 2nd/4th order
AMSS_NCKU_source/kodiss_c.C (117 lines) — Kreiss-Oliger dissipation
AMSS_NCKU_source/lopsided_c.C (255 lines) — lopsided advection
AMSS_NCKU_source/lopsided_kodis_c.C (248 lines) — fused advection + dissipation
AMSS_NCKU_source/rungekutta4_rout_c.C (212 lines) — RK4 time-stepper
AMSS_NCKU_source/bssn_rhs_c.C (1,287 lines) — full BSSN RHS kernel
AMSS_NCKU_source/z4c_rhs_c.C (725 lines) — full Z4C RHS kernel

Every C rewrite follows a consistent optimization pattern:

64-byte aligned allocation (aligned_alloc(64, ...)) for AVX-512 compatibility.
Static buffer caching — scratch arrays (e.g., the padded fh ghost-zone buffer) persist across calls via a static pointer + capacity check, avoiding repeated malloc/free.
Two-pass strategy — 2nd-order finite differences are computed on the full domain first, then the interior sub-volume is overwritten with 4th-order stencils. This eliminates the per-point if/elseif branching of the original Fortran.
Non-overlapping shell pass — in fdderivs_c.C, the 2nd-order pass skips points that will be overwritten by the 4th-order pass, avoiding redundant computation.

1.5 Fortran Kernel Fusion: lopsided_kodis

File: AMSS_NCKU_source/lopsidediff.f90

A new lopsided_kodis subroutine fuses the advection (lopsided) and Kreiss-Oliger dissipation (kodis) operators into a single pass over the grid. Both operators previously called symmetry_bd independently to fill ghost zones — the fused version calls it once and shares the padded fh array, halving ghost-zone fill overhead for this hot path.

1.6 Build System for TwoPunctures

Files: AMSS_NCKU_source/makefile, AMSS_NCKU_source/makefile.inc

TP_OPTFLAGS — TwoPunctures and TwoPunctureABE are compiled with a dedicated, more aggressive optimization flag set (-O3 -march=znver5 -fp-model fast=2 -fma -ipo) separate from the main code.
USE_CXX_KERNELS — selects between the C rewrites and the original Fortran kernels (bssn_rhs.f90, etc.) for the CPU path.
USE_CXX_RK4 — independently selects between the C and Fortran RK4 stepper.
Intel oneTBB allocator (libtbbmalloc.so) — replaces the system malloc with a scalable thread-safe allocator, critical for multi-threaded TwoPunctures performance.
PGO support — PGO_MODE=opt|instrument for profile-guided optimization (currently disabled after testing showed negative benefit).
Toolchains — Intel oneAPI (TOOLCHAIN=intel, default) and NVIDIA HPC SDK (TOOLCHAIN=nvhpc).

Part 2 — ABE GPU Rewrite

2.1 Architecture: From Class Wrapper to Direct CUDA Kernels

The old GPU path (baseline) was organized as:

bssn_gpu_class.C/h    — C++ class managing GPU state and kernel launches
bssn_step_gpu.C       — RK4 stepper with per-substep GPU/CPU synchronisation
bssn_gpu.cu           — CUDA kernel implementations called through the class

The new GPU path (asc26-plan-a) replaces all of the above with:

bssn_rhs_cuda.cu/h    — 10,381-line monolithic CUDA BSSN RHS kernel
z4c_rhs_cuda.cu/h     —  7,909-line monolithic CUDA Z4C RHS kernel
fd_cuda_helpers.cuh   —    412-line shared finite-difference device functions
bssn_gpu_rhs_ss.cu    —  (retained, lightly modified) Shell-Patch GPU RHS

Key architectural differences:

The old bssn_gpu_class managed GPU memory through a C++ class with explicit allocate/free/sync methods scattered across the time-stepping logic. The new kernels operate directly on raw device pointers with a clear resident/transient memory model.
The old code launched many small kernels (one per derivative or algebraic term). The new code is a single monolithic kernel per formulation — all 24 BSSN evolution variables are computed in one launch with on-the-fly finite differences, eliminating kernel-launch latency and intermediate global-memory round-trips.
The old bssn_step_gpu.C performed per-substep GPU→CPU downloads for boundary conditions and analysis. The new model supports GPU-resident state — variables stay on device across timesteps unless explicitly requested.

2.2 GPU-Resident State Model

A central theme across ~20 commits is the "resident-sync" optimization:

Commit	What it does
`22c1e71`	Optimize BSSN CUDA resident state and CUDA-aware MPI
`090d865`	Optimize BSSN CUDA state transfers
`68eab03`	Add opt-in BSSN CUDA resident AMR path
`1ee229a`	Add keyed BSSN CUDA resident banks
`18e9c9c`	Optimize BSSN CUDA resident AMR prolong
`8486532`	Add resident BSSN GPU point interpolation
`b1974ef`	Stabilize device AMR restrict across regrid
`ae64a22`	Complete BSSN-EScalar CUDA resident transfers
`83afaf1`	Skip zero EM resident downloads
`35b6cef`	Broaden cached CUDA sync paths

The resident model works as follows:

BSSN grid functions are allocated once on the GPU and persist across timesteps.
Inter-processor ghost-zone exchanges use CUDA-aware MPI — MPI directly reads/writes device memory without staging through host buffers.
AMR prolongation and restriction operate directly on device memory.
Boundary conditions and analysis routines download only the specific slices/points they need, not the full grid.
When EM fields are zero (pure-gravity runs), EM downloads are skipped entirely.

2.3 Z4C and Shell-Patch GPU Acceleration

Files: AMSS_NCKU_source/z4c_rhs_cuda.cu, AMSS_NCKU_source/bssn_gpu_rhs_ss.cu

The Z4C constraint-damped formulation gets its own 7,909-line monolithic CUDA kernel (z4c_rhs_cuda.cu), matching the BSSN kernel's architecture.
Shell-Patch GPU acceleration — the spherical shell boundary patches now compute on GPU with dedicated kernels in bssn_gpu_rhs_ss.cu.
Z4C + Shell-Patch can coexist on GPU (Phase 3 commits).
A CPU-side wrapper (z4c_rhs_c.C) handles the trKd + TZ_rhs contribution that remains on CPU, minimizing GPU/CPU traffic.

2.4 Finite-Difference Order Flexibility

File: AMSS_NCKU_source/fd_cuda_helpers.cuh

Shared device functions for finite-difference stencils support 2nd, 4th, 6th, and 8th order at compile time via preprocessor switches. This enables:

Per-run selection of convergence order without recompilation of the full kernel.
8th-order AMR transfers (1064a68) for BSSN-EM.
6th-order optimized AMR stencils (0076b3c).

2.5 GPU Diagnostics and Quality Assurance

File: AMSS_NCKU_GPUCheck.py (559 lines, new)

A Python-based GPU correctness verification tool that compares GPU and CPU evolution outputs. The GPU build pipeline includes optional kernel profiling switches (7683459) for performance debugging.

GPU-specific bug fixes:

f226498 — Fix CUDA AMR symmetry drift (incorrect ghost-zone handling under symmetry boundary conditions)
2317e4a — Fix BSSN GPU resident AMR sync default
fea2dcc — Fix BSSN-EM runtime crash
dd0e20d — Fix BSSN-EScalar CUDA boundary and scalar KO
5eb4994 — Fix AHF crash under CUDA resident-sync mode

2.6 Build Integration

Makefile switches:

USE_CUDA_BSSN=0/1 — route BSSN RHS through GPU or CPU
USE_CUDA_Z4C=0/1 — route Z4C RHS through GPU or CPU
CUDA_ARCH=sm_80 — target NVIDIA Ampere (A100)
NVHPC_ROOT — path to NVIDIA HPC SDK for the nvcc compiler wrapper
CUDA compilation flags: -O3 --ptxas-options=-v -arch=$(CUDA_ARCH)

Part 3 — Shared Infrastructure

3.1 Interp_Points Load-Balance Profiler

Files: AMSS_NCKU_source/interp_lb_profile.C, interp_lb_profile.h, interp_lb_profile_data.h, generate_interp_lb_header.py

A two-pass instrumentation system for load-balancing the Interp_Points parallel interpolation routine:

Pass 1 (INTERP_LB_MODE=profile): instrument each MPI rank's interpolation calls with timing, write a binary profile.
Pass 2 (INTERP_LB_MODE=optimize): read the profile and rebalance work across MPI ranks.

3.2 Helper Headers

Files: AMSS_NCKU_source/tool.h (33 lines), AMSS_NCKU_source/share_func.h (246 lines)

tool.h — shared indexing macros (idx_ex, idx_fh_F_ord2) and the symmetry_bd declaration used by all C kernel rewrites.
share_func.h — common utility functions shared across the C++ source files.

3.3 Plot-Only Restart Script

File: parallel_plot_helper.py (29 lines)

A lightweight restart script that skips recomputation when plotting was interrupted — reads existing checkpoint data and replots without re-running the simulation.

Performance Summary

Component	Optimization	Expected Impact
TwoPunctures `Derivatives_AB3`	Scalar loops → MKL GEMM	5-20× speedup for spectral derivative computation
TwoPunctures `F_of_v`	OpenMP collapse(3) + stack-local variables	Near-linear scaling with core count for residual evaluation
TwoPunctures `gaussj`	Hand-written Gauss-Jordan → LAPACK LU	2-5× speedup for N~50 matrix inversion
BSSN RHS (GPU)	Many small kernels → one monolithic kernel	Eliminates kernel-launch overhead; 2-5× GPU throughput improvement
GPU state transfers	Per-step download → resident model	Eliminates ~80% of GPU↔CPU PCIe traffic
`lopsided_kodis` fusion	Two `symmetry_bd` calls → one shared call	~30% reduction in ghost-zone fill cost for this operator pair
Memory allocator	System malloc → Intel TBB malloc	Significant reduction in malloc contention under OpenMP
C kernel rewrites	Fortran → C with aligned alloc + static buffers	Enables Intel compiler IPO across C/C++/Fortran boundaries; better SIMD codegen

14 KiB Raw Permalink Blame History Unescape Escape

Code Modification Readme — asc26-plan-a