Wrap each C kernel in #if (ghost_width == N) blocks matching Fortran stencil
coefficients from diff_new.f90, kodiss.f90, and lopsidediff.f90. Add fast-path
indexing for ord=1,4,5 in share_func.h.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Include macrodef.h (not macrodef.fh) in gpu_rhsSS_mem.h and
bssn_gpu.h so that ABEtype is visible to #if guards in CUDA files.
Remove the separate z4c_gpu_rhs_ss.cu (merged into bssn_gpu_rhs_ss.cu).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the duplicated z4c_gpu_rhs_ss.cu with a lightweight
gpu_rhs_z4c_ss wrapper inside bssn_gpu_rhs_ss.cu (guarded by
#if ABEtype==2). The wrapper:
1. Builds trKd = trK + 2*TZ on host and passes it to gpu_rhs_ss
2. After BSSN GPU returns, computes TZ_rhs = alpn1*Hcon/2 and
applies kappa1/kappa2 constraint damping on CPU
This avoids duplicate kernel definitions (linker errors) and
keeps all shell GPU code in a single file. The CPU-side Z4C
corrections are O(100K) operations — negligible vs GPU RHS time.
Also remove the separate z4c_gpu_rhs_ss.cu and its build rule.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Create z4c_gpu_rhs_ss.cu (reusing BSSN shell FD/chain-rule kernels):
- Uploads trKd = trK + 2*TZ to GPU so existing BSSN algebraic kernels
compute correct Z4C physical equations without modification
- New kern_z4c_post applies TZ_rhs = alpn1 * Hcon / 2, kappa1/kappa2
constraint damping, TZ advection (lopsided), and dissipation (kodis)
- Adds TZ/TZ_rhs to Meta struct, alloc/upload/download/free lifecycle
Add cuda_compute_rhs_z4c_ss() wrapper in Z4c_class.C matching the
Fortran f_compute_rhs_Z4c_ss signature, with #define redirection for
Step/SHStep call sites and #undef before analysis functions.
Add z4c_gpu_rhs_ss.o to ABE_CUDA_CFILES and build rule in makefile.
Add kappa1_c/kappa2_c constants to gpu_rhsSS_mem.h.
Build verified with USE_CUDA_Z4C=1 + WithShell — compiles and links
cleanly. All three Shell GPU files now coexist: bssn_gpu_rhs_ss.o
(BSSN), z4c_gpu_rhs_ss.o (Z4C), both sharing FD/chain-rule kernels.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Remove the compile-time #error that blocked USE_CUDA_Z4C + WithShell.
Add GPU-to-CPU state sync at the start of both Z4C Step functions
(non-CPBC and CPBC) so shell CPU consumers read valid field data
after Cartesian GPU RHS with resident state.
Move bssn_cuda_use_resident_sync and bssn_cuda_download_level_state
_if_present from anonymous namespace to file scope in bssn_class.C
so derived classes (Z4C) can call them. Declare both in
bssn_rhs_cuda.h. Include bssn_rhs_cuda.h in Z4c_class.C.
Z4C shell RHS remains on CPU (Fortran Z4c_rhs_ss.f90) pending
future GPU kernel implementation.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Static scheduling has lower overhead than guided for uniform workloads
(grid points all have equal computational cost).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Split prolongpointstru into search-only (prolongpointstru_search) and
append-only (prolongpointstru_append) functions. The search is read-only
and thread-safe; each thread builds private linked lists via
prolongpointstru_append, merged after the parallel loop.
This eliminates critical-section contention and delivers ~2.2x speedup:
setupintintstuff: 511s -> 252s, total init: 592s -> 267s.
Also add -qopenmp to ShellPatch.o compilation via makefile override rule
and <omp.h> include with _OPENMP guards + fallback stubs.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Print MPI_Wtime breakdown of Initialize() shell setup steps and
Read_Ansorg::Compute_Constraint duration. Reveals that
ShellPatch::setupintintstuff() takes ~511s of the ~590s startup.
The function builds interpolation tables by searching every shell
grid point against all Cartesian patches — thread-safe OpenMP
parallelization is blocked by shared linked-list mutations in
prolongpointstru(), which would need a search/append split first.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Limit GPU shell RHS redirection to Step and SHStep only via #define/#undef.
Compute_Constraint, Interp_Constraint, and Constraint_Out continue using
the CPU Fortran path to avoid GPU alloc-per-call overhead during
initialization and analysis phases.
Also: wrap compare_result_gpu in #ifdef RESULT_CHECK to avoid link error.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Download BSSN StateList from GPU to CPU before AHFinderDirect_find_horizons
so that AH_Interp_Points reads valid field data instead of stale CPU arrays.
The resident-sync path keeps canonical state on GPU; without this download the
Newton iteration diverges and probes outside the computational domain.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>