Remove the compile-time #error that blocked USE_CUDA_Z4C + WithShell.
Add GPU-to-CPU state sync at the start of both Z4C Step functions
(non-CPBC and CPBC) so shell CPU consumers read valid field data
after Cartesian GPU RHS with resident state.
Move bssn_cuda_use_resident_sync and bssn_cuda_download_level_state
_if_present from anonymous namespace to file scope in bssn_class.C
so derived classes (Z4C) can call them. Declare both in
bssn_rhs_cuda.h. Include bssn_rhs_cuda.h in Z4c_class.C.
Z4C shell RHS remains on CPU (Fortran Z4c_rhs_ss.f90) pending
future GPU kernel implementation.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Static scheduling has lower overhead than guided for uniform workloads
(grid points all have equal computational cost).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Split prolongpointstru into search-only (prolongpointstru_search) and
append-only (prolongpointstru_append) functions. The search is read-only
and thread-safe; each thread builds private linked lists via
prolongpointstru_append, merged after the parallel loop.
This eliminates critical-section contention and delivers ~2.2x speedup:
setupintintstuff: 511s -> 252s, total init: 592s -> 267s.
Also add -qopenmp to ShellPatch.o compilation via makefile override rule
and <omp.h> include with _OPENMP guards + fallback stubs.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Print MPI_Wtime breakdown of Initialize() shell setup steps and
Read_Ansorg::Compute_Constraint duration. Reveals that
ShellPatch::setupintintstuff() takes ~511s of the ~590s startup.
The function builds interpolation tables by searching every shell
grid point against all Cartesian patches — thread-safe OpenMP
parallelization is blocked by shared linked-list mutations in
prolongpointstru(), which would need a search/append split first.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Limit GPU shell RHS redirection to Step and SHStep only via #define/#undef.
Compute_Constraint, Interp_Constraint, and Constraint_Out continue using
the CPU Fortran path to avoid GPU alloc-per-call overhead during
initialization and analysis phases.
Also: wrap compare_result_gpu in #ifdef RESULT_CHECK to avoid link error.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Download BSSN StateList from GPU to CPU before AHFinderDirect_find_horizons
so that AH_Interp_Points reads valid field data instead of stale CPU arrays.
The resident-sync path keeps canonical state on GPU; without this download the
Newton iteration diverges and probes outside the computational domain.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>