d96ca6ed2a
Add two-node MPI launch configuration
2026-03-30 21:13:46 +08:00
60ad63e8cc
Isolate TwoPuncture from ABE OMP settings
2026-03-30 21:00:20 +08:00
087d034ee3
Use wall time for timestep logging
2026-03-30 20:38:41 +08:00
5f664716ab
Enable OpenMP task parallelism for C kernels
2026-03-30 20:34:34 +08:00
8c1f4d8108
Migrate the C kernels' loop fusion and temporary elimination
2026-03-03 16:20:15 +08:00
d310ef918b
bssn_rhs(fortran): migrate C kernel loop-fusion optimizations
2026-03-03 16:20:15 +08:00
b35e1b289f
Add a switch to disable memory statistics printing
2026-03-03 16:17:47 +08:00
05851b2c59
Disable static load balancing
2026-03-03 16:17:47 +08:00
3b39583d67
fix(bssn_rhs)
2026-03-03 16:06:33 +08:00
688bdb6708
Merge pull request 'cjy-dystopia' (#3) from cjy-dystopia into main
...
Reviewed-on: https://seele.tail3b303.ts.net:3000/64-BitBrainstorm_2026/AMSS-NCKU/pulls/3
2026-03-02 21:36:26 +08:00
5070134857
perf(transfer_cached): move the per-call new/delete req_node/req_is_recv/completed arrays into SyncCache for reuse
...
Avoids allocating and freeing three temporary arrays on every transfer_cached call, reducing heap-operation overhead.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-02 21:14:35 +08:00
4012e9d068
perf(RestrictProlong): replace the non-cached versions with Restrict_cached/OutBdLow2Hi_cached; switch Sync_finish to progressive unpacking
...
- In RestrictProlong/RestrictProlong_aux, replace Restrict() and OutBdLow2Hi() with their _cached versions, reusing the gridseg lists and MPI buffers to avoid reallocating on every call
- Add two sets of per-level caches, sync_cache_restrict and sync_cache_outbd
- Switch Sync_finish from MPI_Waitall to progressive unpacking with MPI_Waitsome, reducing tail latency
- Extend AsyncSyncState with req_node/req_is_recv/pending_recv fields to support progressive unpacking
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-02 20:48:38 +08:00
b3c367f15b
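prolong3 now computes the actual stencil window first; it takes the full-domain symmetry_bd path only when the window touches a symmetry boundary, and otherwise copies only the required window. restrict3 likewise uses a window check and, when no boundary is touched, fills only the required ii/jj/kk window.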
2026-03-02 17:38:56 +08:00
e73911f292
perf(restrict3): shrink X-pass ii sweep to required overlap window
...
- compute fi_min/fi_max from output i-range and derive ii_lo/ii_hi
- replace full ii sweep (-1:extf(1)) with windowed sweep in Z/Y precompute passes
- keep stencil math unchanged; add bounds sanity check for ii window
2026-03-02 17:37:13 +08:00
7543d3e8c7
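perf(MPatch): speed up block-ownership lookup in Interp_Points with a spatial bin index
...
- Introduce BlockBinIndex (candidate filtering + full-scan fallback) for the three Patch::Interp_Points overloads
- Keep the original point-in-block test and the subsequent interpolation/communication flow unchanged
- Reduce the per-point linear block scan from O(N_points*N_blocks) to roughly O(N_points*k)
- Testing: too large a bin limit adds unnecessary index-construction overhead, so the bin limit is set to 16.
Co-authored-by: gpt-5.3-codex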
2026-03-02 17:37:13 +08:00
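The candidate-filtering-plus-fallback idea in commit 7543d3e8c7 can be sketched as below. This is only an illustration under stated assumptions, not the repository's BlockBinIndex: the `Box` and `BinIndex` types, the cube-root bin count, and the per-axis cap of 16 are all hypothetical choices made for this example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical axis-aligned block; stands in for the real block type.
struct Box {
    double lo[3], hi[3];
    bool contains(const double p[3]) const {
        for (int d = 0; d < 3; ++d)
            if (p[d] < lo[d] || p[d] > hi[d]) return false;
        return true;
    }
};

// Hypothetical uniform bin index over the bounding box of all blocks.
class BinIndex {
public:
    explicit BinIndex(const std::vector<Box>& blocks, int max_bins = 16)
        : blocks_(blocks) {
        if (blocks_.empty()) return;
        for (int d = 0; d < 3; ++d) {
            lo_[d] = blocks_[0].lo[d];
            hi_[d] = blocks_[0].hi[d];
        }
        for (const Box& b : blocks_)
            for (int d = 0; d < 3; ++d) {
                lo_[d] = std::min(lo_[d], b.lo[d]);
                hi_[d] = std::max(hi_[d], b.hi[d]);
            }
        // Assumed heuristic: about cbrt(N_blocks) bins per axis, capped.
        const int guess = static_cast<int>(
            std::ceil(std::cbrt(static_cast<double>(blocks_.size()))));
        n_ = std::max(1, std::min(max_bins, guess));
        bins_.resize(static_cast<std::size_t>(n_) * n_ * n_);
        // Register each block in every bin its box overlaps.
        for (std::size_t ib = 0; ib < blocks_.size(); ++ib) {
            int a[3], b[3];
            for (int d = 0; d < 3; ++d) {
                a[d] = cell(blocks_[ib].lo[d], d);
                b[d] = cell(blocks_[ib].hi[d], d);
            }
            for (int k = a[2]; k <= b[2]; ++k)
                for (int j = a[1]; j <= b[1]; ++j)
                    for (int i = a[0]; i <= b[0]; ++i)
                        bins_[flat(i, j, k)].push_back(static_cast<int>(ib));
        }
    }

    // Returns the index of the first block containing p, or -1 if none does.
    int find(const double p[3]) const {
        bool inside = !bins_.empty();
        for (int d = 0; d < 3 && inside; ++d)
            inside = (p[d] >= lo_[d] && p[d] <= hi_[d]);
        if (inside) {
            // Candidate filtering: test only the blocks registered in p's bin.
            const std::size_t f = flat(cell(p[0], 0), cell(p[1], 1), cell(p[2], 2));
            for (int ib : bins_[f])
                if (blocks_[ib].contains(p)) return ib;
        }
        // Full-scan fallback keeps the original point-in-block semantics.
        for (std::size_t ib = 0; ib < blocks_.size(); ++ib)
            if (blocks_[ib].contains(p)) return static_cast<int>(ib);
        return -1;
    }

private:
    int cell(double x, int d) const {
        const double w = hi_[d] - lo_[d];
        if (w <= 0.0) return 0;
        const int c = static_cast<int>((x - lo_[d]) / w * n_);
        return std::min(std::max(c, 0), n_ - 1);
    }
    std::size_t flat(int i, int j, int k) const {
        return (static_cast<std::size_t>(k) * n_ + j) * n_ + i;
    }

    std::vector<Box> blocks_;
    double lo_[3] = {0.0, 0.0, 0.0};
    double hi_[3] = {0.0, 0.0, 0.0};
    int n_ = 1;
    std::vector<std::vector<int>> bins_;
};
```

Because `cell` is monotone in the coordinate, every block that contains a point is registered in that point's bin, so the fallback scan only runs for points that match no candidate. The cap on bins bounds construction cost, which is the trade-off the commit's testing note describes.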
42c69fab24
refactor(Parallel): streamline MPI communication by consolidating request handling and memory management
2026-03-02 17:37:13 +08:00
95220a05c8
optimize fdderivs core-region branch elimination for ghost_width=3
2026-03-02 17:33:26 +08:00
466b084a58
fix prolong/restrict index bounds after cherry-pick 12e1f63
2026-03-02 13:59:47 +08:00
61ccef9f97
prolong3: reduce redundant Z-pass computation
2026-03-02 13:58:52 +08:00
e11363e06e
Optimize fdderivs: skip redundant 2nd-order work in 4th-order overlap
2026-03-02 03:21:21 +08:00
f70e90f694
prolong3: improve cache hit rate
2026-03-02 03:05:35 +08:00
jaunatisblue
75dd5353b0
Modify prolong
2026-03-02 02:25:25 +08:00
jaunatisblue
23a82d063b
Optimize memory access in prolong3
2026-03-02 02:25:25 +08:00
524d1d1512
Merge pull request 'cjy-dystopia' (#2) from cjy-dystopia into main
...
Reviewed-on: https://seele.tail3b303.ts.net:3000/64-BitBrainstorm_2026/AMSS-NCKU/pulls/2
2026-03-01 19:22:09 +08:00
44efb2e08c
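Preliminary-round final version v1.0.0: PGO and the original load-balancing scheme were confirmed to be regressions in the current version and have been reverted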
2026-03-01 18:04:25 +08:00
16013081e0
Optimize symmetry_bd with stride-based fast paths
2026-03-01 15:50:56 +08:00
03416a7b28
perf(polint): add uniform-grid fast path for barycentric n=6
2026-03-01 13:26:39 +08:00
cca3c16c2b
perf(polint): add switchable barycentric ordn=6 path
2026-03-01 13:20:46 +08:00
e5231849ee
perf(polin3): switch to lagrange-weight tensor contraction
2026-03-01 13:04:33 +08:00
a766e49ff0
perf(polint): add ordn=6 specialized neville path
2026-03-01 12:39:53 +08:00
1a518cd3f6
Optimize average2: use DO CONCURRENT loop form
2026-03-01 00:41:32 +08:00
1dc622e516
Optimize average2: replace array expression with explicit loops
2026-03-01 00:33:01 +08:00
3046a0ccde
Optimize prolong3: hoist bounds check out of inner loop
2026-03-01 00:17:30 +08:00
d4ec69c98a
Optimize prolong3: replace parity branches with coefficient lookup
2026-02-28 23:59:57 +08:00
2c0a3055d4
Optimize prolong3: precompute coarse index/parity maps
2026-02-28 23:53:30 +08:00
1eba73acbe
Disable core pinning for now; speed comparison shows unpinned + SCX > pinned + SCX
2026-02-28 23:27:44 +08:00
b91cfff301
Add switchable C RK4 kernel and build toggle
2026-02-28 21:12:19 +08:00
e29ca2dca9
build: switch allocator option to oneTBB tbbmalloc
2026-02-28 17:16:00 +08:00
6493101ca0
bssn_rhs_c: recompute contracted Gamma terms to remove temp arrays
2026-02-28 16:34:23 +08:00
169986cde1
bssn_rhs_c: compute div_beta on-the-fly to remove temp array
2026-02-28 16:25:57 +08:00
1fbc213888
bssn_rhs_c: remove gxx/gyy/gzz temporaries in favor of dxx/dyy/dzz+1
2026-02-28 15:50:52 +08:00
6024708a48
derivs_c: split low/high stencil regions to reduce branch overhead
2026-02-28 15:42:31 +08:00
bc457d981e
bssn_rhs_c: merge lopsided+kodis with shared symmetry buffer
2026-02-28 15:23:01 +08:00
51dead090e
bssn_rhs_c: fuse the two final RHS loops into one, passing the fij intermediates through local variables (Modify 6)
...
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-28 13:49:45 +08:00
34d6922a66
fdderivs_c: zero only the boundary faces instead of the whole array, reducing wasted memory writes
...
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-28 13:20:06 +08:00
8010ad27ed
kodiss_c: tighten loop bounds to eliminate useless boundary iterations and branch checks
...
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-28 13:04:21 +08:00
38e691f013
bssn_rhs_c: fuse the Christoffel-correction and trK_rhs loops into one (Modify 5)
...
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-28 12:57:07 +08:00
808387aa11
bssn_rhs_c: fuse the fxx/Gamxa and Gamma_rhs_part2 loops into one (Modify 4)
...
fxx/fxy/fxz and Gamxa/ya/za are kept in local scalars and reused directly in Gamma_rhs part2, reducing array reads and writes
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-28 11:14:35 +08:00
c2b676abf2
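bssn_rhs_c: fuse the A^{ij} index-raising and Gamma_rhs_part1 loops into one (Modify 3)
...
The six A^{ij} components are kept in local scalars and reused directly in the Gamma_rhs computation, reducing extra reads of the Rxx..Ryz arrays
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>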
2026-02-28 11:02:27 +08:00
2c60533501
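bssn_rhs_c: fuse the inverse-metric, Gamma-constraint, and Christoffel loops into one (Modify 2)
...
The inverse-metric results are kept in local scalars and reused directly, reducing repeated reads of the gupxx..gupzz arrays; saves 0.01 s per step
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>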
2026-02-28 10:57:40 +08:00
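The loop-fusion pattern shared by the bssn_rhs_c commits above (Modify 2 through Modify 6) can be illustrated with a toy kernel. This is a minimal sketch, not the BSSN right-hand side: the function names, the single scalar field `g` standing in for a metric component, and the formula itself are assumptions made only to show how a full-size temporary becomes a local scalar.

```cpp
#include <cstddef>
#include <vector>

// Unfused form: pass 1 writes the inverse to a temporary array,
// pass 2 reads it back to build the result.
void rhs_unfused(const std::vector<double>& g,
                 const std::vector<double>& a,
                 std::vector<double>& rhs) {
    const std::size_t n = g.size();
    std::vector<double> ginv(n);  // full-size temporary array
    for (std::size_t i = 0; i < n; ++i)
        ginv[i] = 1.0 / g[i];
    for (std::size_t i = 0; i < n; ++i)
        rhs[i] = ginv[i] * a[i] + ginv[i] * ginv[i];
}

// Fused form: one pass; the inverse lives in a local scalar and is
// reused immediately, so no temporary array is written or re-read.
void rhs_fused(const std::vector<double>& g,
               const std::vector<double>& a,
               std::vector<double>& rhs) {
    const std::size_t n = g.size();
    for (std::size_t i = 0; i < n; ++i) {
        const double ginv = 1.0 / g[i];
        rhs[i] = ginv * a[i] + ginv * ginv;
    }
}
```

Both functions perform the same arithmetic per grid point; the fused form trades one array write and two array reads per point for a register-resident value, which is where a saving of this kind would come from.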