6fd7ef2b55
Cache GPU RHS symbols and zero vacuum sources once
2026-04-12 22:42:58 +08:00
7064ebd5b4
Batch GPU stage downloads
2026-04-12 21:06:41 +08:00
87c581ea7c
Checkpoint stable GPU optimization baseline
2026-04-12 20:26:27 +08:00
d702aa06b9
Trim GPU restrict sync overhead
2026-04-12 19:45:34 +08:00
ce88c18265
Tune GPU RHS launch geometry
2026-04-12 18:59:59 +08:00
db2d6978b2
Reduce final GPU host downloads
2026-04-12 18:46:42 +08:00
c8977d8356
Optimize GPU RK4 stage sync path
2026-04-12 18:36:05 +08:00
d9287ea530
Fix GPU RK4 boundary and sync correctness
2026-04-12 12:13:47 +08:00
b78874ef21
Refine stable GPU AMR staging path
2026-04-10 23:37:36 +08:00
a089041c3b
Stabilize GPU AMR prolong/restrict paths
2026-04-10 21:57:58 +08:00
c578a15ecd
Fix GPU interpolation cache lifetime leaks
2026-04-10 10:29:04 +08:00
e1a0bff43c
Reduce redundant GPU host buffer preparation
2026-04-09 21:20:45 +08:00
cf3c6d6218
Stabilize GPU buffer lifecycle around regrid
2026-04-09 20:48:06 +08:00
46e94d1248
Trim constraint-only GPU downloads
2026-04-09 19:36:19 +08:00
7cd2414faa
Move constraint recomputation onto GPU path
2026-04-09 19:17:39 +08:00
4463f1d23e
Unpack intermediate sync stages directly to GPU
2026-04-09 19:01:12 +08:00
4484635f0d
Pack sync send buffers directly from GPU state
2026-04-09 18:49:11 +08:00
b0dd069a2b
Register GPU transfer buffers as pinned host memory
2026-04-09 18:36:10 +08:00
5bc67ded06
Download staged GPU sync regions incrementally
2026-04-09 18:23:05 +08:00
3b16795e78
Refresh synced GPU regions incrementally
2026-04-09 17:07:31 +08:00
5b00d49070
Reduce staged GPU host-device copies
2026-04-09 16:44:08 +08:00
42e851d19a
Cache repeated interpolation plans
2026-04-09 15:21:01 +08:00
06fa643365
Refine batched CUDA interpolation kernel
2026-04-09 15:06:11 +08:00
c47349b7a9
Add batched CUDA patch interpolation path
2026-04-09 14:56:01 +08:00
ad999e4c5a
Add guarded GPU prolong3 path scaffold
2026-04-09 14:28:36 +08:00
e1e3b4a448
Reduce GPU RK4 transfer overhead
2026-04-09 12:11:40 +08:00
49409645c0
Stabilize GPU output path and MPI sync
2026-04-09 10:57:49 +08:00
4e3946a4f0
Persist GPU RK4 stage caches
2026-04-08 20:59:15 +08:00
a0af9b8804
Trim GPU main-path transfer overhead
2026-04-08 20:16:25 +08:00
01ac1f9250
Cache GPU main-path device buffers
2026-04-08 19:43:17 +08:00
ea470737db
Add runnable GPU main-path prototype
2026-04-08 19:14:37 +08:00
8c1f4d8108
Migrate the C kernel's loop fusion and temporary-variable elimination
2026-03-03 16:20:15 +08:00
d310ef918b
bssn_rhs(fortran): migrate C kernel loop-fusion optimizations
2026-03-03 16:20:15 +08:00
b35e1b289f
Add a switch to turn off memory statistics printing
2026-03-03 16:17:47 +08:00
05851b2c59
Disable static load balancing
2026-03-03 16:17:47 +08:00
3b39583d67
fix(bssn_rhs)
2026-03-03 16:06:33 +08:00
688bdb6708
Merge pull request 'cjy-dystopia' (#3) from cjy-dystopia into main
Reviewed-on: https://seele.tail3b303.ts.net:3000/64-BitBrainstorm_2026/AMSS-NCKU/pulls/3
2026-03-02 21:36:26 +08:00
5070134857
perf(transfer_cached): move the per-call new/delete req_node/req_is_recv/completed arrays into SyncCache for reuse
Avoids allocating and freeing three temporary arrays on every transfer_cached call, reducing heap-operation overhead.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-02 21:14:35 +08:00
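Note on the transfer_cached change: the scratch-array reuse it describes could look roughly like the sketch below. This is not the repository's code. `SyncCacheSketch`, `ensure_capacity`, `grow_count`, and `transfer_cached_sketch` are hypothetical names, and only the three array names (req_node, req_is_recv, completed) come from the commit message. The element types are assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for SyncCache: it owns the three scratch arrays
// so they survive across calls instead of being new'ed/deleted each time.
struct SyncCacheSketch {
    std::vector<int> req_node;     // assumed element type
    std::vector<int> req_is_recv;  // assumed element type
    std::vector<int> completed;    // assumed element type
    std::size_t grow_count = 0;    // counts real (re)allocations, for illustration

    // Grow only when the request count exceeds what is already held.
    void ensure_capacity(std::size_t n) {
        if (req_node.size() < n) {
            req_node.resize(n);
            req_is_recv.resize(n);
            completed.resize(n);
            ++grow_count;
        }
    }
};

// Hypothetical transfer routine: borrows scratch space from the cache.
inline void transfer_cached_sketch(SyncCacheSketch& cache, std::size_t n_requests) {
    cache.ensure_capacity(n_requests);
    assert(cache.req_node.size() >= n_requests);
    for (std::size_t i = 0; i < n_requests; ++i) {
        cache.req_node[i] = static_cast<int>(i);
        cache.req_is_recv[i] = static_cast<int>(i % 2);
        cache.completed[i] = 0;
    }
}
```

Repeated calls with the same or a smaller request count then touch the heap only once, which is the saving the commit message claims.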
4012e9d068
perf(RestrictProlong): replace the non-cached versions with Restrict_cached/OutBdLow2Hi_cached; switch Sync_finish to progressive unpacking
- In RestrictProlong/RestrictProlong_aux, replace Restrict() and OutBdLow2Hi() with their _cached versions,
  reusing the gridseg lists and MPI buffers instead of reallocating them on every call
- Add two sets of per-level caches, sync_cache_restrict/sync_cache_outbd
- Switch Sync_finish from MPI_Waitall to progressive unpacking with MPI_Waitsome, reducing tail latency
- Extend AsyncSyncState with the req_node/req_is_recv/pending_recv fields to support progressive unpacking
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-02 20:48:38 +08:00
b3c367f15b
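prolong3 now computes the actual stencil window first; it takes the full-domain symmetry_bd path only when the window touches a symmetry boundary, and otherwise copies only the required window. restrict3 uses the same window check and fills only the required ii/jj/kk window when no boundary is touched.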
2026-03-02 17:38:56 +08:00
e73911f292
perf(restrict3): shrink X-pass ii sweep to required overlap window
- compute fi_min/fi_max from output i-range and derive ii_lo/ii_hi
- replace full ii sweep (-1:extf(1)) with windowed sweep in Z/Y precompute passes
- keep stencil math unchanged; add bounds sanity check for ii window
2026-03-02 17:37:13 +08:00
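Note on the restrict3 window change: the sketch below illustrates the general idea of deriving a reduced sweep window from the output range. It is not the Fortran routine. The index mapping (coarse index i reads fine indices 2*i - half_width through 2*i + half_width) is an assumption made for illustration, and `Window` and `required_fine_window` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>

// Inclusive index window [lo, hi].
struct Window {
    int lo;
    int hi;
};

// Assumed mapping: coarse output index i reads fine indices
// 2*i - half_width .. 2*i + half_width. The real stencil may differ.
inline Window required_fine_window(int i_lo, int i_hi, int half_width,
                                   int full_lo, int full_hi) {
    const int fi_min = 2 * i_lo - half_width;  // lowest fine index any output needs
    const int fi_max = 2 * i_hi + half_width;  // highest fine index any output needs
    // Clamp to the range the full sweep would have covered.
    Window w{std::max(fi_min, full_lo), std::min(fi_max, full_hi)};
    assert(w.lo <= w.hi);  // bounds sanity check on the window
    return w;
}
```

The precompute passes would then loop over `w.lo..w.hi` instead of the full `full_lo..full_hi` range, leaving the stencil arithmetic itself untouched.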
7543d3e8c7
perf(MPatch): speed up block-ownership lookup in Interp_Points with a spatial bin index
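- Introduce BlockBinIndex (candidate filtering + full-scan fallback) for the three overloads of Patch::Interp_Points
- Keep the original point-in-block test and the subsequent interpolation/communication flow unchanged
- Reduce the per-point linear block scan from O(N_points*N_blocks) to roughly O(N_points*k)
- Testing: an overly large bin cap adds unnecessary index-construction overhead, so the bin cap is set to 16.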
Co-authored-by: gpt-5.3-codex
2026-03-02 17:37:13 +08:00
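Note on the BlockBinIndex change: a one-dimensional sketch of "candidate filtering + full-scan fallback" is given below. It is not the repository's BlockBinIndex. `Block1D` and `BlockBinIndexSketch` are hypothetical, blocks are reduced to half-open intervals, and only the bin cap of 16 is taken from the commit message.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Block1D { double lo, hi; };  // hypothetical block: half-open [lo, hi)

class BlockBinIndexSketch {
public:
    explicit BlockBinIndexSketch(const std::vector<Block1D>& blocks, int max_bins = 16)
        : blocks_(blocks) {
        assert(!blocks_.empty());
        xmin_ = blocks_[0].lo;
        xmax_ = blocks_[0].hi;
        for (const Block1D& b : blocks_) {
            xmin_ = std::min(xmin_, b.lo);
            xmax_ = std::max(xmax_, b.hi);
        }
        // Cap the bin count so index construction stays cheap.
        nbins_ = std::max(1, std::min<int>(max_bins, static_cast<int>(blocks_.size())));
        bins_.resize(nbins_);
        // Register each block in every bin its interval overlaps.
        for (int id = 0; id < static_cast<int>(blocks_.size()); ++id) {
            for (int b = bin_of(blocks_[id].lo); b <= bin_of(blocks_[id].hi); ++b)
                bins_[b].push_back(id);
        }
    }

    // Returns the index of the block containing x, or -1 if none does.
    int find(double x) const {
        for (int id : bins_[bin_of(x)])      // candidate filtering
            if (contains(id, x)) return id;
        for (int id = 0; id < static_cast<int>(blocks_.size()); ++id)
            if (contains(id, x)) return id;  // full-scan fallback
        return -1;
    }

private:
    bool contains(int id, double x) const {
        return x >= blocks_[id].lo && x < blocks_[id].hi;
    }
    int bin_of(double x) const {
        if (x <= xmin_) return 0;
        if (x >= xmax_) return nbins_ - 1;
        const double width = (xmax_ - xmin_) / nbins_;
        return std::min(nbins_ - 1, static_cast<int>((x - xmin_) / width));
    }

    std::vector<Block1D> blocks_;
    double xmin_ = 0.0, xmax_ = 0.0;
    int nbins_ = 1;
    std::vector<std::vector<int>> bins_;
};
```

Each lookup tests only the few blocks registered in one bin, and the fallback scan keeps the result identical to a plain linear search when the bin yields no hit.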
42c69fab24
refactor(Parallel): streamline MPI communication by consolidating request handling and memory management
2026-03-02 17:37:13 +08:00
95220a05c8
optimize fdderivs core-region branch elimination for ghost_width=3
2026-03-02 17:33:26 +08:00
466b084a58
fix prolong/restrict index bounds after cherry-pick 12e1f63
2026-03-02 13:59:47 +08:00
61ccef9f97
prolong3: reduce redundant Z-pass computation
2026-03-02 13:58:52 +08:00
e11363e06e
Optimize fdderivs: skip redundant 2nd-order work in 4th-order overlap
2026-03-02 03:21:21 +08:00
f70e90f694
prolong3: improve cache hit rate
2026-03-02 03:05:35 +08:00
jaunatisblue
75dd5353b0
Modify prolong
2026-03-02 02:25:25 +08:00
jaunatisblue
23a82d063b
Optimize memory access in prolong3
2026-03-02 02:25:25 +08:00