AMSS-NCKU

64-BitBrainstorm_2026/AMSS-NCKU

Commit Graph

Branches

Trigger-Discipline

aocc-cuda

aocc-legacy

asc26-plan-a

asc26-plan-b

asc26-temp

baseline

chb-copilot-test

chb-cuda-new

chb-local

chb-new

chb-parallel

chb-rebase-wip

chb-replace

chb-twopunctures

cjy

cjy-cassius

cjy-dystopia

cjy-falcons

cjy-falcons-k3

cjy-goldsteps

cjy-leonhardt

cjy-oneapi

cjy-oneapi-laptop

cjy-oneapi-openmp

cjy-oneapi-opus-hotfix

cjy-oneapi-opus-openmp

cjy-oneapi-opus-preview

cjy-oneapi-opus-rhs-preview

cjy-oneapi-opus-windfall

cjy-oneapi-parallel

cjy-oneapi-preview

cjy-oneapi-test

cjy-spirit

cjy-vitality

gcc-legacy

gpu-maybe-final

hxh-new

hxh-omp

legacy

lzd-cuda

main

main-upstream

oneapi-legacy

yx-fmisc

yx-mpi

yx-prolong

yx-vacation

yx_new_split

23a82d063b Optimize memory access in prolong3 jaunatisblue 2026-03-02 01:16:10 +08:00
672b7ebee2 Modify prolong jaunatisblue 2026-03-02 02:01:07 +08:00
63bf180159 Optimize memory access in prolong3 jaunatisblue 2026-03-02 01:16:10 +08:00
524d1d1512 Merge pull request 'cjy-dystopia' (#2) from cjy-dystopia into main gh0s7 2026-03-01 19:22:09 +08:00
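44efb2e08c Preliminary-round final version v1.0.0: confirmed that PGO and the original load-balancing scheme regress performance in the current version; both reverted CGH0S7 2026-03-01 18:04:25 +08:00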
16013081e0 Optimize symmetry_bd with stride-based fast paths CGH0S7 2026-03-01 15:50:56 +08:00
e7a02e8f72 perf(polint): add uniform-grid fast path for barycentric n=6 CGH0S7 2026-03-01 13:26:39 +08:00
8dad910c6c perf(polint): add switchable barycentric ordn=6 path CGH0S7 2026-03-01 13:20:46 +08:00
01b4cf71d1 perf(polin3): switch to lagrange-weight tensor contraction CGH0S7 2026-03-01 13:04:33 +08:00
66dabe8cc4 perf(polint): add ordn=6 specialized neville path CGH0S7 2026-03-01 12:39:53 +08:00
03416a7b28 perf(polint): add uniform-grid fast path for barycentric n=6 CGH0S7 2026-03-01 13:26:39 +08:00
cca3c16c2b perf(polint): add switchable barycentric ordn=6 path CGH0S7 2026-03-01 13:20:46 +08:00
e5231849ee perf(polin3): switch to lagrange-weight tensor contraction CGH0S7 2026-03-01 13:04:33 +08:00
a766e49ff0 perf(polint): add ordn=6 specialized neville path CGH0S7 2026-03-01 12:39:53 +08:00
19b0e79692 Boss Huang's outrageous rewrite hxh-new wingrew 2026-03-01 05:48:40 +08:00
1a518cd3f6 Optimize average2: use DO CONCURRENT loop form CGH0S7 2026-03-01 00:41:32 +08:00
1dc622e516 Optimize average2: replace array expression with explicit loops CGH0S7 2026-03-01 00:33:01 +08:00
3046a0ccde Optimize prolong3: hoist bounds check out of inner loop CGH0S7 2026-03-01 00:17:30 +08:00
d4ec69c98a Optimize prolong3: replace parity branches with coefficient lookup CGH0S7 2026-02-28 23:59:57 +08:00
2c0a3055d4 Optimize prolong3: precompute coarse index/parity maps CGH0S7 2026-02-28 23:53:30 +08:00
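1eba73acbe Disable core binding for now; speed comparison: no core binding + SCX is faster than core binding + SCX CGH0S7 2026-02-28 23:27:44 +08:00
588fb675a0 Tried splitting into 4 blocks but the results were poor; switching to investigating memory access yx_new_split jaunatisblue 2026-02-28 21:17:02 +08:00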
b91cfff301 Add switchable C RK4 kernel and build toggle CGH0S7 2026-02-28 21:12:19 +08:00
e29ca2dca9 build: switch allocator option to oneTBB tbbmalloc CGH0S7 2026-02-28 17:16:00 +08:00
6493101ca0 bssn_rhs_c: recompute contracted Gamma terms to remove temp arrays CGH0S7 2026-02-28 16:34:23 +08:00
169986cde1 bssn_rhs_c: compute div_beta on-the-fly to remove temp array CGH0S7 2026-02-28 16:25:57 +08:00
1fbc213888 bssn_rhs_c: remove gxx/gyy/gzz temporaries in favor of dxx/dyy/dzz+1 CGH0S7 2026-02-28 15:50:52 +08:00
6024708a48 derivs_c: split low/high stencil regions to reduce branch overhead CGH0S7 2026-02-28 15:42:31 +08:00
abf2f640e4 add fused symmetry packing kernels for orders 2 and 3 in BSSN RHS ianchb 2026-02-28 15:35:14 +08:00
94f40627aa refine GPU dispatch initialization and optimize H2D/D2H data transfers ianchb 2026-02-28 15:23:41 +08:00
bc457d981e bssn_rhs_c: merge lopsided+kodis with shared symmetry buffer CGH0S7 2026-02-28 15:23:01 +08:00
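51dead090e bssn_rhs_c: fuse the two final RHS loops into one, passing the fij intermediates through local variables (Modify 6) CGH0S7 2026-02-28 13:49:45 +08:00
34d6922a66 fdderivs_c: zero only the boundary faces instead of the whole array, reducing wasted memory writes CGH0S7 2026-02-28 13:20:06 +08:00
8010ad27ed kodiss_c: tighten loop bounds to eliminate useless boundary iterations and branch checks CGH0S7 2026-02-28 13:04:21 +08:00
38e691f013 bssn_rhs_c: fuse the Christoffel correction and trK_rhs loops into one (Modify 5) CGH0S7 2026-02-28 12:57:07 +08:00
808387aa11 bssn_rhs_c: fuse the fxx/Gamxa and Gamma_rhs_part2 loops into one (Modify 4) CGH0S7 2026-02-28 11:14:35 +08:00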
d94c31c5c4 [WIP]Implement multi-GPU support in BSSN RHS and add profiling for H2D/D2H transfers ianchb 2026-02-28 01:21:45 +08:00
724e9cd415 [WIP]Add CUDA support for BSSN RHS with new kernel and update makefiles ianchb 2026-02-27 21:46:43 +08:00
c001939461 Add Lagrange interpolation subroutine and update calls in prolongrestrict modules ianchb 2026-02-27 12:46:39 +08:00
94d236385d Revert "skip redundant MPI ghost cell syncs for stages 0, 1 & 2" ianchb 2026-02-26 20:56:41 +08:00
780f1c80d0 skip redundant MPI ghost cell syncs for stages 0, 1 & 2 ianchb 2026-02-26 16:16:33 +08:00
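c2b676abf2 bssn_rhs_c: fuse the A^{ij} index-raising and Gamma_rhs_part1 loops into one (Modify 3) CGH0S7 2026-02-28 11:02:27 +08:00
2c60533501 bssn_rhs_c: fuse the inverse-metric, Gamma-constraint, and Christoffel loops into one (Modify 2) CGH0S7 2026-02-28 10:57:40 +08:00
aabe74c098 Brief attempt at a 4-way split, which ended in failure jaunatisblue 2026-02-28 08:23:30 +08:00
318b5254cc Update the verification script per the organizing committee's email: also check that RMS is below 1.0% for the 3D vector and for each of its three components CGH0S7 2026-02-27 17:38:21 +08:00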
3cee05f262 Merge branch 'cjy-oneapi-opus-hotfix' CGH0S7 2026-02-27 15:13:40 +08:00
e0b5e012df Introduce a PGO-style two-pass build flow to legitimize the Interp_Points load-balancing optimization cjy-oneapi-opus-hotfix CGH0S7 2026-02-27 15:10:22 +08:00
6b2464b80c Interp_Points load balancing: split hot blocks and remap ranks jaunatisblue 2026-02-27 15:07:40 +08:00
9c33e16571 Add PGO files for the C operators CGH0S7 2026-02-27 11:30:36 +08:00
f7ada421cf skip redundant MPI ghost cell syncs for stages 0, 1 & 2 chb-replace ianchb 2026-02-26 16:16:33 +08:00
45b7a43576 Fill in the mathematical differences between the C operators and the Fortran operators CGH0S7 2026-02-26 15:48:11 +08:00
dfb79e3e11 Initialize output arrays to zero in fdderivs_c.C and fderivs_c.C ianchb 2026-02-26 11:48:28 +08:00
fb9f153662 Initialize output arrays to zero in fdderivs_c.C and fderivs_c.C ianchb 2026-02-26 11:48:28 +08:00
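f147f79ffa Change block partitioning: split the blocks owned by heavily loaded ranks and assign the pieces to empty ranks; the empty ranks are obtained by shifting yx-vacation jaunatisblue 2026-02-26 09:40:46 +08:00
d2c2214fa1 Add PGO instrumentation files specific to TwoPunctureABE CGH0S7 2026-02-25 23:06:17 +08:00
e157ea3a23 Merge chb-replace: replace Fortran bssn_rhs with C++ operators, add a fallback switch and separate PGO profdata CGH0S7 2026-02-25 22:50:46 +08:00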
f5a63f1e42 Revert "Fix timing: replace clock() with MPI_Wtime() for wall-clock measurement" ianchb 2026-02-25 22:21:43 +08:00
284ab80baf Remove OpenMP from C rewrite kernel ianchb 2026-02-25 13:15:24 +00:00
09b937c022 Fix timing: replace clock() with MPI_Wtime() for wall-clock measurement copilot-swe-agent[bot] 2026-02-25 12:42:47 +00:00
8a9c775705 Replace Fortran bssn_rhs with C implementation and add C helper kernels wingrew 2026-02-25 18:59:33 +08:00
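d942122043 Update PGO files CGH0S7 2026-02-25 18:25:20 +08:00
a5c713a7e0 Improve the PGO mechanism CGH0S7 2026-02-25 17:22:56 +08:00
9e6b25163a Update PGO profdata and add a PGO_MODE switch for instrumented ABE builds CGH0S7 2026-02-25 17:00:55 +08:00
efc8bf29ea Invalidate the sync cache on demand: Regrid_Onelevel now returns bool CGH0S7 2026-02-25 16:00:26 +08:00
ccf6adaf75 Provide the correct macrodef.h so that LLMs are not misled CGH0S7 2026-02-25 11:47:14 +08:00
e2bc472845 Improve core-binding logic: replace hard-coding with automatic detection CGH0S7 2026-02-25 10:59:32 +08:00
8abac8dd88 Collect per-rank run-time statistics. The two functions are called in different computations, so I timed the actual MPI compute time of each of the two overloads separately. For the first one, where PatList_Interp_Points calls Interp_points, I took the times of the three slowest ranks and found that only one rank is slow each time, Rank [ 52]: Calc 0.000012 s jaunatisblue 2026-02-24 14:33:04 +08:00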
e6329b013d Merge branch 'cjy-oneapi-opus-hotfix' CGH0S7 2026-02-20 14:18:33 +08:00
82339f5282 Merge lopsided advection + kodis dissipation to share symmetry_bd buffer ianchb 2026-02-20 09:45:37 +08:00
94f38c57f9 Don't hardcode pgo profile path ianchb 2026-02-20 08:48:25 +08:00
cc06e30404 Apply async Sync optimization to Z4c_class using Sync_start/finish pattern chb-new ianchb 2026-02-20 09:50:40 +08:00
25c79dc7cd Merge lopsided advection + kodis dissipation to share symmetry_bd buffer ianchb 2026-02-20 09:45:37 +08:00
a725d34dd3 Don't hardcode pgo profile path ianchb 2026-02-20 08:48:25 +08:00
85d1e8de87 Add Intel SIMD vectorization directives to hot-spot functions CGH0S7 2026-02-14 00:43:39 +08:00
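b32675ba99 1. Pass 1 (lines 357-395): iterate over all Patches, compute each block's actual volume including ghost zones, and store it in block_volumes 2. Greedy LPT (lines 397-414): sort by volume in descending order and assign each block in turn to the currently least-loaded rank 3. Pass 2 (lines 416-555): the original block-creation loop, but using assigned_ranks[block_idx++] in place of n_rank++, so each Block gets the correct rank at construction and its memory is allocated on the right process yx-mpi jaunatisblue 2026-02-12 03:22:46 +08:00
93362baee5 Modify transfer jaunatisblue 2026-02-12 00:58:18 +08:00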
2791d2e225 Merge pull request 'PGO updated' (#1) from cjy-oneapi-opus-hotfix into main gh0s7 2026-02-11 19:17:35 +08:00
72ce153e48 Merge cjy-oneapi-opus-hotfix into main CGH0S7 2026-02-11 19:15:12 +08:00
5b7e05cd32 PGO updated CGH0S7 2026-02-11 18:26:30 +08:00
85afe00fc5 Merge plotting optimizations from chb-copilot-test CGH0S7 2026-02-11 16:19:17 +08:00
5c1790277b Replace nested OutBdLow2Hi loops with batch calls in RestrictProlong CGH0S7 2026-02-11 16:09:08 +08:00
714c6e90c6 Add OpenMP parallelization to Fortran compute kernels cjy-oneapi-opus-windfall CGH0S7 2026-02-10 23:40:17 +08:00
caf192b2e4 Remove MPI dependency, replace with single-process stub for non-MPI builds CGH0S7 2026-02-10 22:51:11 +08:00
e09ae438a2 Cache data_packer lengths in Sync_start to skip redundant buffer-size traversals CGH0S7 2026-02-10 21:39:22 +08:00
d06d5b4db8 Add targeted point-to-point Interp_Points overload for surface_integral CGH0S7 2026-02-10 19:18:56 +08:00
8b68b5d782 fixup! Fix load explosion: use subprocess for binary data plots to avoid thread conflict chb-copilot-test ianchb 2026-02-09 22:57:17 +08:00
50e2a845f8 Replace MPI_Allreduce with owner-rank MPI_Bcast in Patch::Interp_Points CGH0S7 2026-02-09 22:39:18 +08:00
738498cb28 Optimize MPI communication in RestrictProlong and surface_integral CGH0S7 2026-02-09 22:07:12 +08:00
dd2443c926 Fix load explosion: use subprocess for binary data plots to avoid thread conflict ianchb 2026-02-09 21:40:27 +08:00
2d7ba5c60c [2/2] Implement multiprocessing-based parallel plotting ianchb 2026-02-09 09:31:55 +00:00
42b9cf1ad9 Optimize MPI Sync with merged transfers, caching, and async overlap CGH0S7 2026-02-09 21:03:37 +08:00
4777cad4ed [1/2] Implement multiprocessing-based parallel plotting ianchb 2026-02-09 15:13:18 +08:00
e9d321fd00 Convert MPI_Allreduce error checks to non-blocking MPI_Iallreduce overlapped with Sync CGH0S7 2026-02-09 12:39:29 +08:00
ed1d86ade9 Merge paired MPI_Allreduce error checks to reduce global sync barriers CGH0S7 2026-02-09 12:12:16 +08:00
471baa5065 PGO supported CGH0S7 2026-02-09 10:59:26 +08:00
86704100ec Only enable OpenMP for TwoPunctures chb-twopunctures ianchb 2026-02-08 13:00:37 +08:00
291d40c04b Use OpenMP's parallel for with schedule(dynamic,1) ianchb 2026-02-07 19:04:51 +08:00
32ed7ec5bd Optimize memory allocation in JFD_times_dv ianchb 2026-02-07 15:55:45 +08:00
c5f8a18ba4 Merge lopsided and kodis to reduce symmetry_bd overhead; gains about 0.01~0.02 s per step jaunatisblue 2026-02-08 23:21:54 +08:00
afd4006da2 Cache GSL in SyncPlan and apply async Sync to Z4c_class ianchb 2026-02-08 08:36:21 +00:00
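
Note on commit b32675ba99: the Greedy LPT step it describes (sort blocks by volume in descending order, then give each one to the currently least-loaded rank) could be sketched roughly as below. This is an illustrative sketch only, not the repository's code. The names block_volumes and assigned_ranks come from the commit message; the function name assign_ranks_lpt, the parameter n_ranks, the integer volume type, and the tie-breaking rules are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <queue>
#include <utility>
#include <vector>

// Greedy LPT (longest processing time first): visit blocks from largest to
// smallest volume and give each one to the rank with the least accumulated load.
// Returns assigned_ranks, where assigned_ranks[i] is the rank chosen for block i
// (indexed in the original block order).
// NOTE: assign_ranks_lpt and n_ranks are hypothetical names for this sketch.
std::vector<int> assign_ranks_lpt(const std::vector<long long>& block_volumes,
                                  int n_ranks) {
    assert(n_ranks > 0);

    // Block indices sorted by descending volume. stable_sort keeps equal
    // volumes in their original order, so the result is deterministic.
    std::vector<std::size_t> order(block_volumes.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) {
                         return block_volumes[a] > block_volumes[b];
                     });

    // Min-heap of (accumulated load, rank); equal loads resolve to the lowest rank.
    using Entry = std::pair<long long, int>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (int r = 0; r < n_ranks; ++r) heap.push({0, r});

    std::vector<int> assigned_ranks(block_volumes.size(), -1);
    for (std::size_t idx : order) {
        Entry least = heap.top();   // currently least-loaded rank
        heap.pop();
        assigned_ranks[idx] = least.second;
        least.first += block_volumes[idx];
        heap.push(least);
    }
    return assigned_ranks;
}
```

In the scheme the commit describes, a second pass would then read assigned_ranks[block_idx++] while constructing each block, so that the block's memory is allocated on the process that will own it.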