Commit Graph

  • 23a82d063b Optimize memory access in prolong3 jaunatisblue 2026-03-02 01:16:10 +08:00
  • 672b7ebee2 Modify prolong jaunatisblue 2026-03-02 02:01:07 +08:00
  • 63bf180159 Optimize memory access in prolong3 jaunatisblue 2026-03-02 01:16:10 +08:00
  • 524d1d1512 Merge pull request 'cjy-dystopia' (#2) from cjy-dystopia into main gh0s7 2026-03-01 19:22:09 +08:00
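  • 44efb2e08c Preliminary-round final version v1.0.0: PGO and the original load-balancing scheme were confirmed to cause regressions in the current version and have been reverted CGH0S7 2026-03-01 18:04:25 +08:00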
  • 16013081e0 Optimize symmetry_bd with stride-based fast paths CGH0S7 2026-03-01 15:50:56 +08:00
  • e7a02e8f72 perf(polint): add uniform-grid fast path for barycentric n=6 CGH0S7 2026-03-01 13:26:39 +08:00
  • 8dad910c6c perf(polint): add switchable barycentric ordn=6 path CGH0S7 2026-03-01 13:20:46 +08:00
  • 01b4cf71d1 perf(polin3): switch to lagrange-weight tensor contraction CGH0S7 2026-03-01 13:04:33 +08:00
  • 66dabe8cc4 perf(polint): add ordn=6 specialized neville path CGH0S7 2026-03-01 12:39:53 +08:00
  • 03416a7b28 perf(polint): add uniform-grid fast path for barycentric n=6 CGH0S7 2026-03-01 13:26:39 +08:00
  • cca3c16c2b perf(polint): add switchable barycentric ordn=6 path CGH0S7 2026-03-01 13:20:46 +08:00
  • e5231849ee perf(polin3): switch to lagrange-weight tensor contraction CGH0S7 2026-03-01 13:04:33 +08:00
  • a766e49ff0 perf(polint): add ordn=6 specialized neville path CGH0S7 2026-03-01 12:39:53 +08:00
  • 19b0e79692 Boss Huang's epic rewrite hxh-new wingrew 2026-03-01 05:48:40 +08:00
  • 1a518cd3f6 Optimize average2: use DO CONCURRENT loop form CGH0S7 2026-03-01 00:41:32 +08:00
  • 1dc622e516 Optimize average2: replace array expression with explicit loops CGH0S7 2026-03-01 00:33:01 +08:00
  • 3046a0ccde Optimize prolong3: hoist bounds check out of inner loop CGH0S7 2026-03-01 00:17:30 +08:00
  • d4ec69c98a Optimize prolong3: replace parity branches with coefficient lookup CGH0S7 2026-02-28 23:59:57 +08:00
  • 2c0a3055d4 Optimize prolong3: precompute coarse index/parity maps CGH0S7 2026-02-28 23:53:30 +08:00
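  • 1eba73acbe Disable core pinning for now; speed comparison: no pinning + SCX > pinning + SCX CGH0S7 2026-02-28 23:27:44 +08:00
  • 588fb675a0 Tried splitting into 4 blocks but results were poor; switching to investigating memory access yx_new_split jaunatisblue 2026-02-28 21:17:02 +08:00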
  • b91cfff301 Add switchable C RK4 kernel and build toggle CGH0S7 2026-02-28 21:12:19 +08:00
  • e29ca2dca9 build: switch allocator option to oneTBB tbbmalloc CGH0S7 2026-02-28 17:16:00 +08:00
  • 6493101ca0 bssn_rhs_c: recompute contracted Gamma terms to remove temp arrays CGH0S7 2026-02-28 16:34:23 +08:00
  • 169986cde1 bssn_rhs_c: compute div_beta on-the-fly to remove temp array CGH0S7 2026-02-28 16:25:57 +08:00
  • 1fbc213888 bssn_rhs_c: remove gxx/gyy/gzz temporaries in favor of dxx/dyy/dzz+1 CGH0S7 2026-02-28 15:50:52 +08:00
  • 6024708a48 derivs_c: split low/high stencil regions to reduce branch overhead CGH0S7 2026-02-28 15:42:31 +08:00
  • abf2f640e4 add fused symmetry packing kernels for orders 2 and 3 in BSSN RHS ianchb 2026-02-28 15:35:14 +08:00
  • 94f40627aa refine GPU dispatch initialization and optimize H2D/D2H data transfers ianchb 2026-02-28 15:23:41 +08:00
  • bc457d981e bssn_rhs_c: merge lopsided+kodis with shared symmetry buffer CGH0S7 2026-02-28 15:23:01 +08:00
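  • 51dead090e bssn_rhs_c: fuse the two final RHS loops into one, passing fij intermediates through local variables (Modify 6) CGH0S7 2026-02-28 13:49:45 +08:00
  • 34d6922a66 fdderivs_c: zero only the boundary faces instead of the whole array, reducing wasted memory writes CGH0S7 2026-02-28 13:20:06 +08:00
  • 8010ad27ed kodiss_c: tighten loop bounds to eliminate useless boundary iterations and branch checks CGH0S7 2026-02-28 13:04:21 +08:00
  • 38e691f013 bssn_rhs_c: fuse the Christoffel correction and trK_rhs loops into one (Modify 5) CGH0S7 2026-02-28 12:57:07 +08:00
  • 808387aa11 bssn_rhs_c: fuse the fxx/Gamxa and Gamma_rhs_part2 loops into one (Modify 4) CGH0S7 2026-02-28 11:14:35 +08:00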
  • d94c31c5c4 [WIP]Implement multi-GPU support in BSSN RHS and add profiling for H2D/D2H transfers ianchb 2026-02-28 01:21:45 +08:00
  • 724e9cd415 [WIP]Add CUDA support for BSSN RHS with new kernel and update makefiles ianchb 2026-02-27 21:46:43 +08:00
  • c001939461 Add Lagrange interpolation subroutine and update calls in prolongrestrict modules ianchb 2026-02-27 12:46:39 +08:00
  • 94d236385d Revert "skip redundant MPI ghost cell syncs for stages 0, 1 & 2" ianchb 2026-02-26 20:56:41 +08:00
  • 780f1c80d0 skip redundant MPI ghost cell syncs for stages 0, 1 & 2 ianchb 2026-02-26 16:16:33 +08:00
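  • c2b676abf2 bssn_rhs_c: fuse the A^{ij} index-raising and Gamma_rhs_part1 loops into one (Modify 3) CGH0S7 2026-02-28 11:02:27 +08:00
  • 2c60533501 bssn_rhs_c: fuse the inverse-metric, Gamma-constraint, and Christoffel loops into one (Modify 2) CGH0S7 2026-02-28 10:57:40 +08:00
  • aabe74c098 Brief attempt at a 4-way split, which ended in failure jaunatisblue 2026-02-28 08:23:30 +08:00
  • 318b5254cc Update the check script per the organizing committee's email: additionally verify that RMS is below 1.0% for the 3D vector and for each of its three components CGH0S7 2026-02-27 17:38:21 +08:00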
  • 3cee05f262 Merge branch 'cjy-oneapi-opus-hotfix' CGH0S7 2026-02-27 15:13:40 +08:00
  • e0b5e012df Introduce a PGO-style two-pass build flow to legitimize the Interp_Points load-balancing optimization cjy-oneapi-opus-hotfix CGH0S7 2026-02-27 15:10:22 +08:00
  • 6b2464b80c Interp_Points load balancing: hot-block splitting and rank remapping jaunatisblue 2026-02-27 15:07:40 +08:00
  • 9c33e16571 Add PGO files for the C kernels CGH0S7 2026-02-27 11:30:36 +08:00
  • f7ada421cf skip redundant MPI ghost cell syncs for stages 0, 1 & 2 chb-replace ianchb 2026-02-26 16:16:33 +08:00
  • 45b7a43576 Fill in the mathematical differences between the C and Fortran kernels CGH0S7 2026-02-26 15:48:11 +08:00
  • dfb79e3e11 Initialize output arrays to zero in fdderivs_c.C and fderivs_c.C ianchb 2026-02-26 11:48:28 +08:00
  • fb9f153662 Initialize output arrays to zero in fdderivs_c.C and fderivs_c.C ianchb 2026-02-26 11:48:28 +08:00
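  • f147f79ffa Change block partitioning: split the blocks on heavily loaded ranks and assign the pieces to empty ranks; the empty ranks are obtained by shifting yx-vacation jaunatisblue 2026-02-26 09:40:46 +08:00
  • d2c2214fa1 Add PGO instrumentation files dedicated to TwoPunctureABE CGH0S7 2026-02-25 23:06:17 +08:00
  • e157ea3a23 Merge chb-replace: replace Fortran bssn_rhs with C++ kernels, add a fallback switch and separate PGO profdata CGH0S7 2026-02-25 22:50:46 +08:00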
  • f5a63f1e42 Revert "Fix timing: replace clock() with MPI_Wtime() for wall-clock measurement" ianchb 2026-02-25 22:21:43 +08:00
  • 284ab80baf Remove OpenMP from C rewrite kernel ianchb 2026-02-25 13:15:24 +00:00
  • 09b937c022 Fix timing: replace clock() with MPI_Wtime() for wall-clock measurement copilot-swe-agent[bot] 2026-02-25 12:42:47 +00:00
  • 8a9c775705 Replace Fortran bssn_rhs with C implementation and add C helper kernels wingrew 2026-02-25 18:59:33 +08:00
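  • d942122043 Update PGO files CGH0S7 2026-02-25 18:25:20 +08:00
  • a5c713a7e0 Refine the PGO mechanism CGH0S7 2026-02-25 17:22:56 +08:00
  • 9e6b25163a Update PGO profdata and add a PGO_MODE switch for ABE instrumented builds CGH0S7 2026-02-25 17:00:55 +08:00
  • efc8bf29ea Invalidate the sync cache on demand: Regrid_Onelevel now returns bool CGH0S7 2026-02-25 16:00:26 +08:00
  • ccf6adaf75 Provide the correct macrodef.h so LLMs are not misled CGH0S7 2026-02-25 11:47:14 +08:00
  • e2bc472845 Improve core-pinning logic: replace hardcoding with automatic detection CGH0S7 2026-02-25 10:59:32 +08:00
  • 8abac8dd88 Collect per-rank runtime statistics. The two functions are called in different computations, so I measured the actual MPI compute time of the two overloads separately. For the first one, where PatList_Interp_Points calls Interp_points, I took the top three rank times and found that only one rank takes noticeably long each time, Rank [ 52]: Calc 0.000012 s jaunatisblue 2026-02-24 14:33:04 +08:00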
  • e6329b013d Merge branch 'cjy-oneapi-opus-hotfix' CGH0S7 2026-02-20 14:18:33 +08:00
  • 82339f5282 Merge lopsided advection + kodis dissipation to share symmetry_bd buffer ianchb 2026-02-20 09:45:37 +08:00
  • 94f38c57f9 Don't hardcode pgo profile path ianchb 2026-02-20 08:48:25 +08:00
  • cc06e30404 Apply async Sync optimization to Z4c_class using Sync_start/finish pattern chb-new ianchb 2026-02-20 09:50:40 +08:00
  • 25c79dc7cd Merge lopsided advection + kodis dissipation to share symmetry_bd buffer ianchb 2026-02-20 09:45:37 +08:00
  • a725d34dd3 Don't hardcode pgo profile path ianchb 2026-02-20 08:48:25 +08:00
  • 85d1e8de87 Add Intel SIMD vectorization directives to hot-spot functions CGH0S7 2026-02-14 00:43:39 +08:00
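  • b32675ba99 1. Pass 1 (lines 357-395): iterate over all Patches, compute each block's actual volume including ghost zones, and store it in block_volumes 2. Greedy LPT (lines 397-414): sort by volume in descending order and assign each block in turn to the rank with the smallest current load 3. Pass 2 (lines 416-555): the original block-creation loop, but using assigned_ranks[block_idx++] instead of n_rank++, so each Block gets the correct rank at construction and its memory is allocated on the right process yx-mpi jaunatisblue 2026-02-12 03:22:46 +08:00
  • 93362baee5 Modify transfer jaunatisblue 2026-02-12 00:58:18 +08:00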
  • 2791d2e225 Merge pull request 'PGO updated' (#1) from cjy-oneapi-opus-hotfix into main gh0s7 2026-02-11 19:17:35 +08:00
  • 72ce153e48 Merge cjy-oneapi-opus-hotfix into main CGH0S7 2026-02-11 19:15:12 +08:00
  • 5b7e05cd32 PGO updated CGH0S7 2026-02-11 18:26:30 +08:00
  • 85afe00fc5 Merge plotting optimizations from chb-copilot-test CGH0S7 2026-02-11 16:19:17 +08:00
  • 5c1790277b Replace nested OutBdLow2Hi loops with batch calls in RestrictProlong CGH0S7 2026-02-11 16:09:08 +08:00
  • 714c6e90c6 Add OpenMP parallelization to Fortran compute kernels cjy-oneapi-opus-windfall CGH0S7 2026-02-10 23:40:17 +08:00
  • caf192b2e4 Remove MPI dependency, replace with single-process stub for non-MPI builds CGH0S7 2026-02-10 22:51:11 +08:00
  • e09ae438a2 Cache data_packer lengths in Sync_start to skip redundant buffer-size traversals CGH0S7 2026-02-10 21:39:22 +08:00
  • d06d5b4db8 Add targeted point-to-point Interp_Points overload for surface_integral CGH0S7 2026-02-10 19:18:56 +08:00
  • 8b68b5d782 fixup! Fix load explosion: use subprocess for binary data plots to avoid thread conflict chb-copilot-test ianchb 2026-02-09 22:57:17 +08:00
  • 50e2a845f8 Replace MPI_Allreduce with owner-rank MPI_Bcast in Patch::Interp_Points CGH0S7 2026-02-09 22:39:18 +08:00
  • 738498cb28 Optimize MPI communication in RestrictProlong and surface_integral CGH0S7 2026-02-09 22:07:12 +08:00
  • dd2443c926 Fix load explosion: use subprocess for binary data plots to avoid thread conflict ianchb 2026-02-09 21:40:27 +08:00
  • 2d7ba5c60c [2/2] Implement multiprocessing-based parallel plotting ianchb 2026-02-09 09:31:55 +00:00
  • 42b9cf1ad9 Optimize MPI Sync with merged transfers, caching, and async overlap CGH0S7 2026-02-09 21:03:37 +08:00
  • 4777cad4ed [1/2] Implement multiprocessing-based parallel plotting ianchb 2026-02-09 15:13:18 +08:00
  • e9d321fd00 Convert MPI_Allreduce error checks to non-blocking MPI_Iallreduce overlapped with Sync CGH0S7 2026-02-09 12:39:29 +08:00
  • ed1d86ade9 Merge paired MPI_Allreduce error checks to reduce global sync barriers CGH0S7 2026-02-09 12:12:16 +08:00
  • 471baa5065 PGO supported CGH0S7 2026-02-09 10:59:26 +08:00
  • 86704100ec Only enable OpenMP for TwoPunctures chb-twopunctures ianchb 2026-02-08 13:00:37 +08:00
  • 291d40c04b Use OpenMP's parallel for with schedule(dynamic,1) ianchb 2026-02-07 19:04:51 +08:00
  • 32ed7ec5bd Optimize memory allocation in JFD_times_dv ianchb 2026-02-07 15:55:45 +08:00
  • c5f8a18ba4 Merge lopsided and kodis to reduce symmetry_bd overhead, saving 0.01~0.02 s per step jaunatisblue 2026-02-08 23:21:54 +08:00
  • afd4006da2 Cache GSL in SyncPlan and apply async Sync to Z4c_class ianchb 2026-02-08 08:36:21 +00:00