Files
kernels/tests
Hansung Kim 21b6655c10 sgemm_impl: Implement fast coalesced wmma_store
Enables a fairer comparison between the core-coupled tensor core and the
Hopper tensor core. The latter benefits from coalesced, full-throughput
moveout to GMEM because it does not use the 1x2 interleaved register
mapping.  This means the result matrix is stored swizzled in GMEM,
without breaking correctness.
2024-10-29 22:34:22 -07:00
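The trade-off described in the commit message can be illustrated with a small model. This is a sketch under stated assumptions, not the actual `wmma_store` code: the tile size, the thread count, and the `frag_coord` mapping are hypothetical stand-ins for a 1x2 interleaved register mapping. It shows that a layout-preserving store issues strided addresses, while a coalesced store writes contiguous addresses and leaves the tile permuted (swizzled) in memory. Since the swizzle is a permutation, the row-major tile can be recovered afterwards.

```python
# Toy model only: all sizes and the register mapping are assumptions.
TILE = 8                          # hypothetical 8x8 result tile
THREADS = 8                       # hypothetical number of threads
REGS = TILE * TILE // THREADS     # registers per thread (8)


def frag_coord(t, r):
    """Hypothetical 1x2 interleaved mapping: (thread, register) -> (row, col).

    Each register pair (r, r+1) of a thread holds two adjacent columns,
    and neighbouring threads are interleaved two columns apart.
    """
    row = (t // 4) + 2 * (r // 2)
    col = 2 * (t % 4) + (r % 2)
    return row, col


def load_fragments(tile):
    """Distribute a row-major tile into per-thread register fragments."""
    frags = [[0] * REGS for _ in range(THREADS)]
    for t in range(THREADS):
        for r in range(REGS):
            row, col = frag_coord(t, r)
            frags[t][r] = tile[row * TILE + col]
    return frags


def store_row_major(frags):
    """Layout-preserving store: each element goes to its row-major address."""
    mem = [None] * (TILE * TILE)
    steps = []  # addresses written by all threads for each register index
    for r in range(REGS):
        addrs = []
        for t in range(THREADS):
            row, col = frag_coord(t, r)
            mem[row * TILE + col] = frags[t][r]
            addrs.append(row * TILE + col)
        steps.append(addrs)
    return mem, steps


def store_coalesced(frags):
    """Coalesced store: thread t writes register r to address r*THREADS + t."""
    mem = [None] * (TILE * TILE)
    steps = []
    for r in range(REGS):
        addrs = []
        for t in range(THREADS):
            mem[r * THREADS + t] = frags[t][r]
            addrs.append(r * THREADS + t)
        steps.append(addrs)
    return mem, steps


def unswizzle(mem):
    """Invert the coalesced store to recover the row-major tile."""
    out = [None] * (TILE * TILE)
    for r in range(REGS):
        for t in range(THREADS):
            row, col = frag_coord(t, r)
            out[row * TILE + col] = mem[r * THREADS + t]
    return out


def is_contiguous(addrs):
    """True if the addresses written in one step are consecutive."""
    return all(b - a == 1 for a, b in zip(addrs, addrs[1:]))
```

In this model, comparing a result against a reference requires either unswizzling it first or swizzling the reference the same way.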