Hansung Kim
e4c0bbd039
sgemm: Check-in argument binaries
2025-01-28 16:04:56 -08:00
Hansung Kim
45e9407c99
sgemm: Check-in argument binaries
2025-01-28 15:58:27 -08:00
Hansung Kim
9894efe6c9
Update toolchain env paths for dork
2025-01-28 15:04:14 -08:00
Hansung Kim
5ef4c8023e
sgemm_impl: Disable wmma fast store
...
Doesn't seem to have a big impact on tcore util.
2024-11-11 14:06:15 -08:00
Hansung Kim
7d7cb5f60a
flash: Disable perf loop multiplier
2024-11-10 22:44:02 -08:00
Hansung Kim
4448f31fdc
fence: Fix moving fence to start of loop
...
For unknown reasons, guarding the fence with a tid == 0 branch causes a
TL source ID re-used assertion. Just call the fence from all
thread/warps as a workaround. At least, all threads in a warp will
coalesce into one request.
2024-11-09 22:04:45 -08:00
Hansung Kim
cb916ead39
Fix potential bitwidth bug in compute API
2024-11-09 20:59:58 -08:00
Hansung Kim
68054689c9
flash: Move fence to start of loop; wrap all MMIO in one tid=0 branch
2024-11-09 20:59:26 -08:00
Hansung Kim
fcd8b0b892
flash: Disable rescale flag check
...
GEMM-II finishes much earlier than softmax for this to be a problem.
2024-11-09 20:37:58 -08:00
Hansung Kim
1c9b022156
flash: Rename nowarpspec to default
2024-11-09 19:58:45 -08:00
Hansung Kim
8fe6d918f2
flash: Update tcore kernel to use new CISC
2024-11-09 19:49:20 -08:00
Hansung Kim
76a6aaf085
flash: doc update
2024-11-09 19:09:09 -08:00
Hansung Kim
673e07ed43
flash: Add non-warp-specialized gemmini flash kernel
2024-11-09 19:08:39 -08:00
Hansung Kim
ac42f2dbba
sgemm_gemmini_dma: Update with new compute API
2024-11-09 16:49:39 -08:00
Hansung Kim
ad75561efe
flash: Reduce fence calls to improve util
2024-11-09 16:44:17 -08:00
Hansung Kim
6990fcc1e6
Add compute-and-mvout-to-spad API
2024-11-09 16:43:45 -08:00
Hansung Kim
952b8debbb
flash: Update to use new CISC interface
2024-11-09 16:21:34 -08:00
Hansung Kim
dc89309ad0
Merge branch 'kernels-flash' into new-cisc
2024-11-09 14:42:46 -08:00
Hansung Kim
365b1d8e67
flash: Add begin end markers
2024-11-09 10:19:16 -08:00
Hansung Kim
1e3d476e70
Switch header configs to flash
2024-11-08 21:56:42 -08:00
Richard Yan
c114a7a4ab
new gemm kernel
2024-11-08 20:55:27 -08:00
Hansung Kim
4e087a8aab
flash: Fix loop iteration for gemmini
...
Kernel is software-pipelined around 2 GEMMs and softmax; it requires two
iterations to fully complete a tile.
2024-11-08 16:43:08 -08:00
Hansung Kim
4055255018
flash: Fix tcore kernel for CISC arg field changes
2024-11-08 16:40:16 -08:00
Hansung Kim
c001618fb9
sgemm_impl: Fix wrong next block_m logic for DMA
2024-10-29 22:35:56 -07:00
Hansung Kim
21b6655c10
sgemm_impl: Implement fast coalesced wmma_store
...
Enables a fairer comparison between core-coupled tensor core to Hopper
tensor core, where the latter benefits from coalesced full-throughput
moveout to GMEM because it does not use the 1x2 interleaved register
mapping. This means the result matrix will be stored swizzled in the
GMEM, without breaking correctness.
2024-10-29 22:34:22 -07:00
Hansung Kim
6b39a6fe70
Add convenience script for switching input/args binaries
2024-10-29 20:14:33 -07:00
Hansung Kim
8dadbdd42d
tensor: Do DMA mvin for next m/n loop at the last k iter
...
This increases util by pulling the DMA wait time out of the K-loop
wraparound (next N) and overlapping it with the last K iter.
2024-10-29 19:43:22 -07:00
Hansung Kim
367fa927f8
sgemm_impl: Fix default FP_SIZE to 16
2024-10-29 14:56:18 -07:00
Hansung Kim
e1b0fc3944
generate_matrix.py: Rand [0,1); also save non-swizzled row-major B
2024-10-29 14:55:32 -07:00
Hansung Kim
24064dc7a2
sgemm_impl: Do proper addr gen and store for wgmma
2024-10-29 01:31:55 -07:00
Hansung Kim
bd7a8e39b9
sgemm_impl: Split out smem addr gen to functions
...
so that the addr gen code can also be used for wgmma.
2024-10-29 01:30:48 -07:00
Hansung Kim
ae98ae6e93
sgemm_tcore: Fix DMA smem addresses, add markers
...
Take into account that DMA writes B tiles starting from the end of the
quartile.
2024-10-28 17:26:07 -07:00
Hansung Kim
b4dadfaf61
Merge remote-tracking branch 'origin/kernels' into kernels-hopper
2024-10-28 14:25:18 -07:00
Richard Yan
fd1c9f4729
update gemmini dma kernel
2024-10-28 13:47:13 -07:00
Hansung Kim
d0421426be
sgemm_tcore: Hardcode CISC spadQuartile addresses
2024-10-28 12:49:22 -07:00
Hansung Kim
e55c8b480e
sgemm_impl: Comment out GEMMINI_DMA code in single_tile
...
This is already done in the higher-level thread_block_gemm function, and
flash also has explicit DMA sync code. Also having this executed twice
sometimes triggers vx_bar movement into a branch which we really want to
avoid.
2024-10-28 12:47:49 -07:00
Hansung Kim
36eb50060f
sgemm_impl: Add skeleton wgmma routine for single_tile
2024-10-28 12:47:20 -07:00
Hansung Kim
e8a943e893
Push golden sgemm_tcore kernels
2024-10-27 19:40:08 -07:00
Richard Yan
379d863456
Merge branch 'kernels' of https://github.com/hansungk/vortex-private into kernels
2024-10-24 17:31:01 -07:00
Richard Yan
fb928b5cda
new unaligned access kernel, update idle kernel
2024-10-24 17:28:59 -07:00
Richard Yan
8cc0c3bae4
fp16 no dma kernel
2024-10-24 17:12:34 -07:00
Richard Yan
04a64dee7a
fp16 dma kernel
2024-10-24 17:12:02 -07:00
Hansung Kim
6417a625b1
sgemm_impl: Add tiling params for hopper tensor core
2024-10-23 19:50:18 -07:00
Hansung Kim
68cd6455fe
sgemm_impl: Add mmio reconverge barrier to avoid slip-off; switch to FP32
2024-10-02 15:17:44 -07:00
Hansung Kim
34d0956cd5
tensor: Attempt row-major mapping for C store (WIP)
...
Doesn't work because 1x2 jagged mapping is required to achieve
throughput for storing the bigger C matrix (2x4, vs. 2x2 in A).
2024-10-02 15:14:55 -07:00
Hansung Kim
3490294626
generate_matrix.py: switch to fp16 rand, generate row-major A
2024-10-02 11:01:23 -07:00
Hansung Kim
db2789bf23
Add asm label for cisc compute
2024-10-02 10:59:14 -07:00
Hansung Kim
6f6ee5616f
Add convergent attribute to vx_barrier
...
Note this attribute is only supported by Clang, so this will only be
applied to the kernel binary but not runtime.
2024-10-02 10:57:45 -07:00
Hansung Kim
221d5f75c2
flash: Optimize smem alloc for tcore for 8banks
...
Divide into first half & last half for warpgroup 0 & 1, and
allocate Q/K and P/V in different banks for parallel acccess.
2024-09-19 21:31:39 -07:00
Hansung Kim
d0ef06cec1
flash: Complete Q_IS_K_MAJOR code for GEMM II
2024-09-19 20:36:03 -07:00