Commit Graph

2688 Commits

Author SHA1 Message Date
Hansung Kim
8fe6d918f2 flash: Update tcore kernel to use new CISC 2024-11-09 19:49:20 -08:00
Hansung Kim
76a6aaf085 flash: doc update 2024-11-09 19:09:09 -08:00
Hansung Kim
673e07ed43 flash: Add non-warp-specialized gemmini flash kernel 2024-11-09 19:08:39 -08:00
Hansung Kim
ac42f2dbba sgemm_gemmini_dma: Update with new compute API 2024-11-09 16:49:39 -08:00
Hansung Kim
ad75561efe flash: Reduce fence calls to improve util 2024-11-09 16:44:17 -08:00
Hansung Kim
6990fcc1e6 Add compute-and-mvout-to-spad API 2024-11-09 16:43:45 -08:00
Hansung Kim
952b8debbb flash: Update to use new CISC interface 2024-11-09 16:21:34 -08:00
Hansung Kim
dc89309ad0 Merge branch 'kernels-flash' into new-cisc 2024-11-09 14:42:46 -08:00
Hansung Kim
365b1d8e67 flash: Add begin end markers 2024-11-09 10:19:16 -08:00
Hansung Kim
1e3d476e70 Switch header configs to flash 2024-11-08 21:56:42 -08:00
Richard Yan
c114a7a4ab new gemm kernel 2024-11-08 20:55:27 -08:00
Hansung Kim
4e087a8aab flash: Fix loop iteration for gemmini
Kernel is software-pipelined around 2 GEMMs and softmax; it requires two
iterations to fully complete a tile.
2024-11-08 16:43:08 -08:00
Hansung Kim
4055255018 flash: Fix tcore kernel for CISC arg field changes 2024-11-08 16:40:16 -08:00
Hansung Kim
c001618fb9 sgemm_impl: Fix wrong next block_m logic for DMA 2024-10-29 22:35:56 -07:00
Hansung Kim
21b6655c10 sgemm_impl: Implement fast coalesced wmma_store
Enables a fairer comparison between core-coupled tensor core to Hopper
tensor core, where the latter benefits from coalesced full-throughput
moveout to GMEM because it does not use the 1x2 interleaved register
mapping.  This means the result matrix will be stored swizzled in the
GMEM, without breaking correctness.
2024-10-29 22:34:22 -07:00
Hansung Kim
6b39a6fe70 Add convenience script for switching input/args binaries 2024-10-29 20:14:33 -07:00
Hansung Kim
8dadbdd42d tensor: Do DMA mvin for next m/n loop at the last k iter
This increases util by pulling the DMA wait time out of the K-loop
wraparound (next N) and overlapping it with the last K iter.
2024-10-29 19:43:22 -07:00
Hansung Kim
367fa927f8 sgemm_impl: Fix default FP_SIZE to 16 2024-10-29 14:56:18 -07:00
Hansung Kim
e1b0fc3944 generate_matrix.py: Rand [0,1); also save non-swizzled row-major B 2024-10-29 14:55:32 -07:00
Hansung Kim
24064dc7a2 sgemm_impl: Do proper addr gen and store for wgmma 2024-10-29 01:31:55 -07:00
Hansung Kim
bd7a8e39b9 sgemm_impl: Split out smem addr gen to functions
so that the addr gen code can also be used for wgmma.
2024-10-29 01:30:48 -07:00
Hansung Kim
ae98ae6e93 sgemm_tcore: Fix DMA smem addresses, add markers
Take into account that DMA writes B tiles starting from the end of the
quartile.
2024-10-28 17:26:07 -07:00
Hansung Kim
b4dadfaf61 Merge remote-tracking branch 'origin/kernels' into kernels-hopper 2024-10-28 14:25:18 -07:00
Richard Yan
fd1c9f4729 update gemmini dma kernel 2024-10-28 13:47:13 -07:00
Hansung Kim
d0421426be sgemm_tcore: Hardcode CISC spadQuartile addresses 2024-10-28 12:49:22 -07:00
Hansung Kim
e55c8b480e sgemm_impl: Comment out GEMMINI_DMA code in single_tile
This is already done in the higher-level thread_block_gemm function, and
flash also has explicit DMA sync code.  Also having this executed twice
sometimes triggers vx_bar movement into a branch which we really want to
avoid.
2024-10-28 12:47:49 -07:00
Hansung Kim
36eb50060f sgemm_impl: Add skeleton wgmma routine for single_tile 2024-10-28 12:47:20 -07:00
Hansung Kim
e8a943e893 Push golden sgemm_tcore kernels 2024-10-27 19:40:08 -07:00
Richard Yan
379d863456 Merge branch 'kernels' of https://github.com/hansungk/vortex-private into kernels 2024-10-24 17:31:01 -07:00
Richard Yan
fb928b5cda new unaligned access kernel, update idle kernel 2024-10-24 17:28:59 -07:00
Richard Yan
8cc0c3bae4 fp16 no dma kernel 2024-10-24 17:12:34 -07:00
Richard Yan
04a64dee7a fp16 dma kernel 2024-10-24 17:12:02 -07:00
Hansung Kim
6417a625b1 sgemm_impl: Add tiling params for hopper tensor core 2024-10-23 19:50:18 -07:00
Hansung Kim
68cd6455fe sgemm_impl: Add mmio reconverge barrier to avoid slip-off; switch to FP32 2024-10-02 15:17:44 -07:00
Hansung Kim
34d0956cd5 tensor: Attempt row-major mapping for C store (WIP)
Doesn't work because 1x2 jagged mapping is required to achieve
throughput for storing the bigger C matrix (2x4, vs. 2x2 in A).
2024-10-02 15:14:55 -07:00
Hansung Kim
3490294626 generate_matrix.py: switch to fp16 rand, generate row-major A 2024-10-02 11:01:23 -07:00
Hansung Kim
db2789bf23 Add asm label for cisc compute 2024-10-02 10:59:14 -07:00
Hansung Kim
6f6ee5616f Add convergent attribute to vx_barrier
Note this attribute is only supported by Clang, so this will only be
applied to the kernel binary but not runtime.
2024-10-02 10:57:45 -07:00
Hansung Kim
221d5f75c2 flash: Optimize smem alloc for tcore for 8banks
Divide into first half & last half for warpgroup 0 & 1, and
allocate Q/K and P/V in different banks for parallel acccess.
2024-09-19 21:31:39 -07:00
Hansung Kim
d0ef06cec1 flash: Complete Q_IS_K_MAJOR code for GEMM II 2024-09-19 20:36:03 -07:00
Hansung Kim
b9cafd6372 idle: unused const 2024-09-18 18:10:29 -07:00
Hansung Kim
be15cffbf3 flash: Revert to gemmini config, remove DEBUG and unnecessary checks 2024-09-12 14:25:33 -07:00
Hansung Kim
b5916f3f07 flash: Fix hardcoded barrier for tcore; move tcore-specific flags 2024-09-11 22:08:44 -07:00
Hansung Kim
d69707f686 flash: Enable GEMM II fence; Pull 1st KV move out of the loop 2024-09-11 19:24:06 -07:00
Hansung Kim
18cf0e73cd flash: Add early return for warp-indivisible row iter 2024-09-11 00:56:09 -07:00
Hansung Kim
068d48534e flash: Swap S1/S0 to avoid GEMM II - softmax bank conflict
+ remove spurrious fences to better overlap GEMM I and DMA
2024-09-11 00:55:36 -07:00
Hansung Kim
ba66d2c2bd sgemm_impl: barrier dumb dumb 2024-09-11 00:01:56 -07:00
Hansung Kim
dc746272fb flash: Conditionally enable GEMM II fence code, fix tile_k for DEBUG 2024-09-10 22:53:35 -07:00
Hansung Kim
28b2eaec8f sgemm_gemmini_dma: Fix tile size to (128,64,128) 2024-09-10 18:29:40 -07:00
Hansung Kim
2152c80ffd sgemm_impl: Add missing reconvergence barrier after mmio 2024-09-10 18:05:01 -07:00