Richard Yan
c114a7a4ab
new gemm kernel
2024-11-08 20:55:27 -08:00
Hansung Kim
4e087a8aab
flash: Fix loop iteration for gemmini
...
Kernel is software-pipelined around 2 GEMMs and softmax; it requires two
iterations to fully complete a tile.
2024-11-08 16:43:08 -08:00
Hansung Kim
4055255018
flash: Fix tcore kernel for CISC arg field changes
2024-11-08 16:40:16 -08:00
Hansung Kim
c001618fb9
sgemm_impl: Fix wrong next block_m logic for DMA
2024-10-29 22:35:56 -07:00
Hansung Kim
21b6655c10
sgemm_impl: Implement fast coalesced wmma_store
...
Enables a fairer comparison between core-coupled tensor core to Hopper
tensor core, where the latter benefits from coalesced full-throughput
moveout to GMEM because it does not use the 1x2 interleaved register
mapping. This means the result matrix will be stored swizzled in the
GMEM, without breaking correctness.
2024-10-29 22:34:22 -07:00
Hansung Kim
6b39a6fe70
Add convenience script for switching input/args binaries
2024-10-29 20:14:33 -07:00
Hansung Kim
8dadbdd42d
tensor: Do DMA mvin for next m/n loop at the last k iter
...
This increases util by pulling the DMA wait time out of the K-loop
wraparound (next N) and overlapping it with the last K iter.
2024-10-29 19:43:22 -07:00
Hansung Kim
367fa927f8
sgemm_impl: Fix default FP_SIZE to 16
2024-10-29 14:56:18 -07:00
Hansung Kim
e1b0fc3944
generate_matrix.py: Rand [0,1); also save non-swizzled row-major B
2024-10-29 14:55:32 -07:00
Hansung Kim
24064dc7a2
sgemm_impl: Do proper addr gen and store for wgmma
2024-10-29 01:31:55 -07:00
Hansung Kim
bd7a8e39b9
sgemm_impl: Split out smem addr gen to functions
...
so that the addr gen code can also be used for wgmma.
2024-10-29 01:30:48 -07:00
Hansung Kim
ae98ae6e93
sgemm_tcore: Fix DMA smem addresses, add markers
...
Take into account that DMA writes B tiles starting from the end of the
quartile.
2024-10-28 17:26:07 -07:00
Hansung Kim
b4dadfaf61
Merge remote-tracking branch 'origin/kernels' into kernels-hopper
2024-10-28 14:25:18 -07:00
Richard Yan
fd1c9f4729
update gemmini dma kernel
2024-10-28 13:47:13 -07:00
Hansung Kim
d0421426be
sgemm_tcore: Hardcode CISC spadQuartile addresses
2024-10-28 12:49:22 -07:00
Hansung Kim
e55c8b480e
sgemm_impl: Comment out GEMMINI_DMA code in single_tile
...
This is already done in the higher-level thread_block_gemm function, and
flash also has explicit DMA sync code. Also having this executed twice
sometimes triggers vx_bar movement into a branch which we really want to
avoid.
2024-10-28 12:47:49 -07:00
Hansung Kim
36eb50060f
sgemm_impl: Add skeleton wgmma routine for single_tile
2024-10-28 12:47:20 -07:00
Hansung Kim
e8a943e893
Push golden sgemm_tcore kernels
2024-10-27 19:40:08 -07:00
Richard Yan
379d863456
Merge branch 'kernels' of https://github.com/hansungk/vortex-private into kernels
2024-10-24 17:31:01 -07:00
Richard Yan
fb928b5cda
new unaligned access kernel, update idle kernel
2024-10-24 17:28:59 -07:00
Richard Yan
8cc0c3bae4
fp16 no dma kernel
2024-10-24 17:12:34 -07:00
Richard Yan
04a64dee7a
fp16 dma kernel
2024-10-24 17:12:02 -07:00
Hansung Kim
6417a625b1
sgemm_impl: Add tiling params for hopper tensor core
2024-10-23 19:50:18 -07:00
Hansung Kim
68cd6455fe
sgemm_impl: Add mmio reconverge barrier to avoid slip-off; switch to FP32
2024-10-02 15:17:44 -07:00
Hansung Kim
34d0956cd5
tensor: Attempt row-major mapping for C store (WIP)
...
Doesn't work because 1x2 jagged mapping is required to achieve
throughput for storing the bigger C matrix (2x4, vs. 2x2 in A).
2024-10-02 15:14:55 -07:00
Hansung Kim
3490294626
generate_matrix.py: switch to fp16 rand, generate row-major A
2024-10-02 11:01:23 -07:00
Hansung Kim
db2789bf23
Add asm label for cisc compute
2024-10-02 10:59:14 -07:00
Hansung Kim
6f6ee5616f
Add convergent attribute to vx_barrier
...
Note this attribute is only supported by Clang, so this will only be
applied to the kernel binary but not runtime.
2024-10-02 10:57:45 -07:00
Hansung Kim
221d5f75c2
flash: Optimize smem alloc for tcore for 8banks
...
Divide into first half & last half for warpgroup 0 & 1, and
allocate Q/K and P/V in different banks for parallel acccess.
2024-09-19 21:31:39 -07:00
Hansung Kim
d0ef06cec1
flash: Complete Q_IS_K_MAJOR code for GEMM II
2024-09-19 20:36:03 -07:00
Hansung Kim
b9cafd6372
idle: unused const
2024-09-18 18:10:29 -07:00
Hansung Kim
be15cffbf3
flash: Revert to gemmini config, remove DEBUG and unnecessary checks
2024-09-12 14:25:33 -07:00
Hansung Kim
b5916f3f07
flash: Fix hardcoded barrier for tcore; move tcore-specific flags
2024-09-11 22:08:44 -07:00
Hansung Kim
d69707f686
flash: Enable GEMM II fence; Pull 1st KV move out of the loop
2024-09-11 19:24:06 -07:00
Hansung Kim
18cf0e73cd
flash: Add early return for warp-indivisible row iter
2024-09-11 00:56:09 -07:00
Hansung Kim
068d48534e
flash: Swap S1/S0 to avoid GEMM II - softmax bank conflict
...
+ remove spurrious fences to better overlap GEMM I and DMA
2024-09-11 00:55:36 -07:00
Hansung Kim
ba66d2c2bd
sgemm_impl: barrier dumb dumb
2024-09-11 00:01:56 -07:00
Hansung Kim
dc746272fb
flash: Conditionally enable GEMM II fence code, fix tile_k for DEBUG
2024-09-10 22:53:35 -07:00
Hansung Kim
28b2eaec8f
sgemm_gemmini_dma: Fix tile size to (128,64,128)
2024-09-10 18:29:40 -07:00
Hansung Kim
2152c80ffd
sgemm_impl: Add missing reconvergence barrier after mmio
2024-09-10 18:05:01 -07:00
Hansung Kim
ccddd0bcc9
sgemm_impl: Remove unused FLEXIBLE_LAYOUT
2024-09-10 15:54:17 -07:00
Hansung Kim
90e03894fc
flash: Add flag in SMEM for dependency check on O
...
TODO: results unverified.
Stalls O rescale until GEMM II finishes.
2024-09-10 13:42:47 -07:00
Hansung Kim
88760596cb
flash: Remove bogus mvout to SMEM code
2024-09-09 17:18:59 -07:00
Hansung Kim
a17edac875
flash: Fix barrier stall with DEBUG
...
Verified for up to P_expected on 2nd iter; O_before_PV is partially
correct
2024-09-09 17:02:05 -07:00
Hansung Kim
b652e25945
flash: Warp-specialize between warp 0 and 1-7
...
Finishes without stalls; No dependency check between O rescale and
GEMM-II.
2024-09-09 16:42:30 -07:00
Hansung Kim
d31c8ffd7d
flash: Fix grid size to hw cluster size
...
Verified fast config, minus the barrier stall at the end.
2024-09-09 15:44:03 -07:00
Hansung Kim
829af5d429
flash: Comment out mvout to smem
...
Verified up to O_before_PV; still stalls without DEBUG
2024-09-09 15:21:49 -07:00
Hansung Kim
ecc800964a
flash: Change smem alloc for less bank conflicts; noskip stc
2024-09-09 13:47:18 -07:00
Hansung Kim
1f51f7f9d4
sgemm_impl: Mark threadblock_barrier convergent
...
Thank you Chris Lattner
2024-09-08 22:49:38 -07:00
Hansung Kim
714b9f501e
flash: Restructure to do delayed fence for better concurrency
...
Verified up to O_before_PV of 2nd iteration; O_after_PV needs preload
fix.
FIXME: Stalls at barrier without DEBUG set.
2024-09-08 22:16:17 -07:00