kernels

Author	SHA1	Message	Date
Richard Yan	c114a7a4ab	new gemm kernel	2024-11-08 20:55:27 -08:00
Hansung Kim	4e087a8aab	flash: Fix loop iteration for gemmini Kernel is software-pipelined around 2 GEMMs and softmax; it requires two iterations to fully complete a tile.	2024-11-08 16:43:08 -08:00
Hansung Kim	4055255018	flash: Fix tcore kernel for CISC arg field changes	2024-11-08 16:40:16 -08:00
Hansung Kim	c001618fb9	sgemm_impl: Fix wrong next block_m logic for DMA	2024-10-29 22:35:56 -07:00
Hansung Kim	21b6655c10	sgemm_impl: Implement fast coalesced wmma_store Enables a fairer comparison between core-coupled tensor core to Hopper tensor core, where the latter benefits from coalesced full-throughput moveout to GMEM because it does not use the 1x2 interleaved register mapping. This means the result matrix will be stored swizzled in the GMEM, without breaking correctness.	2024-10-29 22:34:22 -07:00
Hansung Kim	6b39a6fe70	Add convenience script for switching input/args binaries	2024-10-29 20:14:33 -07:00
Hansung Kim	8dadbdd42d	tensor: Do DMA mvin for next m/n loop at the last k iter This increases util by pulling the DMA wait time out of the K-loop wraparound (next N) and overlapping it with the last K iter.	2024-10-29 19:43:22 -07:00
Hansung Kim	367fa927f8	sgemm_impl: Fix default FP_SIZE to 16	2024-10-29 14:56:18 -07:00
Hansung Kim	e1b0fc3944	generate_matrix.py: Rand [0,1); also save non-swizzled row-major B	2024-10-29 14:55:32 -07:00
Hansung Kim	24064dc7a2	sgemm_impl: Do proper addr gen and store for wgmma	2024-10-29 01:31:55 -07:00
Hansung Kim	bd7a8e39b9	sgemm_impl: Split out smem addr gen to functions so that the addr gen code can also be used for wgmma.	2024-10-29 01:30:48 -07:00
Hansung Kim	ae98ae6e93	sgemm_tcore: Fix DMA smem addresses, add markers Take into account that DMA writes B tiles starting from the end of the quartile.	2024-10-28 17:26:07 -07:00
Hansung Kim	b4dadfaf61	Merge remote-tracking branch 'origin/kernels' into kernels-hopper	2024-10-28 14:25:18 -07:00
Richard Yan	fd1c9f4729	update gemmini dma kernel	2024-10-28 13:47:13 -07:00
Hansung Kim	d0421426be	sgemm_tcore: Hardcode CISC spadQuartile addresses	2024-10-28 12:49:22 -07:00
Hansung Kim	e55c8b480e	sgemm_impl: Comment out GEMMINI_DMA code in single_tile This is already done in the higher-level thread_block_gemm function, and flash also has explicit DMA sync code. Also having this executed twice sometimes triggers vx_bar movement into a branch which we really want to avoid.	2024-10-28 12:47:49 -07:00
Hansung Kim	36eb50060f	sgemm_impl: Add skeleton wgmma routine for single_tile	2024-10-28 12:47:20 -07:00
Hansung Kim	e8a943e893	Push golden sgemm_tcore kernels	2024-10-27 19:40:08 -07:00
Richard Yan	379d863456	Merge branch 'kernels' of https://github.com/hansungk/vortex-private into kernels	2024-10-24 17:31:01 -07:00
Richard Yan	fb928b5cda	new unaligned access kernel, update idle kernel	2024-10-24 17:28:59 -07:00
Richard Yan	8cc0c3bae4	fp16 no dma kernel	2024-10-24 17:12:34 -07:00
Richard Yan	04a64dee7a	fp16 dma kernel	2024-10-24 17:12:02 -07:00
Hansung Kim	6417a625b1	sgemm_impl: Add tiling params for hopper tensor core	2024-10-23 19:50:18 -07:00
Hansung Kim	68cd6455fe	sgemm_impl: Add mmio reconverge barrier to avoid slip-off; switch to FP32	2024-10-02 15:17:44 -07:00
Hansung Kim	34d0956cd5	tensor: Attempt row-major mapping for C store (WIP) Doesn't work because 1x2 jagged mapping is required to achieve throughput for storing the bigger C matrix (2x4, vs. 2x2 in A).	2024-10-02 15:14:55 -07:00
Hansung Kim	3490294626	generate_matrix.py: switch to fp16 rand, generate row-major A	2024-10-02 11:01:23 -07:00
Hansung Kim	db2789bf23	Add asm label for cisc compute	2024-10-02 10:59:14 -07:00
Hansung Kim	6f6ee5616f	Add convergent attribute to vx_barrier Note this attribute is only supported by Clang, so this will only be applied to the kernel binary but not runtime.	2024-10-02 10:57:45 -07:00
Hansung Kim	221d5f75c2	flash: Optimize smem alloc for tcore for 8banks Divide into first half & last half for warpgroup 0 & 1, and allocate Q/K and P/V in different banks for parallel acccess.	2024-09-19 21:31:39 -07:00
Hansung Kim	d0ef06cec1	flash: Complete Q_IS_K_MAJOR code for GEMM II	2024-09-19 20:36:03 -07:00
Hansung Kim	b9cafd6372	idle: unused const	2024-09-18 18:10:29 -07:00
Hansung Kim	be15cffbf3	flash: Revert to gemmini config, remove DEBUG and unnecessary checks	2024-09-12 14:25:33 -07:00
Hansung Kim	b5916f3f07	flash: Fix hardcoded barrier for tcore; move tcore-specific flags	2024-09-11 22:08:44 -07:00
Hansung Kim	d69707f686	flash: Enable GEMM II fence; Pull 1st KV move out of the loop	2024-09-11 19:24:06 -07:00
Hansung Kim	18cf0e73cd	flash: Add early return for warp-indivisible row iter	2024-09-11 00:56:09 -07:00
Hansung Kim	068d48534e	flash: Swap S1/S0 to avoid GEMM II - softmax bank conflict + remove spurrious fences to better overlap GEMM I and DMA	2024-09-11 00:55:36 -07:00
Hansung Kim	ba66d2c2bd	sgemm_impl: barrier dumb dumb	2024-09-11 00:01:56 -07:00
Hansung Kim	dc746272fb	flash: Conditionally enable GEMM II fence code, fix tile_k for DEBUG	2024-09-10 22:53:35 -07:00
Hansung Kim	28b2eaec8f	sgemm_gemmini_dma: Fix tile size to (128,64,128)	2024-09-10 18:29:40 -07:00
Hansung Kim	2152c80ffd	sgemm_impl: Add missing reconvergence barrier after mmio	2024-09-10 18:05:01 -07:00
Hansung Kim	ccddd0bcc9	sgemm_impl: Remove unused FLEXIBLE_LAYOUT	2024-09-10 15:54:17 -07:00
Hansung Kim	90e03894fc	flash: Add flag in SMEM for dependency check on O TODO: results unverified. Stalls O rescale until GEMM II finishes.	2024-09-10 13:42:47 -07:00
Hansung Kim	88760596cb	flash: Remove bogus mvout to SMEM code	2024-09-09 17:18:59 -07:00
Hansung Kim	a17edac875	flash: Fix barrier stall with DEBUG Verified for up to P_expected on 2nd iter; O_before_PV is partially correct	2024-09-09 17:02:05 -07:00
Hansung Kim	b652e25945	flash: Warp-specialize between warp 0 and 1-7 Finishes without stalls; No dependency check between O rescale and GEMM-II.	2024-09-09 16:42:30 -07:00
Hansung Kim	d31c8ffd7d	flash: Fix grid size to hw cluster size Verified fast config, minus the barrier stall at the end.	2024-09-09 15:44:03 -07:00
Hansung Kim	829af5d429	flash: Comment out mvout to smem Verified up to O_before_PV; still stalls without DEBUG	2024-09-09 15:21:49 -07:00
Hansung Kim	ecc800964a	flash: Change smem alloc for less bank conflicts; noskip stc	2024-09-09 13:47:18 -07:00
Hansung Kim	1f51f7f9d4	sgemm_impl: Mark threadblock_barrier convergent Thank you Chris Lattner	2024-09-08 22:49:38 -07:00
Hansung Kim	714b9f501e	flash: Restructure to do delayed fence for better concurrency Verified up to O_before_PV of 2nd iteration; O_after_PV needs preload fix. FIXME: Stalls at barrier without DEBUG set.	2024-09-08 22:16:17 -07:00

1 2 3 4 5 ...

2678 Commits