kernels

Author	SHA1	Message	Date
Hansung Kim	e4c0bbd039	sgemm: Check-in argument binaries	2025-01-28 16:04:56 -08:00
Hansung Kim	45e9407c99	sgemm: Check-in argument binaries	2025-01-28 15:58:27 -08:00
Hansung Kim	9894efe6c9	Update toolchain env paths for dork	2025-01-28 15:04:14 -08:00
Hansung Kim	5ef4c8023e	sgemm_impl: Disable wmma fast store Doesn't seem to have a big impact on tcore util.	2024-11-11 14:06:15 -08:00
Hansung Kim	7d7cb5f60a	flash: Disable perf loop multiplier	2024-11-10 22:44:02 -08:00
Hansung Kim	4448f31fdc	fence: Fix moving fence to start of loop For unknown reasons, guarding the fence with a tid == 0 branch causes a TL source ID re-used assertion. Just call the fence from all thread/warps as a workaround. At least, all threads in a warp will coalesce into one request.	2024-11-09 22:04:45 -08:00
Hansung Kim	cb916ead39	Fix potential bitwidth bug in compute API	2024-11-09 20:59:58 -08:00
Hansung Kim	68054689c9	flash: Move fence to start of loop; wrap all MMIO in one tid=0 branch	2024-11-09 20:59:26 -08:00
Hansung Kim	fcd8b0b892	flash: Disable rescale flag check GEMM-II finishes much earlier than softmax for this to be a problem.	2024-11-09 20:37:58 -08:00
Hansung Kim	1c9b022156	flash: Rename nowarpspec to default	2024-11-09 19:58:45 -08:00
Hansung Kim	8fe6d918f2	flash: Update tcore kernel to use new CISC	2024-11-09 19:49:20 -08:00
Hansung Kim	76a6aaf085	flash: doc update	2024-11-09 19:09:09 -08:00
Hansung Kim	673e07ed43	flash: Add non-warp-specialized gemmini flash kernel	2024-11-09 19:08:39 -08:00
Hansung Kim	ac42f2dbba	sgemm_gemmini_dma: Update with new compute API	2024-11-09 16:49:39 -08:00
Hansung Kim	ad75561efe	flash: Reduce fence calls to improve util	2024-11-09 16:44:17 -08:00
Hansung Kim	6990fcc1e6	Add compute-and-mvout-to-spad API	2024-11-09 16:43:45 -08:00
Hansung Kim	952b8debbb	flash: Update to use new CISC interface	2024-11-09 16:21:34 -08:00
Hansung Kim	dc89309ad0	Merge branch 'kernels-flash' into new-cisc	2024-11-09 14:42:46 -08:00
Hansung Kim	365b1d8e67	flash: Add begin end markers	2024-11-09 10:19:16 -08:00
Hansung Kim	1e3d476e70	Switch header configs to flash	2024-11-08 21:56:42 -08:00
Richard Yan	c114a7a4ab	new gemm kernel	2024-11-08 20:55:27 -08:00
Hansung Kim	4e087a8aab	flash: Fix loop iteration for gemmini Kernel is software-pipelined around 2 GEMMs and softmax; it requires two iterations to fully complete a tile.	2024-11-08 16:43:08 -08:00
Hansung Kim	4055255018	flash: Fix tcore kernel for CISC arg field changes	2024-11-08 16:40:16 -08:00
Hansung Kim	c001618fb9	sgemm_impl: Fix wrong next block_m logic for DMA	2024-10-29 22:35:56 -07:00
Hansung Kim	21b6655c10	sgemm_impl: Implement fast coalesced wmma_store Enables a fairer comparison between core-coupled tensor core to Hopper tensor core, where the latter benefits from coalesced full-throughput moveout to GMEM because it does not use the 1x2 interleaved register mapping. This means the result matrix will be stored swizzled in the GMEM, without breaking correctness.	2024-10-29 22:34:22 -07:00
Hansung Kim	6b39a6fe70	Add convenience script for switching input/args binaries	2024-10-29 20:14:33 -07:00
Hansung Kim	8dadbdd42d	tensor: Do DMA mvin for next m/n loop at the last k iter This increases util by pulling the DMA wait time out of the K-loop wraparound (next N) and overlapping it with the last K iter.	2024-10-29 19:43:22 -07:00
Hansung Kim	367fa927f8	sgemm_impl: Fix default FP_SIZE to 16	2024-10-29 14:56:18 -07:00
Hansung Kim	e1b0fc3944	generate_matrix.py: Rand [0,1); also save non-swizzled row-major B	2024-10-29 14:55:32 -07:00
Hansung Kim	24064dc7a2	sgemm_impl: Do proper addr gen and store for wgmma	2024-10-29 01:31:55 -07:00
Hansung Kim	bd7a8e39b9	sgemm_impl: Split out smem addr gen to functions so that the addr gen code can also be used for wgmma.	2024-10-29 01:30:48 -07:00
Hansung Kim	ae98ae6e93	sgemm_tcore: Fix DMA smem addresses, add markers Take into account that DMA writes B tiles starting from the end of the quartile.	2024-10-28 17:26:07 -07:00
Hansung Kim	b4dadfaf61	Merge remote-tracking branch 'origin/kernels' into kernels-hopper	2024-10-28 14:25:18 -07:00
Richard Yan	fd1c9f4729	update gemmini dma kernel	2024-10-28 13:47:13 -07:00
Hansung Kim	d0421426be	sgemm_tcore: Hardcode CISC spadQuartile addresses	2024-10-28 12:49:22 -07:00
Hansung Kim	e55c8b480e	sgemm_impl: Comment out GEMMINI_DMA code in single_tile This is already done in the higher-level thread_block_gemm function, and flash also has explicit DMA sync code. Also having this executed twice sometimes triggers vx_bar movement into a branch which we really want to avoid.	2024-10-28 12:47:49 -07:00
Hansung Kim	36eb50060f	sgemm_impl: Add skeleton wgmma routine for single_tile	2024-10-28 12:47:20 -07:00
Hansung Kim	e8a943e893	Push golden sgemm_tcore kernels	2024-10-27 19:40:08 -07:00
Richard Yan	379d863456	Merge branch 'kernels' of https://github.com/hansungk/vortex-private into kernels	2024-10-24 17:31:01 -07:00
Richard Yan	fb928b5cda	new unaligned access kernel, update idle kernel	2024-10-24 17:28:59 -07:00
Richard Yan	8cc0c3bae4	fp16 no dma kernel	2024-10-24 17:12:34 -07:00
Richard Yan	04a64dee7a	fp16 dma kernel	2024-10-24 17:12:02 -07:00
Hansung Kim	6417a625b1	sgemm_impl: Add tiling params for hopper tensor core	2024-10-23 19:50:18 -07:00
Hansung Kim	68cd6455fe	sgemm_impl: Add mmio reconverge barrier to avoid slip-off; switch to FP32	2024-10-02 15:17:44 -07:00
Hansung Kim	34d0956cd5	tensor: Attempt row-major mapping for C store (WIP) Doesn't work because 1x2 jagged mapping is required to achieve throughput for storing the bigger C matrix (2x4, vs. 2x2 in A).	2024-10-02 15:14:55 -07:00
Hansung Kim	3490294626	generate_matrix.py: switch to fp16 rand, generate row-major A	2024-10-02 11:01:23 -07:00
Hansung Kim	db2789bf23	Add asm label for cisc compute	2024-10-02 10:59:14 -07:00
Hansung Kim	6f6ee5616f	Add convergent attribute to vx_barrier Note this attribute is only supported by Clang, so this will only be applied to the kernel binary but not runtime.	2024-10-02 10:57:45 -07:00
Hansung Kim	221d5f75c2	flash: Optimize smem alloc for tcore for 8banks Divide into first half & last half for warpgroup 0 & 1, and allocate Q/K and P/V in different banks for parallel acccess.	2024-09-19 21:31:39 -07:00
Hansung Kim	d0ef06cec1	flash: Complete Q_IS_K_MAJOR code for GEMM II	2024-09-19 20:36:03 -07:00

1 2 3 4 5 ...

2648 Commits