kernels

Author	SHA1	Message	Date
Richard Yan	6eafa2de54	write operands to elf	2024-04-24 22:09:30 -07:00
Richard Yan	4e9855dc33	highly unrolled a/b load	2024-04-16 22:19:30 -07:00
Richard Yan	449d99f0bb	dram gemm kernel	2024-04-16 17:15:22 -07:00
Richard Yan	99621a0df9	Merge branch 'kernels' of https://github.com/hansungk/vortex-private into kernels	2024-04-15 10:22:19 -07:00
Richard Yan	041d49fb58	update gemmini only kernel	2024-04-15 10:22:00 -07:00
Richard Yan	0bb7aeb45b	add gpu+gemmini gemm kernel	2024-04-15 10:13:37 -07:00
Richard Yan	d8eddb21ea	add gemmini dependency	2024-04-15 10:04:54 -07:00
Hansung Kim	37a60b1141	sgemm_wg: Output C result to binary	2024-04-14 12:36:06 -07:00
Hansung Kim	3383b70732	sgemm_wg: Hardcode device address	2024-04-14 12:36:00 -07:00
Richard Yan	7bf72c9568	cycle counting for fence	2024-04-09 19:53:17 -07:00
Hansung Kim	93a00101ae	sgemm_wg: revert to faster params	2024-04-04 21:06:14 -07:00
Richard Yan	84a31f3384	thread parallel data loading for word strided bank	2024-04-01 11:10:32 -07:00
Richard Yan	e6db1a83af	Merge branch 'kernels' of https://github.com/hansungk/vortex-private into kernels	2024-04-01 11:09:43 -07:00
Hansung Kim	b0c1f77388	vx_start.S: Swizzle stack space. Striding per-thread stack space by a power of two risks bank conflicts or cache aliasing. Add an extra offset of 4 bytes to avoid this.	2024-03-29 12:26:14 -07:00
Hansung Kim	fa2b6e2ad0	sgemm_wg: Explicitly limit unroll to reduce stack spilling. This needs to be done case by case for each BK/TM/TN combination by examining the assembly.	2024-03-29 02:48:29 -07:00
Hansung Kim	537b97eb20	common.mk: Don't clean all *.elf	2024-03-28 20:17:26 -07:00
Hansung Kim	e4eec8ab4d	vx_spawn.c: Handle num_clusters > 1. WIP: still assumes num_tasks is divisible by num_cluster	2024-03-28 20:16:44 -07:00
Hansung Kim	a9b0814211	sgemm_wg: Document tiling parameter constraints	2024-03-28 18:17:00 -07:00
Hansung Kim	9673db4e8c	sgemm_wg: Fix possible divide-by-0	2024-03-28 17:35:47 -07:00
Hansung Kim	9555b790e7	sgemm_wg: ifdef-guard cluster specific code	2024-03-27 22:45:51 -07:00
Hansung Kim	09822764e7	sgemm_wg: Remove software-based barrier implementation. The intra-cluster barrier is now implemented in hardware, transparent to the ISA.	2024-03-27 22:43:45 -07:00
Hansung Kim	870846f20f	vx_spawn.c: Create separate vx_spawn_tasks_contiguous	2024-03-27 15:38:52 -07:00
Hansung Kim	fa6adceb7e	vecaddx: Hardcode args/input device address to match chipyard. Don't use mem_alloc/mem_free API	2024-03-27 15:15:52 -07:00
Hansung Kim	4e834f2103	vx_spawn.c: Rewrite cluster-based vx_spawn_tasks variant. Implements round-robin allocation of warps to cores and maintains contiguous thread ID allocation to neighboring threads. Also handles partially-enabled remainder warp logic. TODO: Hardcodes only 1 cluster in the system.	2024-03-27 15:14:45 -07:00
Hansung Kim	df1f7f242a	vx_spawn.c: Implement spawn_tasks_cluster_rem_stub	2024-03-27 00:00:44 -07:00
Richard Yan	b88dbd7a83	add cycle count and multi core support	2024-03-26 16:43:49 -07:00
Hansung Kim	b545809496	vecaddx: Use -DRADIANCE	2024-03-26 16:42:36 -07:00
Hansung Kim	4d2c0084d1	common.mk: Compile separate cluster ELF ... using -DRADIANCE, which the kernel C code uses explicitly to switch between vx_spawn_tasks and vx_spawn_tasks_cluster. This is to ease running both simX and Chipyard simulations without mixing up binaries.	2024-03-26 16:37:44 -07:00
Hansung Kim	3729a05adc	vx_spawn.c: Separate cluster-based scheduling code from original	2024-03-26 16:36:57 -07:00
Hansung Kim	f050a08d77	Write vx_spawn_tasks_cluster. This scheduling logic tries to distribute warps evenly across all cores, instead of trying to fill up the first cores as much as possible. This scheme is necessary for the intra-cluster cores, which are assumed to receive equal workloads.	2024-03-26 10:45:14 -07:00
Hansung Kim	7f00e6c376	vecaddx: Change arg device address to 7fff0000	2024-03-26 10:44:33 -07:00
Hansung Kim	cc7b34ec5b	vecaddx: Write args.bin and input.bin	2024-03-26 10:44:02 -07:00
Hansung Kim	ff401bdec0	Cleanup tests/.gitignore	2024-03-24 01:47:00 -07:00
Hansung Kim	7d177492b2	Move CORES_PER_CLUSTER to vx_spawn.h	2024-03-24 01:45:30 -07:00
Hansung Kim	8f3474b151	Don't clean *.bin	2024-03-24 01:45:08 -07:00
Hansung Kim	f590c4b417	Add vx_spawn.h as dependency to kernel/Makefile	2024-03-24 01:44:49 -07:00
Richard Yan	c18267443f	matmul kernel switch to proper fence and fsm	2024-03-20 15:22:25 -07:00
Richard Yan	94ad1850a9	implement correct gemmini fence and loop fsm support	2024-03-20 15:18:31 -07:00
Hansung Kim	12ee2a3a0f	Write cluster-aware thread scheduling. NOTE: cores per cluster is hardcoded as a constant	2024-03-18 16:40:02 -07:00
Hansung Kim	3e6771237f	Merge remote-tracking branch 'sungwoong/master' into kernels	2024-03-14 09:48:31 -07:00
Hansung Kim	2036d37840	sgemm_wg: Prevent run-ahead using ternary flags; reduce mem accesses	2024-03-13 21:35:24 -07:00
Hansung Kim	510a834db5	sgemm_wg: Implement software barrier for inter-core synchronization	2024-03-12 15:34:42 -07:00
Hansung Kim	fbe872c831	sgemm_wg: Add missing makefile dep to common.h	2024-03-12 15:34:17 -07:00
Sungwoong Ha	3c2a266d37	second pass	2024-03-01 21:27:26 -08:00
Sungwoong Ha	a9709edae2	first pass	2024-03-01 21:05:52 -08:00
Hansung Kim	6f4dfe5a0e	sgemm_wg: Implement 2D threadtiling	2024-02-29 14:40:54 -08:00
Hansung Kim	a06b2dd20e	sgemm_wg: Cleanup & proper unroll	2024-02-28 21:17:42 -08:00
Hansung Kim	46f242e520	sgemm_wg: Constantify BM/BN/BK/TM, computationally set gridsize and TB/core	2024-02-27 22:23:25 -08:00
Hansung Kim	27646bb507	sgemm_wg: Implement multiple C per thread with sliding A/B blocks	2024-02-27 22:06:01 -08:00
Hansung Kim	a2ea27b2b5	vx_spawn: Add spawn_tasks_contiguous_all_stub. Spawns tasks so that the threads in a warp see contiguous thread_id values, unlike the original variant, where each thread was allocated a range of thread_id values spanning the number of batches. E.g. in a 4-thread config, instead of mapping IDs (0,2,4,6)->(1,3,5,7), map (0,1,2,3)->(4,5,6,7). TODO: remaining logic not implemented.	2024-02-27 15:46:02 -08:00
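
The thread ID mapping described in commit a2ea27b2b5 can be illustrated with a small model. This is a sketch of the index arithmetic only, written in Python for brevity; it is not the code in vx_spawn.c. The function names `strided_task_id` and `contiguous_task_id` and the constants are hypothetical, chosen to reproduce the 4-thread example in the commit message.

```python
# Hypothetical model of the two thread_id mappings described in a2ea27b2b5.
# Names and parameters are illustrative assumptions, not taken from vx_spawn.c.

NUM_THREADS = 4   # threads per warp, as in the commit's 4-thread example
NUM_BATCHES = 2   # assumed: how many times the warp iterates over the tasks


def strided_task_id(thread, batch, num_batches):
    """Original variant as described: each thread owns a contiguous range of
    IDs, so within one batch neighboring threads see IDs num_batches apart."""
    return thread * num_batches + batch


def contiguous_task_id(thread, batch, num_threads):
    """Contiguous variant as described: within one batch, neighboring
    threads see consecutive IDs."""
    return batch * num_threads + thread


# IDs seen by the warp's threads, one row per batch.
strided = [[strided_task_id(t, b, NUM_BATCHES) for t in range(NUM_THREADS)]
           for b in range(NUM_BATCHES)]
contiguous = [[contiguous_task_id(t, b, NUM_THREADS) for t in range(NUM_THREADS)]
              for b in range(NUM_BATCHES)]

print(strided)     # [[0, 2, 4, 6], [1, 3, 5, 7]]
print(contiguous)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

With the contiguous mapping, the threads of a warp touch adjacent IDs in the same batch, which is the property the commit message calls out.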


2338 Commits