Striding per-thread stack space by a power of two risks bank conflicts or
cache aliasing. Add an extra 4-byte offset to each stride to avoid this.
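A minimal sketch of the skewed stride, under stated assumptions: the names `STACK_LOG2_SIZE`, `STACK_SKEW`, `NUM_BANKS`, `stack_stride`, `stack_top` and `bank_of` are hypothetical, and the 1 KiB stack size, the 32 word-interleaved banks and the downward-growing stack are illustrative choices, not values taken from the source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical parameters chosen for illustration only. */
#define STACK_LOG2_SIZE 10u  /* nominal 1 KiB stack per thread */
#define STACK_SKEW      4u   /* extra bytes that break the power-of-two stride */
#define NUM_BANKS       32u  /* assumed number of 4-byte-interleaved banks */

/* Byte distance between the stack tops of two consecutive threads. */
static inline uint32_t stack_stride(void) {
    return (1u << STACK_LOG2_SIZE) + STACK_SKEW;
}

/* Top-of-stack address for a thread, assuming stacks grow downward. */
static inline uint32_t stack_top(uint32_t stack_base, uint32_t thread_index) {
    /* Assumption: word alignment is sufficient for the target ABI. */
    assert((stack_stride() & 3u) == 0u);
    return stack_base - thread_index * stack_stride();
}

/* Bank that a byte address falls into under the assumed interleaving. */
static inline uint32_t bank_of(uint32_t addr) {
    return (addr >> 2) % NUM_BANKS;
}
```

With a pure 1024-byte stride, every thread's stack top would land in the same bank under this model; the 4-byte skew moves each successive thread by one bank.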
Implements round-robin allocation of warps to cores and keeps thread IDs
contiguous across neighboring threads. Also handles the partially enabled
remainder warp.
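The remainder-warp handling can be sketched as a thread-mask computation. This is an illustration only: `warp_tmask` and its parameters are hypothetical names, and a warp width of at most 32 threads is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Thread mask for a warp: bit i set means lane i is enabled.
 * Full warps get all lanes; the last warp gets only the leftover lanes. */
static inline uint32_t warp_tmask(uint32_t warp,
                                  uint32_t num_threads,
                                  uint32_t threads_per_warp) {
    assert(threads_per_warp >= 1u && threads_per_warp <= 32u);

    uint32_t first_tid = warp * threads_per_warp;  /* thread ID of lane 0 */
    if (first_tid >= num_threads) {
        return 0u;  /* warp lies entirely past the end: nothing enabled */
    }

    uint32_t remaining = num_threads - first_tid;
    uint32_t active = remaining < threads_per_warp ? remaining
                                                   : threads_per_warp;

    /* Avoid the undefined 32-bit shift by 32 when all lanes are active. */
    return active == 32u ? 0xFFFFFFFFu : ((1u << active) - 1u);
}
```

For example, 10 threads in 4-wide warps give two full warps and a third warp with only lanes 0 and 1 enabled.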
TODO: Hardcodes only 1 cluster in the system.
This scheduling logic distributes warps evenly across *all* cores instead of
filling up the first cores as much as possible. This is necessary because
cores within a cluster are assumed to receive equal workloads.
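One way to picture the even distribution, as a sketch rather than the actual scheduler: warps are dealt to cores round-robin, so per-core counts differ by at most one. The helper names `core_of_warp`, `slot_of_warp` and `warps_on_core` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Core that receives a given global warp index under round-robin dealing. */
static inline uint32_t core_of_warp(uint32_t warp, uint32_t num_cores) {
    assert(num_cores > 0u);
    return warp % num_cores;
}

/* Local warp slot that the warp occupies on its core. */
static inline uint32_t slot_of_warp(uint32_t warp, uint32_t num_cores) {
    assert(num_cores > 0u);
    return warp / num_cores;
}

/* Number of warps that end up on a core: every core gets the base share,
 * and the first (total_warps % num_cores) cores get one extra warp. */
static inline uint32_t warps_on_core(uint32_t total_warps,
                                     uint32_t num_cores,
                                     uint32_t core_id) {
    assert(num_cores > 0u && core_id < num_cores);
    uint32_t base  = total_warps / num_cores;
    uint32_t extra = total_warps % num_cores;
    return base + (core_id < extra ? 1u : 0u);
}
```

With 5 warps on 4 cores this yields 2, 1, 1, 1 warps per core, whereas a fill-first policy with room for 4 warps per core would yield 4, 1, 0, 0.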
Spawns tasks so that the threads in a warp see contiguous thread_id values,
unlike the original variant, where each thread was allocated a range of
thread_id values spanning the number of batches.
E.g. in a 4-thread config, instead of mapping IDs (0,2,4,6)->(1,3,5,7),
map (0,1,2,3)->(4,5,6,7).
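The two mappings in the example can be written as index formulas. This is a sketch with hypothetical names (`task_id_strided`, `task_id_contiguous`); "lane" means a thread's position within the warp and "batch" means the iteration in which that lane picks up its next ID.

```c
#include <assert.h>
#include <stdint.h>

/* Original variant: each lane owns a contiguous range of num_batches IDs,
 * so the IDs seen across a warp in one batch are num_batches apart. */
static inline uint32_t task_id_strided(uint32_t lane,
                                       uint32_t batch,
                                       uint32_t num_batches) {
    assert(batch < num_batches);
    return lane * num_batches + batch;
}

/* New variant: each batch covers num_lanes consecutive IDs,
 * so neighboring lanes see neighboring IDs. */
static inline uint32_t task_id_contiguous(uint32_t lane,
                                          uint32_t batch,
                                          uint32_t num_lanes) {
    assert(lane < num_lanes);
    return batch * num_lanes + lane;
}
```

For 4 lanes and 2 batches, the strided form gives (0,2,4,6) then (1,3,5,7), and the contiguous form gives (0,1,2,3) then (4,5,6,7), matching the example above.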
TODO: Remaining logic not implemented.
+ Microarchitecture optimizations
+ 64-bit support
+ Xilinx FPGA support
+ LLVM-16 support
+ Refactoring and quality control fixes
minor update
cleanup
cache bindings and memory perf refactoring
hw unit test fixes