Note: now that threads_per_threadblock is passed as a compile-time constant, the compiler tends to fully unroll loops, which can cause a lot of stack spills. TODO: fix the GEMM part.
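One possible mitigation, sketched here under assumptions rather than taken from the actual code: cap the unroll factor with `#pragma unroll N` on loops whose trip count depends on the compile-time constant. The kernel `tile_matvec`, the template parameter `THREADS_PER_THREADBLOCK`, and the factor 4 are all hypothetical and only illustrate the idea; the real GEMM loops may need a different factor or a different fix.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel (not the project's GEMM): each thread computes one row
// of a small matrix-vector product. THREADS_PER_THREADBLOCK stands in for the
// compile-time threads_per_threadblock value mentioned in the note.
template <int THREADS_PER_THREADBLOCK>
__global__ void tile_matvec(const float* a, const float* x, float* y)
{
    __shared__ float xs[THREADS_PER_THREADBLOCK];
    const int row = threadIdx.x;
    xs[row] = x[row];
    __syncthreads();

    float acc = 0.0f;
    // The trip count is known at compile time, so the compiler is free to
    // unroll this loop completely. The pragma asks for a factor of 4 instead
    // (assumed value; "#pragma unroll 1" would disable unrolling entirely).
    #pragma unroll 4
    for (int k = 0; k < THREADS_PER_THREADBLOCK; ++k) {
        acc += a[row * THREADS_PER_THREADBLOCK + k] * xs[k];
    }
    y[row] = acc;
}

int main()
{
    constexpr int kThreads = 128;  // assumed block size for this sketch
    float *a, *x, *y;
    cudaMallocManaged(&a, kThreads * kThreads * sizeof(float));
    cudaMallocManaged(&x, kThreads * sizeof(float));
    cudaMallocManaged(&y, kThreads * sizeof(float));
    for (int i = 0; i < kThreads * kThreads; ++i) a[i] = 1.0f;
    for (int i = 0; i < kThreads; ++i) x[i] = 1.0f;

    // One block of kThreads threads; the block size is also the template argument.
    tile_matvec<kThreads><<<1, kThreads>>>(a, x, y);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("y[0] = %f\n", y[0]);

    cudaFree(a);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

Whether this helps can be checked by compiling with `nvcc -Xptxas -v`, which reports the stack frame size and the spill stores and loads per kernel, and comparing the numbers with and without the pragma.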