With 4 warps, a single call can only compute a 32x64 GEMM, so a 64x64 GEMM is serialized into two 32x64 GEMM calls by splitting along the rows.
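
The row split can be sketched on the host side as follows. This is a minimal pure-Python illustration, not the actual kernel: `gemm_32x64` and `gemm_64x64` are hypothetical names, the inner dimension K is left arbitrary, and the 32x64 routine merely stands in for whatever the 4-warp kernel computes.

```python
# Tile shape taken from the note above; everything else is an assumption.
TILE_M, TILE_N = 32, 64


def gemm_32x64(a_tile, b):
    """Stand-in for the 4-warp call: (32 x K) times (K x 64) -> (32 x 64)."""
    assert len(a_tile) == TILE_M
    assert all(len(row) == TILE_N for row in b)
    k = len(b)
    return [
        [sum(a_row[p] * b[p][j] for p in range(k)) for j in range(TILE_N)]
        for a_row in a_tile
    ]


def gemm_64x64(a, b):
    """Serialize a 64x64 result into two 32x64 calls by splitting A's rows."""
    assert len(a) == 2 * TILE_M
    top = gemm_32x64(a[:TILE_M], b)     # output rows 0..31
    bottom = gemm_32x64(a[TILE_M:], b)  # output rows 32..63
    return top + bottom                 # stack the two halves row-wise
```

Both calls reuse the same B operand; only the rows of A (and the matching rows of the output) change between them, which is why the split is along the rows rather than the columns.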