Doesn't work because a 1x2 jagged mapping is required to achieve throughput when storing the bigger C matrix (2x4, vs. 2x2 in A).
tensor
Unit test kernel for the tensor core.
Build
$ python3 generate_matrix.py
$ make
The generated ELF binary runs standalone: the argument and the input matrix data are hardcoded into the binary, so nothing needs to be passed at run time.
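For readers who want a feel for what the matrix generation step might involve, here is a minimal sketch. It is not the actual generate_matrix.py: the function names, the float32 row-major layout, the matrix sizes, and the output file names are all assumptions made for illustration. It builds two random input matrices, computes a reference product on the host, and writes raw binary files that a build step could then embed into the kernel binary.

```python
import random
import struct


def generate_matrix(rows, cols, seed=0):
    """Return a rows x cols matrix of pseudo-random floats in [-1, 1]."""
    rng = random.Random(seed)  # fixed seed keeps test inputs reproducible
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Plain host-side reference product, used to check the kernel output."""
    inner = len(b)
    cols = len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(len(a))
    ]


def pack_row_major(matrix):
    """Serialize a matrix as little-endian float32 in row-major order (assumed layout)."""
    flat = [value for row in matrix for value in row]
    return struct.pack(f"<{len(flat)}f", *flat)


def write_inputs(m=16, n=16, k=16, prefix="input"):
    """Write A, B and the reference C to hypothetical <prefix>.{a,b,c}.bin files."""
    a = generate_matrix(m, k, seed=1)
    b = generate_matrix(k, n, seed=2)
    c = matmul(a, b)
    for name, matrix in (("a", a), ("b", b), ("c", c)):
        with open(f"{prefix}.{name}.bin", "wb") as f:
            f.write(pack_row_major(matrix))


if __name__ == "__main__":
    write_inputs()
```

Writing a host-computed reference C next to the inputs lets the result be compared without a separate checker run. The real script may use a different data type, layout, or file naming.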