Doesn't work because a 1x2 jagged mapping is required to achieve throughput when storing the bigger C matrix (2x4, vs. 2x2 in A).