Since the core does not support memory accesses to non-word-aligned addresses, pack fp16 elements in pairs into 32-bit (fp32-sized) values and do regular tile movement, with the column dimension compressed (halved) when the element type is fp16. Perf seems to stay the same for fp32 256x256.
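
A minimal sketch of the packing idea, not the actual implementation. It assumes an even column count, little-endian layout with the even-indexed element in the low half of each word, and the helper names `pack_fp16_row`, `unpack_fp16_row`, and `effective_cols` are hypothetical:

```python
import struct


def pack_fp16_row(row):
    """Pack consecutive fp16 pairs into 32-bit words (assumed layout:
    even-indexed element in the low 16 bits, little-endian)."""
    if len(row) % 2 != 0:
        raise ValueError("row length must be even to pack fp16 pairs")
    words = []
    for i in range(0, len(row), 2):
        # "<2e" encodes two IEEE 754 half-precision values into 4 bytes.
        raw = struct.pack("<2e", row[i], row[i + 1])
        # Reinterpret those 4 bytes as one unsigned 32-bit word.
        words.append(struct.unpack("<I", raw)[0])
    return words


def unpack_fp16_row(words):
    """Inverse of pack_fp16_row: split each 32-bit word into two fp16 values."""
    out = []
    for w in words:
        out.extend(struct.unpack("<2e", struct.pack("<I", w)))
    return out


def effective_cols(cols, is_fp16):
    """Column count seen by word-sized tile movement: halved for fp16,
    unchanged for fp32."""
    return cols // 2 if is_fp16 else cols
```

With this layout every access stays word-aligned, and the tile-movement code only needs the adjusted column count; the fp32 path is untouched because `effective_cols` returns the original value.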