... by splitting vx_wmma_load into vx_wmma_load_{a,b} and pulling it out of the innermost loop. TODO: some address computation is duplicated across the two functions.