This increases utilization by moving the DMA wait out of the K-loop wraparound (the step to the next N) and overlapping it with the last K iteration.
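A rough way to see the effect is a toy cycle-count model. This is only a sketch under assumed conditions, not the actual kernel: the function names, the fixed per-iteration compute cost, and the fixed DMA latency are all hypothetical, and it assumes one DMA per N tile that can run concurrently with exactly one K iteration.

```python
def baseline_cycles(n_tiles, k_iters, compute, dma):
    """Assumed baseline: the DMA wait sits at the K-loop wraparound,
    so every N tile pays the full DMA latency before its K loop starts."""
    return n_tiles * (dma + k_iters * compute)


def overlapped_cycles(n_tiles, k_iters, compute, dma):
    """Assumed overlapped schedule: the DMA for the next N tile runs
    concurrently with the last K iteration of the current tile."""
    first_load = dma  # the first tile has no earlier iteration to hide behind
    busy = n_tiles * k_iters * compute  # compute work is unchanged
    # Only the part of the DMA that outlasts one K iteration stays exposed.
    exposed = (n_tiles - 1) * max(0, dma - compute)
    return first_load + busy + exposed


def utilization(total_cycles, n_tiles, k_iters, compute):
    """Fraction of total cycles spent on compute."""
    return n_tiles * k_iters * compute / total_cycles


if __name__ == "__main__":
    # Hypothetical numbers, chosen only for illustration.
    n_tiles, k_iters, compute, dma = 4, 8, 10, 10
    base = baseline_cycles(n_tiles, k_iters, compute, dma)
    over = overlapped_cycles(n_tiles, k_iters, compute, dma)
    print("baseline util:  ", utilization(base, n_tiles, k_iters, compute))
    print("overlapped util:", utilization(over, n_tiles, k_iters, compute))
```

In this model the gain is largest when the DMA latency fits inside one K iteration. If the DMA takes longer than that, the remainder is still exposed at each wraparound.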