Divide into a first half and a last half for warpgroups 0 and 1, respectively, and allocate Q/K and P/V in different banks for parallel access.
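The partitioning described above can be illustrated with a small host-side sketch. It is not the kernel's actual code: `split_halves`, `BANK_OF`, and `same_bank` are hypothetical names, the tile height of 128 rows is an arbitrary example, and the bank layout assumes one reading of the note (Q and K share one bank, P and V share the other).

```python
def split_halves(num_rows, num_warpgroups=2):
    """Return one (start, end) row range per warpgroup (end is exclusive)."""
    # Assumption: the row count divides evenly among the warpgroups.
    assert num_rows % num_warpgroups == 0
    per_group = num_rows // num_warpgroups
    return [(wg * per_group, (wg + 1) * per_group)
            for wg in range(num_warpgroups)]


# Hypothetical bank assignment: Q/K in bank 0, P/V in bank 1.
BANK_OF = {"Q": 0, "K": 0, "P": 1, "V": 1}


def same_bank(a, b):
    """True if buffers a and b live in the same bank and would contend."""
    return BANK_OF[a] == BANK_OF[b]


if __name__ == "__main__":
    for wg, (start, end) in enumerate(split_halves(128)):
        print(f"warpgroup {wg}: rows [{start}, {end})")
    print("Q and V contend:", same_bank("Q", "V"))
```

Under this layout, an access to Q or K and an access to P or V hit different banks, so the two can proceed at the same time.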