Enqueue all different-warp requests into the queue. There is a slight
chance that an HGMMA_WAIT is blocked from committing when multiple
different-warp HGMMAs block the dequeue end, but this should be
uncommon.
This allows back-to-back issue of HGMMAs past the scoreboard, which
helps minimize DPU idle time between operations.
HGMMA_WAIT now only unblocks when *all* previous HGMMAs have finished
writeback.
If we let back-to-back HGMMAs pass the scoreboard, we can't accurately
keep track of the busy state of the tensor core and block WAITs
accordingly.
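
A minimal sketch of the "wait for *all* previous HGMMAs" rule, written as a
behavioral Python model rather than RTL. The commit message does not say how
the condition is tracked; the counter and every name below (`TensorTracker`,
`outstanding`) are hypothetical and chosen only for illustration.

```python
class TensorTracker:
    """Toy model: counts HGMMAs that have issued but not yet written back."""

    def __init__(self):
        # Hypothetical counter; not a signal name taken from the RTL.
        self.outstanding = 0

    def issue_hgmma(self):
        # Back-to-back issue is allowed, so just record one more op in flight.
        self.outstanding += 1

    def writeback_done(self):
        assert self.outstanding > 0, "writeback without a matching issue"
        self.outstanding -= 1

    def wait_can_issue(self):
        # HGMMA_WAIT may proceed only when no earlier HGMMA is still pending.
        return self.outstanding == 0


tracker = TensorTracker()
tracker.issue_hgmma()
tracker.issue_hgmma()                    # second HGMMA issued back-to-back
tracker.writeback_done()                 # first HGMMA finishes writeback
blocked_after_first = not tracker.wait_can_issue()
tracker.writeback_done()                 # second HGMMA finishes writeback
released_after_last = tracker.wait_can_issue()
```

With a single busy bit instead of a count, the first writeback would already
clear the bit and release the WAIT one operation too early, which is the
tracking problem described above.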
TODO: Distinguish "ready-to-fire" from "ready-to-use-writeback".
Upon completion of an op, tensor_core_hopper sends a "ghost" commit
signal down the pipeline with the `wb` and `tensor` bits set in
commit_if. The scoreboard receives this signal via writeback_if and
resets the inuse_tensor status bit back to zero, which unblocks the
HGMMA_WAIT instruction.
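
The set/clear handshake can be sketched as a small behavioral model. This is
an illustration under assumptions, not the scoreboard RTL: the `Writeback`
record and the `Scoreboard` methods are hypothetical, only the `wb`, `tensor`
and `inuse_tensor` names come from the description, and the model lets an
HGMMA issue unconditionally for simplicity.

```python
from dataclasses import dataclass


@dataclass
class Writeback:
    """Hypothetical stand-in for the bits seen on writeback_if."""
    wb: bool
    tensor: bool


class Scoreboard:
    def __init__(self):
        self.inuse_tensor = False

    def try_issue(self, op):
        """Return True if the op may issue this cycle."""
        if op == "HGMMA_WAIT":
            # WAIT stalls at issue while the tensor op is in flight.
            return not self.inuse_tensor
        if op == "HGMMA":
            # Kickoff of the asynchronous op marks the tensor core busy.
            self.inuse_tensor = True
        return True

    def on_writeback(self, sig):
        # Only the "ghost" commit (wb and tensor both set) clears the bit.
        if sig.wb and sig.tensor:
            self.inuse_tensor = False


sb = Scoreboard()
sb.try_issue("HGMMA")
stalled = not sb.try_issue("HGMMA_WAIT")
sb.on_writeback(Writeback(wb=True, tensor=False))  # ordinary writeback
still_stalled = not sb.try_issue("HGMMA_WAIT")
sb.on_writeback(Writeback(wb=True, tensor=True))   # ghost commit
unblocked = sb.try_issue("HGMMA_WAIT")
```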
The HGMMA_WAIT instruction stalls at issue while inuse_tensor is set;
the bit is set by the preceding HGMMA instruction. Currently
inuse_tensor is never set back to zero.
For use in the asynchronous tensor instruction. When 1'b1, sets/unsets
the inuse_tensor status bit in the scoreboard to signal
kickoff/completion of the asynchronous tensor op.
The trick is to set commit_if.data.eop to 0, since the commit module
only signals instruction completion to VX_schedule if the eop bit is 1.
Otherwise the extra completion signal underflows the pending_instr
buffer.
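
To make the underflow concrete, here is a toy Python model of the completion
accounting. It assumes, for illustration only, that pending_instr behaves
like a simple count that drops by one for each commit carrying eop=1; the
function name and the scenario are hypothetical.

```python
def pending_after(commit_eops, pending):
    """Return the pending-instruction count after a sequence of commits.

    commit_eops holds the eop bit of each commit, in order. Only commits
    with eop=1 are reported as completed instructions.
    """
    for eop in commit_eops:
        if eop:
            pending -= 1
    return pending


# One real instruction is pending. It commits with eop=1, and one extra
# commit that does not correspond to a new instruction follows it.
with_trick = pending_after([1, 0], pending=1)     # extra commit has eop=0
without_trick = pending_after([1, 1], pending=1)  # extra commit has eop=1
```

Here `with_trick` ends at 0, while `without_trick` ends at -1, the model's
equivalent of the counter underflowing.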
The same eop trick works for VX_scoreboard, which works around the
invalid rd writeback error.
Define EXT_T_HOPPER, which, when EXT_T_ENABLE is defined, selects
whether to instantiate the core-coupled Volta-style or the decoupled
Hopper-style Tensor Core.