Upon completion of an op, tensor_core_hopper sends a "ghost" commit
signal down the pipeline with the `wb` and `tensor` bit set in
commit_if. The scoreboard receives this signal via writeback_if and
resets the inuse_tensor status bit back to zero, which unblocks the
HGMMA_WAIT instruction.
HGMMA_WAIT instruction stalls at issue when inuse_tensor is set, which
is done by the previous HGMMA insn. Currently inuse_tensor is never set
back to zero.
For use in the asynchronous tensor instruction. When 1'b1, sets/unsets
the inuse_tensor status bit in the scoreboard to signal
kickoff/completion of the asynchronous tensor op.
Trick is to set commit_if.data.eop to 0, since the commit module only
signals instruction completion to VX_schedule if the eop bit is 1.
Otherwise it underflows the pending_instr buffer.
The same eop trick works for VX_scoreboard, which works around the
invalid rd writeback error.
Define EXT_T_HOPPER that, when EXT_T_ENABLE is defined, distinguishes
whether to instantiate core-coupled Volta-style or decoupled
Hopper-style Tensor Core.
The two threadgroups use the same B fragment, so no need to duplicately
store them in the operand buffer. To do this, pull the operand buffer
out of the threadgroups to the octet-level.
Instead of having a single candidate to be considered for dispatch
(designated by 'batch_idx' counter), add a dispatch_unit variant that
considerse all `ISSUE_WIDTH dispatch signals and picks a valid one in a
round-robin manner.
This increases core utilization significantly due to better overlapping
of smem/tensor ops.
Otherwise the first-in-pair instructions can run ahead, latching their
inputs for the next pair before the second-in-pair insts finish compute
on the current one. Might introduce more frontend stalls, need more
experimenting
An easy solution to handle multiple concurrent warp operations by
staging half-operands in their own per-warp register. This might
increase area requirement by quite a bit.
TODO: Commit is not being handled correctly yet