Compare commits

316 Commits

Author SHA1 Message Date
0ad87bde81 Implement WU architecture support 2026-05-25 19:25:05 +08:00
323ed7d7e9 Update Vortex core for Blackwell tensor instructions
- Add Blackwell tensor core support in VX_tensor_blackwell_core.sv
- Update decode, execute, and dispatch logic for new instructions
- Extend VX_define.vh and VX_types.vh with Blackwell ISA definitions
2026-05-06 14:50:54 +08:00
cb912d3b8b Add Blackwell tensor RTL scaffolding 2026-04-25 10:15:31 +08:00
Hansung Kim
f1d0fac518 Change to 8-core Volta/Ampere config 2025-01-28 22:36:58 -08:00
Hansung Kim
c8529c4339 Disable EXT_T_HOPPER atm for flash runs 2024-11-08 21:52:52 -08:00
Hansung Kim
cf000afc8f tensor: Remove unused a[2] and a[3] ports for FP32 DPU 2024-11-08 14:34:47 -08:00
Richard Yan
8dc2a25e32 oopsie 3 2024-11-02 14:52:43 -07:00
Richard Yan
d794e055b6 oopsie 2 2024-11-02 14:51:52 -07:00
Richard Yan
ed61418ebf oopsie 2024-11-01 02:55:38 -07:00
Richard Yan
2e3ea060a5 gate operand read 2024-11-01 02:44:35 -07:00
Hansung Kim
ef902614ff tensor: Fix race in inflight_tensor counter 2024-10-29 14:14:31 -07:00
Hansung Kim
1013a74abd tensor: Switch back to hopper + 4 cores 2024-10-28 23:39:39 -07:00
Hansung Kim
19876ab9fd tensor: Fix wrong writeback bit 2024-10-28 21:47:25 -07:00
Hansung Kim
8a66b5ed89 tensor: Connect SMEM addr/rf IO 2024-10-28 19:42:02 -07:00
Hansung Kim
4376bd33a2 tensor: Decode rs1/rs2 of HGMMA for smem addresses 2024-10-28 19:41:37 -07:00
Hansung Kim
72db04cec0 tensor: Switch to 8cores, non-hopper config 2024-10-27 19:47:22 -07:00
Hansung Kim
3e67ddd6c6 tensor: Properly guard tc_rf_if for non-hopper 2024-10-27 17:55:09 -07:00
Hansung Kim
1bc4afe2bb tensor: Bore tensor regfile IO to execute units 2024-10-24 20:32:18 -07:00
Hansung Kim
c88fd89f1f tensor: Don't make initiate_valid depend on ready 2024-10-24 19:29:21 -07:00
Richard Yan
b64e53ff02 Merge branch 'rtl' of github.com:hansungk/vortex-private into rtl 2024-10-24 16:51:22 -07:00
Richard Yan
155cbb0abc tc rf read port 2024-10-24 16:51:15 -07:00
Hansung Kim
40565de8cd tensor: Fix initiate sync with meta queue when !commit.ready 2024-10-24 16:41:54 -07:00
Hansung Kim
3ebeb43568 tensor: Fix inflight_tensor decrement, add under/overflow checks 2024-10-24 14:36:29 -07:00
Hansung Kim
8337488ed3 tensor: Don't check invalid writeback reg for ghost writes 2024-10-24 14:36:18 -07:00
Hansung Kim
e855a47295 Add missing commit_if.tensor bit inits 2024-10-24 13:28:30 -07:00
Hansung Kim
c77a25c968 tensor: Add missing HOPPER guard 2024-10-23 20:33:45 -07:00
Hansung Kim
78df981366 tensor: Simplify metadata queue
Enqueue all different-warp reqs into the queue. There is a slight chance
that an HGMMA_WAIT might be blocked from commit when there are multiple
different-warp HGMMAs blocking the dequeue end, but it should be
uncommon.
2024-10-22 22:01:18 -07:00
Hansung Kim
69cbbdd89b tensor: Consider inflight ops for HGMMA blocking
This allows for back-to-back issue of HGMMA past the scoreboard, which
helps to minimize downtime in DPU activity in-between operations.
HGMMA_WAIT now only unblocks when *all* previous HGMMAs have finished
writeback.
2024-10-22 21:32:33 -07:00
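Note on 69cbbdd89b: the blocking rule described above can be sketched as a small behavioral model. This is not the RTL. The class name, the `max_inflight` bound, and the use of assertions for the bounds checks are assumptions made for illustration only.

```python
class TensorInflightModel:
    """Toy model of an in-flight HGMMA counter (all names hypothetical)."""

    def __init__(self, max_inflight=4):
        # max_inflight is an assumed bound, not a value taken from the RTL.
        self.max_inflight = max_inflight
        self.inflight = 0

    def issue_hgmma(self):
        # Back-to-back HGMMAs may pass the scoreboard; each one bumps the counter.
        assert self.inflight < self.max_inflight, "inflight counter overflow"
        self.inflight += 1

    def writeback_hgmma(self):
        # Each finished writeback decrements the counter exactly once.
        assert self.inflight > 0, "inflight counter underflow"
        self.inflight -= 1

    def wait_can_issue(self):
        # HGMMA_WAIT unblocks only when *all* previous HGMMAs have written back.
        return self.inflight == 0
```

With two HGMMAs issued, `wait_can_issue()` stays false after the first writeback and becomes true only after the second.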
Hansung Kim
98eb7cb594 tensor: Block both HGMMA/HGMMA_WAIT at scoreboard
If we let back-to-back HGMMAs pass at scoreboard, we can't accurately
keep track of the busy state of the tensor core and block WAITs
accordingly.

TODO: Distinguish "ready-to-fire" from "ready-to-use-writeback".
2024-10-22 21:10:55 -07:00
Hansung Kim
83979c3341 tensor: Fully connect writeback IO 2024-10-22 20:17:00 -07:00
Hansung Kim
47dff74d3a tensor: Fix commit/metadata logic for HGMMA
Block HGMMA commit until previous ones are all done; always commit
HGMMA_WAIT after it passes the scoreboard.
2024-10-22 20:01:37 -07:00
Hansung Kim
3abaaff16f tensor: Fix tag and data assignment for p0/p1 bus 2024-10-22 17:47:04 -07:00
Hansung Kim
8a8f682194 tensor: Bore smem IO from core to tensor core 2024-10-22 17:42:30 -07:00
Hansung Kim
9131558950 tensor: Connect Chisel-generated TensorCoreDecoupled module
Elaborates, but most of the IOs are tied to fake.
2024-10-22 15:16:24 -07:00
Hansung Kim
32ccdeef01 Merge branch 'tensor-decoupled' into rtl 2024-10-21 22:57:07 -07:00
Hansung Kim
0f06afc3ef Update doc 2024-10-21 22:37:20 -07:00
Richard Yan
cde8da1f3b add tag to tc smem interface 2024-10-17 14:48:39 -07:00
Hansung Kim
4dcbc31a88 tensor: Separate async commit from tensor commit
With this we can prioritize commit of the async hgmma instructions over
the "ghost" commits from the TC.
2024-10-11 21:32:20 -07:00
Hansung Kim
717fe7ff29 tensor: Fix FSM when commit not ready 2024-10-11 20:24:31 -07:00
Hansung Kim
2934b1bd94 tensor: Split execution module from pipeline logic 2024-10-11 20:09:09 -07:00
Hansung Kim
f7f23e0c05 tensor: Doc update 2024-10-11 18:00:36 -07:00
Hansung Kim
42b9d23f83 tensor: Write release logic for hgmma
Upon completion of an op, tensor_core_hopper sends a "ghost" commit
signal down the pipeline with the `wb` and `tensor` bit set in
commit_if.  The scoreboard receives this signal via writeback_if and
resets the inuse_tensor status bit back to zero, which unblocks the
HGMMA_WAIT instruction.
2024-10-11 17:58:44 -07:00
Hansung Kim
408a9b5d2a tensor: Write stall logic for hgmma_wait
HGMMA_WAIT instruction stalls at issue when inuse_tensor is set, which
is done by the previous HGMMA insn. Currently inuse_tensor is never set
back to zero.
2024-10-11 17:18:01 -07:00
Hansung Kim
72f9dedce3 tensor: Disable micro-ops for hopper
Have an uarch FSM handle the stepping mechanism entirely.
2024-10-11 15:59:31 -07:00
Hansung Kim
100d69ef21 Doc update on accumulator regs 2024-10-11 15:47:58 -07:00
Hansung Kim
d9ad4809ec Add 'tensor' bit to commit_if and writeback_if
For use in the asynchronous tensor instruction.  When 1'b1, sets/unsets
the inuse_tensor status bit in the scoreboard to signal
kickoff/completion of the asynchronous tensor op.
2024-10-11 15:42:25 -07:00
Hansung Kim
58c9761829 Revert decode change for hopper
Share the same insn as non-hopper TC.
2024-10-09 21:53:04 -07:00
Hansung Kim
7ab14445f0 tensor: Test many-commit per execute with an FSM
Trick is to set commit_if.data.eop to 0, since the commit module only
signals instruction completion to VX_schedule if the eop bit is 1.
Otherwise it underflows the pending_instr buffer.

The same eop trick works for VX_scoreboard, which works around the
invalid rd writeback error.
2024-10-07 21:29:44 -07:00
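Note on 7ab14445f0: a minimal sketch of why the eop bit matters when one instruction produces several commits. It assumes that the final commit of the instruction carries eop = 1, which the message above does not state. `PendingInstrCounter` and `run_multi_commit` are hypothetical names, not modules in the repository.

```python
class PendingInstrCounter:
    """Toy stand-in for the scheduler's pending-instruction count."""

    def __init__(self):
        self.pending = 0

    def issue(self):
        self.pending += 1

    def commit(self, eop):
        # Only a commit with eop set signals instruction completion.
        if eop:
            assert self.pending > 0, "pending_instr underflow"
            self.pending -= 1


def run_multi_commit(num_commits, eop_on_last_only):
    """Issue one instruction, then commit it num_commits times."""
    ctr = PendingInstrCounter()
    ctr.issue()
    for i in range(num_commits):
        is_last = i == num_commits - 1
        ctr.commit(eop=is_last if eop_on_last_only else True)
    return ctr.pending
```

`run_multi_commit(4, True)` ends with zero pending instructions, while `run_multi_commit(4, False)` trips the underflow assertion on the second commit.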
Hansung Kim
e8ca4677df Remove old code for pending_instr underflow fix 2024-10-07 20:21:35 -07:00
Hansung Kim
4cac1adf7d Add dummy code for decoupled Hopper tensor core
Define EXT_T_HOPPER that, when EXT_T_ENABLE is defined, distinguishes
whether to instantiate core-coupled Volta-style or decoupled
Hopper-style Tensor Core.
2024-10-07 17:10:59 -07:00
Richard Yan
8bf7f39f04 add tensor core memory interface 2024-10-07 02:56:38 -07:00
Hansung Kim
da54162241 tensor: Add FP16 parameter and expose to VX_core 2024-09-10 15:32:17 -07:00
Hansung Kim
a968bdd69b tensor: Fix HALF_PRECISION to 1 2024-09-08 01:43:21 -07:00
Richard Yan
3f8c28c7d6 sync rf, x0 fix 2024-09-05 16:49:05 -07:00
Hansung Kim
2b1a9b7c16 tensor: Rename & docs 2024-08-23 16:21:45 -07:00
Hansung Kim
45f6ae5aad tensor: Doc comments 2024-08-20 14:46:40 -07:00
Hansung Kim
20faf87b80 tensor: Rename halves_buf to reduce confusion 2024-08-19 16:42:02 -07:00
Hansung Kim
789d873e19 Disable reduce_unit for timing optimization
Currently the critical path @1GHz is found at the accumulators inside
reduce_unit.
2024-08-16 15:28:56 -07:00
Hansung Kim
715539b2c3 Guard trace printf in mem_scheduler for synthesis 2024-08-15 06:09:39 -07:00
Hansung Kim
119c52004e Enable LSU dedup in VX_platform.vh 2024-08-15 13:39:43 -07:00
Hansung Kim
1410b39143 Disable trace during the very start of simulation 2024-08-13 16:01:29 -07:00
Hansung Kim
d39e24643d tensor: Parameterize fedp for fp16/fp32 2024-08-12 20:01:56 -07:00
Hansung Kim
15e93e01d8 tensor: Split packed fp16 and wire correctly to DPU 2024-08-07 11:16:38 -07:00
Hansung Kim
d4d18c2823 tensor: spurious assert, doc, remove unused param 2024-07-29 16:06:55 -07:00
Hansung Kim
4e0dcdadac tensor: Share B operand buffer between threadgroups
The two threadgroups use the same B fragment, so no need to duplicately
store them in the operand buffer.  To do this, pull the operand buffer
out of the threadgroups to the octet-level.
2024-07-27 20:42:08 -07:00
Hansung Kim
7ad3f64528 tensor: Remove old ready_reg DPI code 2024-07-27 17:36:02 -07:00
Hansung Kim
01f6024a76 tensor: Split flops into structural module
to get separate area/power numbers in hierarchical
2024-07-26 16:26:48 -07:00
Hansung Kim
7f43bab0aa tensor: Parameterize result buffer depth 2024-07-25 16:31:45 -07:00
Hansung Kim
f3afd4a6f9 Hardcode NUM_THREADS/.. only when SYNTHESIS
They're duplicately set in VX_config.vh which is confusing.
2024-07-23 15:15:16 -07:00
Richard Yan
ed247e21bb Merge branch 'rtl' of https://github.com/hansungk/vortex-private into rtl 2024-07-20 23:37:58 -07:00
Richard Yan
7d422cc9b0 pre-submission changes 2024-07-20 23:33:56 -07:00
Hansung Kim
14b811f334 Update doc 2024-07-19 16:39:05 -07:00
Hansung Kim
4b093e3ff7 tensor: Mark PARTIAL_BW on power impact 2024-06-26 14:25:26 -07:00
Hansung Kim
9a6fe79bd3 VX_operands_dup: Add counter for RF read/write accesses 2024-06-22 16:35:23 -07:00
Hansung Kim
fb973a51b6 core_wrapper: Only terminate when core 0 is finished; more slack time 2024-06-22 16:34:42 -07:00
Hansung Kim
46fe1897bf VX_platform.vh: Undefine FIRESIM by default 2024-06-22 16:34:08 -07:00
Hansung Kim
d4f6f8a257 Set NUM_ALU_BLOCKS=2, NUM_FPU_BLOCKS=1 2024-06-22 16:33:42 -07:00
Hansung Kim
a9b75dd492 Set default to 4cores/8barriers in VX_config.{h,vh} 2024-06-12 20:51:15 -07:00
Hansung Kim
86deaa8e07 Give some slack time for other cores to finish 2024-06-12 09:47:21 -07:00
Richard Yan
1833e8a176 Merge branch 'rtl' of https://github.com/hansungk/vortex-private into rtl 2024-06-12 02:17:01 -07:00
Richard Yan
7947df8a6c config change, move ucode 2024-06-12 02:15:08 -07:00
Hansung Kim
5218292b6f core_wrapper: Use finished and !reset to determine termination 2024-06-11 16:28:05 -07:00
Hansung Kim
de10d5a957 Don't print from mem_scheduler in reset 2024-06-09 22:44:33 -07:00
Hansung Kim
5d5e4a468c Merge remote-tracking branch 'refs/remotes/origin/rtl' into rtl 2024-06-09 15:58:32 -07:00
Richard Yan
a47389fc0e Merge branch 'rtl' of https://github.com/hansungk/vortex-private into rtl 2024-06-09 15:15:31 -07:00
Richard Yan
67a13410fd gate level sim changes 2024-06-09 15:15:01 -07:00
Hansung Kim
1bacbb839f Add GPR_DUPLICATED to synthesis in VX_platform.vh 2024-06-09 14:00:34 -07:00
Hansung Kim
874a3bf194 Doc changes 2024-06-09 13:41:00 -07:00
Hansung Kim
12f8722dd5 Shush display 2024-06-03 13:04:09 -07:00
Hansung Kim
9caafb2d8a tensor: Decode rd of macro-op to designate additional accumulator
This is useful when you want to have the tensor core output to multiple
accumulator registers, e.g. when doing outer product within the RF.
2024-05-31 19:17:56 -07:00
Hansung Kim
0ebbb8e223 tensor: Fix perf counter; comment out dpi 2024-05-31 00:32:32 -07:00
Hansung Kim
73293061ea tensor: Enlarge metadata queue 2024-05-30 23:21:23 -07:00
Hansung Kim
52bb827a46 Handle BLOCK_SIZE != 1 in dispatch_unit
+ change ALU and FPU unit to use it as well
2024-05-30 23:20:21 -07:00
Hansung Kim
a02773eb92 Add more efficient dispatch_unit
Instead of having a single candidate to be considered for dispatch
(designated by 'batch_idx' counter), add a dispatch_unit variant that
considers all `ISSUE_WIDTH dispatch signals and picks a valid one in a
round-robin manner.

This increases core utilization significantly due to better overlapping
of smem/tensor ops.
2024-05-30 21:55:42 -07:00
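Note on a02773eb92: the difference between the two selection policies can be sketched as follows. This is a software illustration under assumed names (`pick_batch_idx`, `pick_round_robin`, `last_grant`); the actual dispatch_unit logic may differ in how it tracks the previous grant.

```python
def pick_batch_idx(valids, batch_idx):
    """Single-candidate policy: only the slot named by batch_idx is considered."""
    return batch_idx if valids[batch_idx] else None


def pick_round_robin(valids, last_grant):
    """Round-robin policy: scan all slots, starting just after the last grant."""
    n = len(valids)
    for offset in range(1, n + 1):
        idx = (last_grant + offset) % n
        if valids[idx]:
            return idx
    return None  # no slot has a valid dispatch this cycle
```

For `valids = [False, True, False, True]`, the single-candidate policy with `batch_idx = 0` dispatches nothing, while round-robin with `last_grant = 1` picks slot 3 and, on the next cycle with `last_grant = 3`, picks slot 1.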
Hansung Kim
574cc0e5f0 tensor: Document configuring queue depths 2024-05-30 18:33:15 -07:00
Hansung Kim
83f9f6d84f tensor: Fix sync for dpu warp queue as well 2024-05-30 18:22:36 -07:00
Hansung Kim
0a032ab400 tensor: Fix out-of-sync enqueue to dpu and metadata queue 2024-05-30 18:03:04 -07:00
Hansung Kim
97f37b1c75 tensor: Add commit stall injection for debugging 2024-05-30 18:00:26 -07:00
Hansung Kim
06e0f901ff tensor: Handle backpressure from metadata queue 2024-05-30 17:34:49 -07:00
Hansung Kim
dfb2276657 tensor: Remove redundant issue queue outside pdu 2024-05-30 17:29:59 -07:00
Hansung Kim
2743d32bd2 tensor: Handle wid queue backpressure in dpu 2024-05-30 15:25:00 -07:00
Hansung Kim
2e2decc8b6 Shrink size of D_half latch 2024-05-30 12:46:45 -07:00
Hansung Kim
73a2f5781e Do two-cycle compute with 1 FEDP per lane 2024-05-30 12:41:41 -07:00
Hansung Kim
35273b3d74 Set correct dpu hmma latency 2024-05-29 17:14:54 -07:00
Hansung Kim
5ed6041e33 tensor: Properly stall dpu upon commit backpressure
& better-reasoned queue depths
2024-05-29 17:05:53 -07:00
Hansung Kim
f5a9ca5bf3 tensor: Enqueue both insts in pair to issue queue
Otherwise the first-in-pair instructions can run ahead, latching their
inputs for the next pair before the second-in-pair insts finish compute
on the current one.  Might introduce more frontend stalls, need more
experimenting
2024-05-29 14:47:25 -07:00
Hansung Kim
e9df173745 tensor: Use chisel-generated dpu module 2024-05-29 13:34:25 -07:00
Hansung Kim
c03a5b070c tensor: Issue queue for dpu to improve utilization 2024-05-27 18:25:10 -07:00
Hansung Kim
28f6cd59b5 tensor: Improve commit efficiency by decoupling dpu with fifo 2024-05-26 22:00:25 -07:00
Hansung Kim
864265bda5 tensor: Fix consecutive commits to write to same warp
... by splitting the pending_uops queue across warps.
2024-05-25 20:04:31 -07:00
Hansung Kim
5a95eba1f5 tensor: Clear c_*_tile before compute
This didn't really cause any problem, but just to be sure.
2024-05-25 19:54:44 -07:00
Hansung Kim
8775458a8f Stage half-operands per warp
An easy solution to handle multiple concurrent warp operations by
staging half-operands in their own per-warp register.  This might
increase area requirement by quite a bit.

TODO: Commit is not being handled correctly yet
2024-05-25 19:09:56 -07:00
Hansung Kim
45d86b26a2 tensor: Add counter for dpu operations 2024-05-16 22:15:01 -07:00
Hansung Kim
5034d8d14b tensor: Add buffer to hide 2cyc commit latency
Since operand and commit throughput are the same (2 cycles), it is
unnecessary to stall the dpu during the multi-cycle commit.
This enables the dpu to operate at full throughput of 1 operand every 2
cycles.
2024-05-16 20:09:08 -07:00
Hansung Kim
317695a8d0 Add perf counters on LSU resp valid tmasks 2024-05-16 15:34:54 -07:00
Hansung Kim
89e7d65926 tensor: Add ready signal to enforce 1 warp occupancy
Currently disabled as the timing behavior is already ~accurate
2024-05-16 15:34:54 -07:00
Hansung Kim
1a1094b2bb tensor: Add dispatch unit to narrow to BLOCK_SIZE=1 2024-05-16 15:34:54 -07:00
Hansung Kim
9f9ec10960 tensor: Enable scaling NUM_THREADS by octets
todo: lane-to-octet mapping is arbitrary atm
2024-05-16 15:34:50 -07:00
Richard Yan
d624b3e50a store fencing, large smem, fix tensor core for firesim 2024-05-15 21:45:48 -07:00
Richard Yan
0dd5335851 fix merge error once again 2024-05-08 11:31:43 -07:00
Richard Yan
16dfae7d3f Merge branch 'rtl' of https://github.com/hansungk/vortex-private into rtl 2024-05-08 11:28:39 -07:00
Richard Yan
629279977e fix merge error 2024-05-08 11:28:36 -07:00
Hansung Kim
be748b109a Fix faulty merge on syn-only flags 2024-05-07 18:37:25 -07:00
Hansung Kim
f71e705d53 Revert to old LSUQ_SIZE 2024-05-07 16:23:32 -07:00
Richard Yan
4aad161739 Merge branch 'rtl' of https://github.com/hansungk/vortex-private into rtl 2024-05-07 14:00:31 -07:00
Richard Yan
37616f3334 firesim modifications 2024-05-07 13:59:25 -07:00
Richard Yan
c9a3eaad79 accelerator cisc 2024-05-07 13:58:32 -07:00
Richard Yan
14d1552f08 potential deadlock 2024-05-07 13:56:51 -07:00
Richard Yan
1e5dff52c1 shrink queue sizes 2024-05-07 13:54:23 -07:00
Hansung Kim
868bbdb15e tensor: more doc 2024-05-07 13:54:10 -07:00
Richard Yan
b70df8cbc9 proper srams 2024-05-07 13:52:07 -07:00
Hansung Kim
9c1d797250 tensor: add missing } 2024-05-05 18:36:15 -07:00
Hansung Kim
fb626ee21c tensor: doc 2024-05-05 18:35:52 -07:00
Hansung Kim
9ea291eea2 Merge remote-tracking branch 'origin/tensor_core' into rtl 2024-05-05 17:03:57 -07:00
joshua
5bd25985c6 i kinda forgot most of changes 2024-05-04 23:01:47 -07:00
Hansung Kim
1c7acab160 tensor: Fix lint errors 2024-05-03 15:43:02 -07:00
Hansung Kim
5a0ee98a61 Remove duplicate port connection 2024-05-03 15:07:24 -07:00
Hansung Kim
bc45c40231 tensor: Rename half.hpp -> half.h
addResource() thinks it's a Verilog source file if it ends in .hpp, for
some reason.
2024-05-02 16:17:20 -07:00
Hansung Kim
c4b94e4f2c Wrap hardcoded configs with SYNTHESIS 2024-05-02 16:17:04 -07:00
Hansung Kim
c4d71bc3d6 tensor: Fix multiple driver error on VCS 2024-05-01 21:40:48 -07:00
Hansung Kim
7fc5b6a374 tensor: Fix elaboration error on VCS 2024-05-01 21:40:45 -07:00
Hansung Kim
675e8ea130 Merge branch 'tensor_core' into rtl 2024-05-01 16:18:14 -07:00
Hansung Kim
9a688a05b1 Add (unconnected) FPU perf counters
mainly for debugging
2024-04-29 15:20:55 -07:00
Hansung Kim
100fbbc048 Increase FPUQ_SIZE
This should at least be FMA_LATENCY to not bottleneck things.
2024-04-29 15:19:48 -07:00
Richard Yan
85213d2876 synthesizable design 2024-04-17 18:05:51 -07:00
Richard Yan
17fd29c114 Merge branch 'rtl' of https://github.com/hansungk/vortex-private into rtl 2024-04-16 23:03:04 -07:00
Richard Yan
8de5470da4 round robin warp scheduling 2024-04-16 23:03:00 -07:00
Hansung Kim
217bc189da ifdef-guard VX_operand* to enable including both in Chisel 2024-04-15 22:06:47 -07:00
Hansung Kim
4752b86858 Limit NUM_SFU_LANES to 4
Simulation seems to not like SFU_LANES=8; dial back for now
2024-04-15 21:48:59 -07:00
Hansung Kim
978b1fe2d0 Add operands stage with duplicated RF for rs1/2/3 2024-04-15 16:45:59 -07:00
Hansung Kim
87b966a5fa Add perf counter for stall by any operand hazard 2024-04-15 01:01:26 -07:00
Hansung Kim
7ae54bd280 Remove unused IO in core_wrapper 2024-04-13 17:13:39 -07:00
Richard Yan
d3e0f18fd5 Merge branch 'rtl' of https://github.com/hansungk/vortex-private into rtl 2024-04-09 19:55:11 -07:00
Richard Yan
41a79a03a4 parametrize memory interface in core wrapper and update config.vh 2024-04-09 19:55:06 -07:00
Hansung Kim
6c632200d5 Divide by per-breakdown cycle for avg stall cycles 2024-04-03 15:29:51 -07:00
Hansung Kim
62c7d1f4cf Report any fire cycles from scoreboard as well 2024-03-29 12:23:15 -07:00
Hansung Kim
50263a5f7d Rename sched_barrier_stalls -> perf_sched_barrier_idles
Sched stall by barrier is really idle because it causes !scheduler_if.valid,
which is counted as part of sched_idle.
2024-03-28 22:45:12 -07:00
joshua
d8f9359fae test case update 2024-03-28 13:04:02 -07:00
joshua
08d7721e11 annoying swizzling problems 2024-03-28 03:00:15 -07:00
joshua
e16584ddd9 bleh still not work 2024-03-27 00:26:04 -07:00
Hansung Kim
dd90736382 Reformat perfcount report 2024-03-23 01:07:46 -07:00
Hansung Kim
3e6a9a6104 Expose scoreboard fires to perf interface 2024-03-23 01:06:40 -07:00
Hansung Kim
d99295793c Periodically report perf counter; reformat operand/FU stalls 2024-03-23 00:02:02 -07:00
Hansung Kim
83e151a189 Add valid / fire / cycles-issued perf counters to dispatch 2024-03-23 00:01:15 -07:00
Hansung Kim
573be030c8 Add issue-stall-by-operand-hazard perf counters
Do the same reduce by + instead of OR fix for scoreboard counters.
2024-03-23 00:00:08 -07:00
Hansung Kim
dda67da84c Add issue-stall-by-unit-busy perf counters
Add per-issue-width counters instead of using reduce "OR" and causing
undercounting.
2024-03-21 18:11:12 -07:00
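Note on dda67da84c: a small sketch of the undercounting problem. Each trace entry holds one stall flag per issue slot for one cycle; the trace format and function names are assumptions for illustration, not the perf counter RTL.

```python
def count_stalls_or(stalls_per_cycle):
    """OR-reduce across slots: at most one count per cycle."""
    return sum(1 for slots in stalls_per_cycle if any(slots))


def count_stalls_per_slot(stalls_per_cycle):
    """One counter per issue slot: every stalled slot is counted."""
    width = len(stalls_per_cycle[0])
    counters = [0] * width
    for slots in stalls_per_cycle:
        for i, stalled in enumerate(slots):
            if stalled:
                counters[i] += 1
    return counters
```

For the trace `[(True, True), (False, True), (False, False)]`, the OR-reduced count is 2, whereas the per-slot counters are `[1, 2]`, three stalls in total.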
Hansung Kim
3718a57937 Docs 2024-03-21 15:44:50 -07:00
joshua
b254281295 initial tcore impl 2024-03-21 01:29:38 -07:00
Hansung Kim
9438862389 Add perf counter for barrier schedule stalls 2024-03-20 15:29:28 -07:00
joshua
f9b4509936 initial tensor core 2024-03-20 02:46:00 -07:00
joshua
978dd3bdfe seemingly working fp32 implementation 2024-03-19 17:56:59 -07:00
Hansung Kim
7014ae24da Prettier perf count reports 2024-03-19 15:25:46 -07:00
Hansung Kim
b25deb8a2e Fix assignment for perf counters 2024-03-19 14:06:44 -07:00
Hansung Kim
df4b21507e Customize global barrier response logic for clusters 2024-03-18 14:30:32 -07:00
Hansung Kim
2525df9c5f Use GBAR_CLUSTER_ENABLE to guard cluster-specific modification 2024-03-17 18:24:04 -07:00
Hansung Kim
7f8abe99ff Fix wrong multicore parametrization in wrapper 2024-03-17 18:23:09 -07:00
Hansung Kim
40e2888733 Connect core gbar signals in wrapper 2024-03-17 14:09:43 -07:00
Hansung Kim
28f54bde7f Merge remote-tracking branch 'sungwoong/master' into rtl 2024-03-14 09:15:59 -07:00
Hansung Kim
bd67ff3439 Fix creating bogus mem reqs when commit is stalled
When commit stage is stalled, LSU ready is deasserted for mem writes
since stores commit immediately; however, the same was not applied to
valid, creating duplicate memory write requests.  Fix by guarding both
ready and valid properly.
2024-03-13 20:43:27 -07:00
Hansung Kim
8317a3fbe5 Fix fence by disallowing x-initialization instead of all-0 mask
Setting mem_req_mask to all-zero triggers an assertion error in
mem_scheduler.  Instead, disallow initialize-by-x in instruction decode
which is the source of x-propagation.  Since this seems to only happen
in VCS, define-gate it accordingly.

This reverts commit a15f4fd483.
2024-03-07 17:39:18 -08:00
Hansung Kim
010c4675ce Fix undeclared mem_perf_if 2024-03-07 15:00:43 -08:00
Hansung Kim
b63333a4ec Merge remote-tracking branch 'upstream/master' into vortex2 2024-03-07 14:45:48 -08:00
joshua
beb3dce46d integer reduction unit 2024-03-06 01:39:17 -08:00
Hansung Kim
e7b0a149c7 Print TAG_ONLY_WIDTH of req_tag in trace
... for use in trace parser.  Full req_tag includes debug information
that complicates matching request to a corresponding request by tag.
2024-03-04 21:10:59 -08:00
Sungwoong Ha
3c2a266d37 second pass 2024-03-01 21:27:26 -08:00
Sungwoong Ha
a9709edae2 first pass 2024-03-01 21:05:52 -08:00
Sungwoong Ha
be7d87c82d temp 2024-02-22 16:31:42 -08:00
Blaise Tine
5f2b10b8a6 minor update 2024-02-09 21:20:23 -08:00
Blaise Tine
3fee1a6193 minor update 2024-02-09 20:34:44 -08:00
Blaise Tine
ae7b01405c CI minor update 2024-02-08 14:10:00 -08:00
Blaise Tine
be0db6e1a5 minor update 2024-02-04 20:32:05 -08:00
Blaise Tine
50028c1a33 Merge remote-tracking branch 'origin' into develop 2024-02-04 20:19:30 -08:00
Blaise Tine
8d4b6c804f minor update 2024-02-04 20:17:12 -08:00
Blaise Tine
6f7a389a1f arbiters unlock refactoring 2024-02-04 20:16:18 -08:00
Blaise Tine
fe15647f98 minor update 2024-02-04 02:11:53 -08:00
Blaise Tine
b0b7cd2b1e minor updates 2024-02-03 19:09:53 -08:00
Hansung Kim
eb63767051 Don't hardcode SIMULATION 2024-02-01 23:58:06 -08:00
Hansung Kim
48558982f7 Merge remote-tracking branch 'upstream/master' into vortex2 2024-02-01 23:35:58 -08:00
Hansung Kim
a15f4fd483 [BUGFIX] Set mem_req_mask to 0 for fence
Fence instructions have address field set to X's which propagates to
cache_req_ready, causing issue stalls.  Fix this by setting req_mask to all-zero
so that they can be handled unaffected by x-propagation.

Setting req_valid to 0 does not fix the problem because the LSU only commits
instructions when they have a matching response coming back.
2024-02-01 22:44:33 -08:00
Blaise Tine
f9cd8be19e minor update 2024-01-31 13:35:43 -08:00
Blaise Tine
dab262e4f7 Merge branch 'develop' of https://github.com/vortexgpgpu/vortex into develop 2024-01-31 12:03:50 -08:00
Blaise Tine
8ab7c590fd disabling fetch's deadlock check when L1 caches are present 2024-01-31 06:16:54 -08:00
Blaise Tine
e2d1387df8 elastic buffers classification 2024-01-31 00:39:37 -08:00
Shinnung Jeong
fd65ed95eb fix bug to access memory address in simx 2024-01-30 20:45:47 -05:00
Blaise Tine
b31d868a27 Merge branch 'develop' 2024-01-28 17:34:46 -08:00
Blaise Tine
b6919d19a7 minor update 2024-01-28 17:34:07 -08:00
Blaise Tine
6045597ad0 Merge branch 'develop' 2024-01-28 00:25:55 -08:00
Blaise Tine
1c1140d517 Merge branch 'develop' of https://github.com/vortexgpgpu/vortex into develop 2024-01-28 00:25:16 -08:00
Blaise Tine
38b92ad592 - using SV_DPI defines to disable DPI in synthesis-based simulations
- fixed Intel ASE run script: run_ase.sh
2024-01-28 00:22:21 -08:00
Hansung Kim
4643edf3e9 Properly determine core finish 2024-01-26 14:23:52 -08:00
Hansung Kim
c9d1275f0e Define SIMULATION under VERILATOR 2024-01-25 23:23:34 -08:00
Hansung Kim
b9b675a288 Add VX_config.h 2024-01-25 23:23:17 -08:00
Hansung Kim
60d4180249 Increase LSUQ and IBUF size 2024-01-16 23:53:14 -08:00
lpc97667
a9d578f3ab Docs update 2024-01-10 15:56:22 -05:00
Hansung Kim
62171c0788 Change dmem/smem width to LSU lanes not core lanes 2024-01-04 01:34:24 -08:00
Hansung Kim
fd425f1cdf Change smem bundles into flattened 1-D arrays 2024-01-04 00:52:56 -08:00
Hansung Kim
e6f6d4ea06 Change dmem bundles into flattened 1-D arrays 2024-01-04 00:37:59 -08:00
Blaise Tine
f0e6a435f8 Merge branch 'develop' 2024-01-03 19:09:49 -08:00
Blaise Tine
648bf75b0b minor update 2024-01-03 19:09:18 -08:00
Blaise Tine
3b75418ea9 Merge branch 'develop' 2024-01-03 10:24:48 -08:00
Blaise Tine
f2e8317412 updated documentation 2024-01-03 10:23:38 -08:00
Hansung Kim
ab55c04d0c Add localparam for internal/external smem switch 2024-01-01 19:48:06 -08:00
Hansung Kim
b64f0c2794 Add if-stmt to switch between external/internal smem 2024-01-01 12:46:59 -08:00
Hansung Kim
22f656fec1 Add ports for smem TL and connect to smem bus 2024-01-01 02:22:49 -08:00
Hansung Kim
b6cc0c285e Remove unused tilelink ports in VX_core_wrapper 2024-01-01 01:09:55 -08:00
Hansung Kim
144521e19c Expose smem ports at VX_core top
smem_unit stays inside the core, and the two separate buses to dcache
and smem are exposed at VX_core.

Currently core_wrapper ties req valid to 1'b0, stalling kernels that
read from sharedmem.
2023-12-31 23:57:31 -08:00
Blaise Tine
cc042a4098 Merge branch 'develop' 2023-12-31 15:30:20 -08:00
Blaise Tine
bd18b03cc3 minor update 2023-12-31 15:29:04 -08:00
Blaise Tine
e7f8b40d93 minor update 2023-12-31 11:46:41 -08:00
Blaise Tine
ec2a35def9 Merge branch 'develop' 2023-12-31 11:26:48 -08:00
Blaise Tine
031d24e695 minor updates 2023-12-30 00:52:44 -08:00
Hansung Kim
158624bc1b Write operand to file in matmul kernel 2023-12-30 00:28:55 -08:00
Blaise Tine
645ca62c91 Merge branch 'develop' 2023-12-29 15:14:23 -08:00
Blaise Tine
7425446b15 fixed DESTDIR support in simulation Makefiles 2023-12-29 14:11:16 -08:00
Blaise Tine
a7548db5ec Merge branch 'develop' 2023-12-28 20:08:12 -08:00
Blaise Tine
e62d122c9b enabling temporary build directory for blackbox multiple instances 2023-12-28 20:06:10 -08:00
Blaise Tine
e8cbfb4a72 Merge branch 'develop' 2023-12-28 16:11:29 -08:00
Blaise Tine
51e621cdf1 minor update 2023-12-28 16:08:26 -08:00
Blaise Tine
afea903332 Merge branch 'develop' 2023-12-28 12:33:58 -08:00
Blaise Tine
36f5dd87fe minor update 2023-12-28 12:22:22 -08:00
Blaise Tine
e217bc2c23 adding tracking for SFU stalls 2023-12-28 12:12:11 -08:00
Blaise Tine
c7a81d1493 adding sockets support to simx and cache subsystem refactoring
minor update

minor update

minor updates
2023-12-20 15:16:12 -08:00
Blaise Tine
914b680aed operands optimization
minor updates

minor updates

minor update

operands optimization

minor updates

minor updates
2023-12-20 15:07:23 -08:00
Blaise Tine
2c6d84bac9 Merge branch 'develop' of https://github.com/vortexgpgpu/vortex into develop 2023-12-18 12:54:13 -08:00
Blaise Tine
39e6f95c2b operands optimization
minor updates

minor updates

minor update
2023-12-18 12:53:34 -08:00
Blaise Tine
5a2bc88d20 operands optimization
minor updates

minor updates
2023-12-18 04:44:01 -08:00
Blaise Tine
e04e026a14 profiling update
minor updates
2023-12-18 04:43:44 -08:00
Blaise Tine
c6845a4c8d profiling timing optimization
minor update

minor update

minor update
2023-12-18 04:43:10 -08:00
Blaise Tine
f5f9e3dfdb profiling timing optimization 2023-12-18 04:43:10 -08:00
Blaise Tine
6c7ac35054 profiling optimizations
minor updates
2023-12-18 04:43:00 -08:00
Blaise Tine
e5b41bcd66 wctl unit bug fix 2023-12-05 04:57:52 -08:00
Blaise Tine
1912f52bee profiling bug fix 2023-12-05 04:56:46 -08:00
root
900a1efaca BUFFER_EX refactoring 2023-12-05 04:55:50 -08:00
root
d288fb360c Merge branch 'develop' of https://github.com/vortexgpgpu/vortex into develop 2023-12-05 04:50:20 -08:00
Hyesoon Kim
63a4ccef16 Merge pull request #95 from Udit8348/develop-documentation
Documentation for Testing and Contributing
2023-12-01 09:20:21 -05:00
Udit Subramanya
0d5887b938 Merge branch 'develop' into develop-documentation
Attempted to directly push to develop, but permission was denied.
Therefore, I moved my changes to my development branch located on my fork.
I have permission to commit changes to my fork, and I can open a PR to bring those changes into main repo
2023-12-01 08:56:17 -05:00
Udit Subramanya
a43b7432a0 add environment setup readme 2023-12-01 08:55:01 -05:00
Udit Subramanya
af94d24963 Merge branch 'develop' into develop-documentation 2023-12-01 08:49:46 -05:00
Udit Subramanya
b20320236d adding documentation for contributing and documentation 2023-12-01 08:22:44 -05:00
Hansung Kim
5825680303 [BUGFIX] Revert way_idx fix
The added code results in width mismatch for NUM_WAYS = 4.
2023-11-28 18:44:47 -08:00
Hansung Kim
c3c9a4b5d8 [BUGFIX] Fix wrong bitwidth of way_idx when NUM_WAYS=1
When NUM_WAYS=1, CLOG2(NUM_WAYS)-1 becomes -1, setting the MSB of
way_idx to a wrong value.
2023-11-28 16:05:41 -08:00
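Note on c3c9a4b5d8: the arithmetic behind the NUM_WAYS=1 case can be checked with a short sketch. It assumes CLOG2 behaves like SystemVerilog's `$clog2` and that way_idx is declared with a `[MSB:0]` range; the helper names are hypothetical.

```python
def clog2(n):
    """Ceiling of log2(n) for n >= 1; clog2(1) is 0."""
    return (n - 1).bit_length()


def way_idx_msb(num_ways):
    """MSB index of a [CLOG2(NUM_WAYS)-1 : 0] range."""
    return clog2(num_ways) - 1


def packed_width(msb, lsb=0):
    """Number of bits in a packed range [msb:lsb]."""
    return abs(msb - lsb) + 1
```

`way_idx_msb(1)` is -1, so the range becomes `[-1:0]`, which `packed_width` reports as 2 bits rather than the intended single bit. For `NUM_WAYS = 4` the MSB is 1 and the width is 2, as expected.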
Hansung Kim
9a8020a683 Force-include gpu_pkg in VX_cache_define.vh 2023-11-28 13:55:11 -08:00
Blaise Tine
9c2916f3fc minor update 2023-11-28 12:03:48 -08:00
Blaise Tine
e8d56dc013 minor update 2023-11-27 22:16:36 -08:00
Hansung Kim
5e5c625759 Write 0 instead of x for VX_CSR_MPM_RESERVED
Otherwise it makes verification hard with tools that don't process x's
well.
2023-11-27 16:06:16 -08:00
Hansung Kim
f41b50fc07 Define DBG_TRACE_CORE_PIPELINE_VCS for selective debug trace 2023-11-27 16:05:15 -08:00
Blaise Tine
24973ffca0 scoreboard optimization & profiling 2023-11-27 05:53:36 -08:00
Blaise Tine
4b68235389 fixed simx dispatcher bug 2023-11-27 04:50:55 -08:00
Blaise Tine
9dc5793046 minor update 2023-11-27 02:21:47 -08:00
Blaise Tine
1271c9c03f minor update 2023-11-27 02:12:12 -08:00
Blaise Tine
ebec982434 minor update 2023-11-27 02:04:53 -08:00
Blaise Tine
2f1171ca76 minor update 2023-11-27 02:04:22 -08:00
Hansung Kim
99207c862c Revert PutPartial -> PutFull spoofing 2023-11-19 17:48:38 -08:00
Blaise Tine
11752b2562 Merge branch 'develop' of https://github.com/vortexgpgpu/vortex into develop 2023-11-18 00:27:46 -08:00
Hansung Kim
e2d4894343 Add missing valid bit check for write acks 2023-11-17 20:32:53 -08:00
Hansung Kim
bc71c126ef Fix STORE HEAP trace print in verilog wrapper 2023-11-17 20:25:01 -08:00
Hansung Kim
faf5fe3838 Assert ready when write response is coming back
Since the core's response ready signal depends on response valid, but core does
not accept write ACKs, we need to manually assert ready when there is a valid
response coming in for a write regardless of the core's ready state (which would
be 0).
2023-11-17 19:08:32 -08:00
Hansung Kim
90e21e8e58 [CHANGE] Work around uninitialized signal issue with === operator
It seems many of the initial arch/uarch states, including the GPR, are
uninitialized in the VCS simulation, which results in functional errors caused
by propagated X's.  In this particular case it resulted in a dcache request not
being fired due to the rs1 data for an lw instruction having values as X,
causing the smem_unit to not arbitrate the request correctly.

A workaround of this issue is to stop the X propagation by using the
===-operation instead of == in the GPR unit, which had been the main source of X
propagation into the raddr port of the GPR.

Also, we run the simulation with GSR_RESET set to 1 so that the contents of the
GPR are initialized at the beginning of the simulation (however, this alone does
not prevent reading in X's, hence this fix.)

FIXME: This is a slight deviation from the upstream code; ideally, we want to do
clean & full initialization of microarchitectural states.
2023-11-17 17:20:54 -08:00
Hansung Kim
9651cc6bc5 Fix wrong dcache tag width in wrapper
Need to use DCACHE_NOSM_TAG_WIDTH instead of DCACHE_TAG_WIDTH; otherwise, the
`ASSIGN_VX_MEM_BUS_IF macro in VX_smem_unit.sv assigns packed structs whose
tag fields have different widths, which misaligns the bits and produces wrong
memory addresses for the core requests.
2023-11-17 17:12:41 -08:00
Blaise Tine
43154cf738 minor updates 2023-11-16 23:41:59 -08:00
Hansung Kim
e2d3d93dea Properly initialize DCR in wrapper code 2023-11-16 17:59:57 -08:00
Blaise Tine
d65cc61df5 minor update 2023-11-16 12:00:37 -08:00
Hansung Kim
963c2765d9 Move force-include of gpu_pkg to non-cache modules 2023-11-15 22:02:44 -08:00
Hansung Kim
448a253af3 Add Verilog wrapper module for VX_core 2023-11-15 20:09:53 -08:00
Hansung Kim
bbacf9a25e Remove verilated vpi code, add missing includes for C++
Vortex rtlsim defines sim_trace_enabled... functions in the Verilated
C++ code for use in dpi_trace, which we don't need.
2023-11-15 20:06:58 -08:00
Hansung Kim
d9cb14d6e4 Fix include path in rvfloats.cpp to work with Chisel addResources
addResource() in Chisel flattens everything into the gen-collateral/ dir, so
relative paths cannot be used for includes.
2023-11-15 20:06:18 -08:00
Hansung Kim
7e0b63a3b3 Change result type for dpi calls from wire -> reg
VCS requires the output of the dpi calls to be of a type that can appear
on the LHS of a procedural assignment, i.e. reg type.  This seems to be a
different requirement from Verilator.
2023-11-15 19:26:12 -08:00
Hansung Kim
d2d7ee61bb Define SIMULATION for VCS in VX_platform.vh 2023-11-15 19:14:58 -08:00
Blaise Tine
547d916ae2 minor update 2023-11-15 13:00:06 -08:00
Blaise Tine
2c94e358b8 perf counter bug fix 2023-11-15 00:52:39 -08:00
Blaise Tine
ede5e1c311 minor update 2023-11-15 00:28:26 -08:00
Hansung Kim
512fc0da1c Copy VX_platform macros for VCS from VERILATOR 2023-11-15 00:20:18 -08:00
Hansung Kim
20a9e6d102 Force include VX_gpu_pkg as compile order workaround
addResource() calls in a Chisel BlackBox do not preserve the order of the
files being included; the actual compile order for these files is
rearranged to be alphabetical.

Therefore, while VX_gpu_pkg.sv has to be compiled before all the other
modules because it holds the top-level package definition, that order
cannot be ensured from Chisel.  As a hacky workaround, simply `include
this file in some of the sv files whose names come earlier than
VX_gpu_pkg in lexicographical order.
2023-11-14 23:00:43 -08:00
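The ordering problem described in commit 20a9e6d102 can be illustrated with a short shell sketch. It assumes the compile order is plain byte-wise lexicographic order, as the commit message describes. Every file name except VX_gpu_pkg.sv is a hypothetical placeholder, not taken from the repository listing.

```shell
# Hypothetical file list; only VX_gpu_pkg.sv is named in the commit message.
files="VX_gpu_pkg.sv
VX_alu_unit.sv
VX_cache.sv
VX_core.sv
VX_fpu_unit.sv
VX_smem_unit.sv"

# Sort byte-wise, then drop VX_gpu_pkg.sv and everything after it.
# What remains are the files that would be compiled before the package.
before_pkg=$(printf '%s\n' "$files" | LC_ALL=C sort | sed '/^VX_gpu_pkg\.sv$/,$d')
printf '%s\n' "$before_pkg"
```

Under these assumptions, the printed files are the candidates that would need the `include workaround, since they reach the compiler before the package definition does.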
Blaise Tine
61e3442ef8 adding opencl convolution benchmark 2023-11-14 22:31:30 -08:00
Blaise Tine
4e7a536918 adding tensor regression test. 2023-11-14 05:37:46 -08:00
Blaise Tine
ecf546bc4a minor update 2023-11-13 20:00:39 -08:00
Blaise Tine
b274b8cc21 minor updates 2023-11-13 00:23:15 -08:00
Blaise Tine
a08d3ebd42 minor update 2023-11-12 23:40:59 -08:00
Blaise Tine
62cdd8e993 minor update 2023-11-11 15:49:39 -08:00
Blaise Tine
64dc5e1667 Merge branch 'develop' 2023-11-10 02:57:42 -08:00
Blaise Tine
c1e168fdbe Vortex 2.0 changes:
+ Microarchitecture optimizations
+ 64-bit support
+ Xilinx FPGA support
+ LLVM-16 support
+ Refactoring and quality control fixes

minor update

minor update

minor update

minor update

minor update

minor update

cleanup

cleanup

cache bindings and memory perf refactoring

minor update

minor update

hw unit tests fixes

minor update

minor update

minor update

minor update

minor update

minor update

minor update

minor update

minor update

minor update

minor update

minor update

minor update

minor updates

minor updates

minor update

minor update

minor update

minor update

minor update

minor update

minor updates

minor updates

minor updates

minor updates

minor update

minor update
2023-11-10 02:47:05 -08:00
Blaise Tine
6e93787e59 minor update 2023-11-06 00:16:24 -08:00
Blaise Tine
e0becb1599 minor update 2023-11-05 20:03:31 -08:00
Blaise Tine
d13c5f2986 hw unit tests fixes 2023-11-05 18:51:31 -08:00
Blaise Tine
1fd5a95f5a minor update 2023-11-03 18:04:05 -04:00
Blaise Tine
9f1f1ecaa3 minor update 2023-11-03 08:36:28 -04:00
Blaise Tine
c9e6518e05 cache bindings and memory perf refactoring 2023-11-03 08:18:18 -04:00
Blaise Tine
69f9ae778d cleanup 2023-11-03 08:12:03 -04:00
Blaise Tine
970cbf066a cleanup 2023-11-03 08:09:59 -04:00
Blaise Tine
1c100c4cf5 minor update 2023-10-22 23:31:58 -07:00
Blaise Tine
cb7d6b964c minor update 2023-10-22 02:25:34 -07:00
Blaise Tine
8cf833b7eb minor update 2023-10-21 19:12:07 -07:00
Blaise Tine
8fe373891f minor update 2023-10-21 17:55:29 -07:00
Blaise Tine
3cacb4f80f minor update 2023-10-20 02:21:20 -07:00
Blaise Tine
65ca0fff3a minor update 2023-10-20 00:48:05 -07:00
Blaise Tine
d47cccc157 Vortex 2.0 changes:
+ Microarchitecture optimizations
+ 64-bit support
+ Xilinx FPGA support
+ LLVM-16 support
+ Refactoring and quality control fixes
2023-10-19 20:51:22 -07:00
1402 changed files with 260282 additions and 327014 deletions

5
.gitmodules vendored

@@ -1,12 +1,9 @@
[submodule "third_party/fpnew"]
path = third_party/fpnew
url = https://github.com/pulp-platform/fpnew.git
url = https://github.com/richardyrh/cvfpu.git
[submodule "third_party/softfloat"]
path = third_party/softfloat
url = https://github.com/ucb-bar/berkeley-softfloat-3.git
[submodule "third_party/cocogfx"]
path = third_party/cocogfx
url = https://github.com/gtcasl/cocogfx.git
[submodule "third_party/ramulator"]
path = third_party/ramulator
url = https://github.com/CMU-SAFARI/ramulator.git


@@ -1,76 +1,90 @@
language: cpp
dist: bionic
dist: focal
os: linux
compiler: gcc
addons:
apt:
sources:
- ubuntu-toolchain-r-test
packages:
- build-essential
- valgrind
- verilator
- yosys
- libpng-dev
- libboost-serialization-dev
- libstdc++6
install:
# Set environments
- export RISCV_TOOLCHAIN_PATH=/opt/riscv-gnu-toolchain
- export VERILATOR_ROOT=/opt/verilator
- export PATH=$VERILATOR_ROOT/bin:$PATH
# Install toolchain
- ci/toolchain_install.sh -all
# build project
- make -s
addons:
apt:
packages:
- build-essential
- valgrind
- libstdc++6
env:
global:
- TOOLDIR=$HOME/tools
cache:
directories:
- $TOOLDIR
- $HOME/build32
- $HOME/build64
before_install:
- if [ ! -d "$TOOLDIR" ] || [ -z "$(ls -A $TOOLDIR)" ]; then
mkdir -p $TOOLDIR;
OSDIR=ubuntu/focal ./ci/toolchain_install.sh --all;
fi
- source ./ci/toolchain_env.sh
# stages ordering
stages:
- setup
- test
jobs:
include:
- stage: test
name: coverage
script: cp -r $PWD ../build_coverage && cd ../build_coverage && ./ci/travis_run.py ./ci/regression.sh -coverage
- stage: test
name: coverage64
script: cp -r $PWD ../build_coverage64 && cd ../build_coverage64 && ./ci/travis_run.py ./ci/regression64.sh -coverage
- stage: test
name: tex
script: cp -r $PWD ../build_tex && cd ../build_tex && ./ci/travis_run.py ./ci/regression.sh -tex
- stage: test
- stage: setup
script:
- rm -rf $HOME/build32 && cp -r $PWD $HOME/build32
- rm -rf $HOME/build64 && cp -r $PWD $HOME/build64
- make -C $HOME/build32 > /dev/null
- XLEN=64 make -C $HOME/build64 > /dev/null
- stage: test
name: unittest
script: cp -r $HOME/build32 build && cd build && ./ci/travis_run.py ./ci/regression.sh --unittest
- stage: test
name: isa
script: cp -r $HOME/build32 build && cd build && ./ci/travis_run.py ./ci/regression.sh --isa
- stage: test
name: isa64
script: cp -r $HOME/build64 build && cd build && XLEN=64 ./ci/travis_run.py ./ci/regression.sh --isa
- stage: test
name: regression
script: cp -r $HOME/build32 build && cd build && ./ci/travis_run.py ./ci/regression.sh --regression
- stage: test
name: regression64
script: cp -r $HOME/build64 build && cd build && XLEN=64 ./ci/travis_run.py ./ci/regression.sh --regression
- stage: test
name: opencl
script: cp -r $HOME/build32 build && cd build && ./ci/travis_run.py ./ci/regression.sh --opencl
- stage: test
name: cluster
script: cp -r $PWD ../build_cluster && cd ../build_cluster && ./ci/travis_run.py ./ci/regression.sh -cluster
- stage: test
script: cp -r $HOME/build32 build && cd build && ./ci/travis_run.py ./ci/regression.sh --cluster
- stage: test
name: config
script: cp -r $PWD ../build_config && cd ../build_config && ./ci/travis_run.py ./ci/regression.sh -config
script: cp -r $HOME/build32 build && cd build && ./ci/travis_run.py ./ci/regression.sh --config
- stage: test
name: debug
script: cp -r $PWD ../build_debug && cd ../build_debug && ./ci/travis_run.py ./ci/regression.sh -debug
- stage: test
script: cp -r $HOME/build32 build && cd build && ./ci/travis_run.py ./ci/regression.sh --debug
- stage: test
name: stress0
script: cp -r $PWD ../build_stress0 && cd ../build_stress0 && ./ci/travis_run.py ./ci/regression.sh -stress0
- stage: test
script: cp -r $HOME/build32 build && cd build && ./ci/travis_run.py ./ci/regression.sh --stress0
- stage: test
name: stress1
script: cp -r $PWD ../build_stress1 && cd ../build_stress1 && ./ci/travis_run.py ./ci/regression.sh -stress1
- stage: test
name: compiler
script: cp -r $PWD ../build_compiler && cd ../build_compiler && ./ci/travis_run.py ./ci/test_compiler.sh
- stage: test
name: tex
script: cp -r $PWD ../build_tex && cd ../build_tex && ./ci/travis_run.py ./ci/regression.sh -tex
- stage: test
name: unittest
script: cp -r $PWD ../build_unittest && cd ../build_unittest && ./ci/travis_run.py ./ci/regression.sh -unittest
script: cp -r $HOME/build32 build && cd build && ./ci/travis_run.py ./ci/regression.sh --stress1
- stage: test
name: synthesis
script: cp -r $HOME/build32 build && cd build && ./ci/travis_run.py ./ci/regression.sh --synthesis
- stage: test
name: synthesis64
script: cp -r $HOME/build64 build && cd build && XLEN=64 ./ci/travis_run.py ./ci/regression.sh --synthesis
after_success:
# Gather code coverage
- lcov --directory driver --capture --output-file driver.cov # capture trace
- lcov --directory simx --capture --output-file simx.cov # capture trace
- lcov --list driver.cov # output coverage data for debugging
- lcov --list simx.cov # output coverage data for debugging
- lcov --directory runtime --capture --output-file runtime.cov # capture trace
- lcov --directory sim --capture --output-file sim.cov # capture trace
- lcov --list runtime.cov # output coverage data for debugging
- lcov --list sim.cov # output coverage data for debugging
# Upload coverage report
- bash <(curl -s https://codecov.io/bash) -f driver.cov
- bash <(curl -s https://codecov.io/bash) -f simx.cov
- bash <(curl -s https://codecov.io/bash) -f runtime.cov
- bash <(curl -s https://codecov.io/bash) -f sim.cov
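The `before_install` step in the updated configuration above runs the toolchain installer only when the cached `$TOOLDIR` is missing or empty. The guard can be sketched in isolation as follows. `needs_install` is a hypothetical helper name introduced for illustration, and the installer call is replaced by a status string so the sketch can run anywhere.

```shell
# needs_install is a hypothetical helper; it mirrors the test used in before_install.
needs_install() {
    dir=$1
    # Succeeds when the directory does not exist or contains no entries.
    [ ! -d "$dir" ] || [ -z "$(ls -A "$dir")" ]
}

demo=$(mktemp -d)

# First check: the tools directory does not exist yet.
if needs_install "$demo/tools"; then first="install"; else first="cached"; fi

# Populate the directory, as a restored cache would.
mkdir -p "$demo/tools" && touch "$demo/tools/marker"
if needs_install "$demo/tools"; then second="install"; else second="cached"; fi

echo "$first $second"   # prints "install cached"
rm -rf "$demo"
```

Checking for an empty directory as well as a missing one matters here because a CI cache may restore the directory itself without any contents on the first run.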

221
LICENSE

@@ -1,24 +1,201 @@
Copyright (c) <2020>, <Georgia Institute of Technology>
All rights reserved.
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of the Georgia Institute of Technology nor the
names of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL COPYRIGHT HOLDER BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.


@@ -2,13 +2,27 @@ all:
$(MAKE) -C third_party
$(MAKE) -C hw
$(MAKE) -C sim
$(MAKE) -C driver
$(MAKE) -C kernel
$(MAKE) -C runtime
$(MAKE) -C tests
clean:
$(MAKE) -C hw clean
$(MAKE) -C sim clean
$(MAKE) -C driver clean
$(MAKE) -C kernel clean
$(MAKE) -C runtime clean
$(MAKE) -C tests clean
$(MAKE) -C tests clean
clean-all:
$(MAKE) -C third_party clean
$(MAKE) -C hw clean
$(MAKE) -C sim clean
$(MAKE) -C kernel clean
$(MAKE) -C runtime clean
$(MAKE) -C tests clean-all
crtlsim:
$(MAKE) -C sim clean
brtlsim:
$(MAKE) -C sim


@@ -1,22 +1,25 @@
[![Build Status](https://travis-ci.com/vortexgpgpu/vortex.svg?branch=master)](https://travis-ci.com/vortexgpgpu/vortex)
[![codecov](https://codecov.io/gh/vortexgpgpu/vortex/branch/master/graph/badge.svg)](https://codecov.io/gh/vortexgpgpu/vortex)
# Vortex OpenGPU
# Vortex GPGPU
Vortex is a full-system RISCV-based GPGPU processor.
Vortex is a full-stack open-source RISC-V GPGPU.
## Specifications
- Support RISC-V RV32IMF ISA
- Performance:
- 1024 total threads running at 250 MHz
- 128 Gflops of compute bandwidth
- 16 GB/s of memory bandwidth
- Scalability: up to 64 cores with optional L2 and L3 caches
- Software: OpenCL 1.2 Support
- Support RISC-V RV32IMAF and RV64IMAFD
- Microarchitecture:
- configurable number of cores, warps, and threads.
- configurable number of ALU, FPU, LSU, and SFU units per core.
- configurable pipeline issue width.
- optional shared memory, L1, L2, and L3 caches.
- Software:
- OpenCL 1.2 Support.
- Supported FPGAs:
- Intel Arria 10
- Intel Stratix 10
- Altera Arria 10
- Altera Stratix 10
- Xilinx Alveo U50, U250, U280
- Xilinx Versal VCK5000
## Directory structure
@@ -30,14 +33,20 @@ Vortex is a full-system RISCV-based GPGPU processor.
- `miscs`: Miscellaneous resources.
## Build Instructions
More detailed build instructions can be found [here](docs/install_vortex.md).
### Supported OS Platforms
- Ubuntu 18.04
- Ubuntu 18.04, 20.04
- CentOS 7
### Toolchain Dependencies
- [POCL](http://portablecl.org/)
- [LLVM](https://llvm.org/)
- [RISCV-GNU-TOOLCHAIN](https://github.com/riscv-collab/riscv-gnu-toolchain)
- [Verilator](https://www.veripool.org/verilator)
- [FpNew](https://github.com/pulp-platform/fpnew.git)
- [SoftFloat](https://github.com/ucb-bar/berkeley-softfloat-3.git)
- [Ramulator](https://github.com/CMU-SAFARI/ramulator.git)
- [Yosys](https://github.com/YosysHQ/yosys)
- [Sv2v](https://github.com/zachjs/sv2v)
### Install development tools
$ sudo apt-get install build-essential
$ sudo apt-get install git
@@ -45,8 +54,12 @@ Vortex is a full-system RISCV-based GPGPU processor.
$ git clone --recursive https://github.com/vortexgpgpu/vortex.git
$ cd vortex
### Install prebuilt toolchain
$ ./ci/toolchain_install.sh -all
By default, the toolchain is installed to the /opt folder, which requires sudo access.
You can install the toolchain to a different location of your choice by setting TOOLDIR (e.g. export TOOLDIR=$HOME/tools).
$ export TOOLDIR=/opt
$ ./ci/toolchain_install.sh --all
$ source ./ci/toolchain_env.sh
### Build Vortex sources
$ make -s
### Quick demo running vecadd OpenCL kernel on 2 cores
$ ./ci/blackbox.sh --driver=rtlsim --cores=2 --app=vecadd
$ ./ci/blackbox.sh --cores=2 --app=vecadd


@@ -1,4 +0,0 @@
Release Notes!
* 07/01/2020 - LKG FPGA build - Passed basic, demo, vecadd kernels.

23
TODO

@@ -1,23 +0,0 @@
Functionality:
1) vx_cl_warpSpawn()
-> To be used by pocl->ops->run
2) newlib Integration (LoadFile(""))
-> To be used by the Rhinio benchmarks
3) POCL OPS Vortex Suite
Performance:
1) Icache doesn't need SEND_MEM_REQUEST Stage
-> Blocks are never dirty, so why not evict right away
2) Branch not taken speculation
3) Runtime -02 not running on RTL, and -03 not running on RTL and Emulator
Vector:
1) Cycle accurate simulator (would require Cache Simulator)


@@ -1,26 +1,53 @@
#!/bin/sh
# Copyright © 2019-2023
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
show_usage()
{
echo "Vortex BlackBox Test Driver v1.0"
echo "Usage: [[--clusters=#n] [--cores=#n] [--warps=#n] [--threads=#n] [--l2cache] [--l3cache] [[--driver=rtlsim|vlsim|simx] [--debug] [--scope] [--perf] [--app=vecadd|sgemm|basic|demo|dogfood] [--args=<args>] [--help]]"
echo "Usage: $0 [[--clusters=#n] [--cores=#n] [--warps=#n] [--threads=#n] [--l2cache] [--l3cache] [[--driver=#name] [--app=#app] [--args=#args] [--debug=#level] [--scope] [--perf=#class] [--rebuild=#n] [--log=logfile] [--help]]"
}
show_help()
{
show_usage
echo " where"
echo "--driver: simx, rtlsim, opae, xrt"
echo "--app: any subfolder test under regression or opencl"
echo "--perf: 0=disable, 1=pipeline, 2=memsys"
echo "--rebuild: 0=disable, 1=force, 2=auto, 3=temp"
}
SCRIPT_DIR=$(dirname "$0")
VORTEX_HOME=$SCRIPT_DIR/..
DRIVER=vlsim
DRIVER=simx
APP=sgemm
CLUSTERS=1
CORES=1
WARPS=4
THREADS=4
L2=0
L3=0
L2=
L3=
DEBUG=0
DEBUG_LEVEL=0
SCOPE=0
HAS_ARGS=0
DEBUG_LEVEL=1
PERF_CLASS=0
REBUILD=2
TEMPBUILD=0
LOGFILE=run.log
for i in "$@"
do
@@ -50,14 +77,15 @@ case $i in
shift
;;
--l2cache)
L2=1
L2=-DL2_ENABLE
shift
;;
--l3cache)
L3=1
L3=-DL3_ENABLE
shift
;;
--debug)
--debug=*)
DEBUG_LEVEL=${i#*=}
DEBUG=1
shift
;;
@@ -66,8 +94,9 @@ case $i in
CORES=1
shift
;;
--perf)
--perf=*)
PERF_FLAG=-DPERF_ENABLE
PERF_CLASS=${i#*=}
shift
;;
--args=*)
@@ -75,33 +104,43 @@ case $i in
HAS_ARGS=1
shift
;;
--rebuild=*)
REBUILD=${i#*=}
shift
;;
--log=*)
LOGFILE=${i#*=}
shift
;;
--help)
show_usage
show_help
exit 0
;;
*)
show_usage
exit -1
;;
show_usage
exit -1
;;
esac
done
if [ $REBUILD -eq 3 ];
then
REBUILD=1
TEMPBUILD=1
fi
case $DRIVER in
rtlsim)
DRIVER_PATH=$VORTEX_HOME/driver/rtlsim
;;
vlsim)
DRIVER_PATH=$VORTEX_HOME/driver/vlsim
;;
asesim)
DRIVER_PATH=$VORTEX_HOME/driver/asesim
;;
fpga)
DRIVER_PATH=$VORTEX_HOME/driver/fpga
;;
simx)
DRIVER_PATH=$VORTEX_HOME/driver/simx
DEBUG_LEVEL=3
DRIVER_PATH=$VORTEX_HOME/runtime/simx
;;
rtlsim)
DRIVER_PATH=$VORTEX_HOME/runtime/rtlsim
;;
opae)
DRIVER_PATH=$VORTEX_HOME/runtime/opae
;;
xrt)
DRIVER_PATH=$VORTEX_HOME/runtime/xrt
;;
*)
echo "invalid driver: $DRIVER"
@@ -116,78 +155,156 @@ elif [ -d "$VORTEX_HOME/tests/regression/$APP" ];
then
APP_PATH=$VORTEX_HOME/tests/regression/$APP
else
echo "Application folder found: $APP"
echo "Application folder not found: $APP"
exit -1
fi
CONFIGS="-DNUM_CLUSTERS=$CLUSTERS -DNUM_CORES=$CORES -DNUM_WARPS=$WARPS -DNUM_THREADS=$THREADS -DL2_ENABLE=$L2 -DL3_ENABLE=$L3 $PERF_FLAG $CONFIGS"
CONFIGS="-DNUM_CLUSTERS=$CLUSTERS -DNUM_CORES=$CORES -DNUM_WARPS=$WARPS -DNUM_THREADS=$THREADS $L2 $L3 $PERF_FLAG $CONFIGS"
echo "CONFIGS=$CONFIGS"
BLACKBOX_CACHE=blackbox.$DRIVER.cache
if [ -f "$BLACKBOX_CACHE" ]
then
LAST_CONFIGS=`cat $BLACKBOX_CACHE`
fi
if [ "$CONFIGS+$DEBUG+$SCOPE" != "$LAST_CONFIGS" ];
if [ $REBUILD -ne 0 ]
then
make -C $DRIVER_PATH clean
BLACKBOX_CACHE=blackbox.$DRIVER.cache
if [ -f "$BLACKBOX_CACHE" ]
then
LAST_CONFIGS=`cat $BLACKBOX_CACHE`
fi
if [ $REBUILD -eq 1 ] || [ "$CONFIGS+$DEBUG+$SCOPE" != "$LAST_CONFIGS" ];
then
make -C $DRIVER_PATH clean > /dev/null
echo "$CONFIGS+$DEBUG+$SCOPE" > $BLACKBOX_CACHE
fi
fi
echo "$CONFIGS+$DEBUG+$SCOPE" > $BLACKBOX_CACHE
# export performance monitor class identifier
export PERF_CLASS=$PERF_CLASS
status=0
if [ $DEBUG -eq 1 ]
# ensure config update
make -C $VORTEX_HOME/hw config > /dev/null
# ensure the stub driver is present
make -C $VORTEX_HOME/runtime/stub > /dev/null
if [ $DEBUG -ne 0 ]
then
if [ $SCOPE -eq 1 ]
# running application
if [ $TEMPBUILD -eq 1 ]
then
echo "running: DEBUG=$DEBUG_LEVEL SCOPE=1 CONFIGS="$CONFIGS" make -C $DRIVER_PATH"
DEBUG=$DEBUG_LEVEL SCOPE=1 CONFIGS="$CONFIGS" make -C $DRIVER_PATH
# setup temp directory
TEMPDIR=$(mktemp -d)
mkdir -p "$TEMPDIR/$DRIVER"
# driver initialization
if [ $SCOPE -eq 1 ]
then
echo "running: DESTDIR=$TEMPDIR/$DRIVER DEBUG=$DEBUG_LEVEL SCOPE=1 CONFIGS=$CONFIGS make -C $DRIVER_PATH"
DESTDIR="$TEMPDIR/$DRIVER" DEBUG=$DEBUG_LEVEL SCOPE=1 CONFIGS="$CONFIGS" make -C $DRIVER_PATH > /dev/null
else
echo "running: DESTDIR=$TEMPDIR/$DRIVER DEBUG=$DEBUG_LEVEL CONFIGS=$CONFIGS make -C $DRIVER_PATH"
DESTDIR="$TEMPDIR/$DRIVER" DEBUG=$DEBUG_LEVEL CONFIGS="$CONFIGS" make -C $DRIVER_PATH > /dev/null
fi
# running application
if [ $HAS_ARGS -eq 1 ]
then
echo "running: VORTEX_RT_PATH=$TEMPDIR OPTS=$ARGS make -C $APP_PATH run-$DRIVER > $LOGFILE 2>&1"
VORTEX_RT_PATH=$TEMPDIR OPTS=$ARGS make -C $APP_PATH run-$DRIVER > $LOGFILE 2>&1
status=$?
else
echo "running: VORTEX_RT_PATH=$TEMPDIR make -C $APP_PATH run-$DRIVER > $LOGFILE 2>&1"
VORTEX_RT_PATH=$TEMPDIR make -C $APP_PATH run-$DRIVER > $LOGFILE 2>&1
status=$?
fi
# cleanup temp directory
trap "rm -rf $TEMPDIR" EXIT
else
echo "running: DEBUG=$DEBUG_LEVEL CONFIGS="$CONFIGS" make -C $DRIVER_PATH"
DEBUG=$DEBUG_LEVEL CONFIGS="$CONFIGS" make -C $DRIVER_PATH
fi
if [ $HAS_ARGS -eq 1 ]
then
echo "running: OPTS=$ARGS make -C $APP_PATH run-$DRIVER > run.log 2>&1"
OPTS=$ARGS make -C $APP_PATH run-$DRIVER > run.log 2>&1
status=$?
else
echo "running: make -C $APP_PATH run-$DRIVER > run.log 2>&1"
make -C $APP_PATH run-$DRIVER > run.log 2>&1
status=$?
# driver initialization
if [ $SCOPE -eq 1 ]
then
echo "running: DEBUG=$DEBUG_LEVEL SCOPE=1 CONFIGS=$CONFIGS make -C $DRIVER_PATH"
DEBUG=$DEBUG_LEVEL SCOPE=1 CONFIGS="$CONFIGS" make -C $DRIVER_PATH > /dev/null
else
echo "running: DEBUG=$DEBUG_LEVEL CONFIGS=$CONFIGS make -C $DRIVER_PATH"
DEBUG=$DEBUG_LEVEL CONFIGS="$CONFIGS" make -C $DRIVER_PATH > /dev/null
fi
# running application
if [ $HAS_ARGS -eq 1 ]
then
echo "running: OPTS=$ARGS make -C $APP_PATH run-$DRIVER > $LOGFILE 2>&1"
OPTS=$ARGS make -C $APP_PATH run-$DRIVER > $LOGFILE 2>&1
status=$?
else
echo "running: make -C $APP_PATH run-$DRIVER > $LOGFILE 2>&1"
make -C $APP_PATH run-$DRIVER > $LOGFILE 2>&1
status=$?
fi
fi
if [ -f "$APP_PATH/trace.vcd" ]
then
mv -f $APP_PATH/trace.vcd .
fi
else
echo "driver initialization..."
if [ $SCOPE -eq 1 ]
else
if [ $TEMPBUILD -eq 1 ]
then
echo "running: SCOPE=1 CONFIGS="$CONFIGS" make -C $DRIVER_PATH"
SCOPE=1 CONFIGS="$CONFIGS" make -C $DRIVER_PATH
# setup temp directory
TEMPDIR=$(mktemp -d)
mkdir -p "$TEMPDIR/$DRIVER"
# driver initialization
if [ $SCOPE -eq 1 ]
then
echo "running: DESTDIR=$TEMPDIR/$DRIVER SCOPE=1 CONFIGS=$CONFIGS make -C $DRIVER_PATH"
DESTDIR="$TEMPDIR/$DRIVER" SCOPE=1 CONFIGS="$CONFIGS" make -C $DRIVER_PATH > /dev/null
else
echo "running: DESTDIR=$TEMPDIR/$DRIVER CONFIGS=$CONFIGS make -C $DRIVER_PATH"
DESTDIR="$TEMPDIR/$DRIVER" CONFIGS="$CONFIGS" make -C $DRIVER_PATH > /dev/null
fi
# running application
if [ $HAS_ARGS -eq 1 ]
then
echo "running: VORTEX_RT_PATH=$TEMPDIR OPTS=$ARGS make -C $APP_PATH run-$DRIVER"
VORTEX_RT_PATH=$TEMPDIR OPTS=$ARGS make -C $APP_PATH run-$DRIVER
status=$?
else
echo "running: VORTEX_RT_PATH=$TEMPDIR make -C $APP_PATH run-$DRIVER"
VORTEX_RT_PATH=$TEMPDIR make -C $APP_PATH run-$DRIVER
status=$?
fi
# cleanup temp directory
trap "rm -rf $TEMPDIR" EXIT
else
echo "running: CONFIGS="$CONFIGS" make -C $DRIVER_PATH"
CONFIGS="$CONFIGS" make -C $DRIVER_PATH
fi
echo "running application..."
if [ $HAS_ARGS -eq 1 ]
then
echo "running: OPTS=$ARGS make -C $APP_PATH run-$DRIVER"
OPTS=$ARGS make -C $APP_PATH run-$DRIVER
status=$?
else
echo "running: make -C $APP_PATH run-$DRIVER"
make -C $APP_PATH run-$DRIVER
status=$?
# driver initialization
if [ $SCOPE -eq 1 ]
then
echo "running: SCOPE=1 CONFIGS=$CONFIGS make -C $DRIVER_PATH"
SCOPE=1 CONFIGS="$CONFIGS" make -C $DRIVER_PATH > /dev/null
else
echo "running: CONFIGS=$CONFIGS make -C $DRIVER_PATH"
CONFIGS="$CONFIGS" make -C $DRIVER_PATH > /dev/null
fi
# running application
if [ $HAS_ARGS -eq 1 ]
then
echo "running: OPTS=$ARGS make -C $APP_PATH run-$DRIVER"
OPTS=$ARGS make -C $APP_PATH run-$DRIVER
status=$?
else
echo "running: make -C $APP_PATH run-$DRIVER"
make -C $APP_PATH run-$DRIVER
status=$?
fi
fi
fi
exit $status
exit $status
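Note on the temp-directory path above: the script registers `trap "rm -rf $TEMPDIR" EXIT` only after the application run. A minimal sketch of registering cleanup immediately after `mktemp -d` is shown here. `with_tempdir` is a hypothetical helper name used for illustration and is not part of blackbox.sh.

```shell
#!/bin/bash
# Sketch only: with_tempdir is a hypothetical helper, not part of blackbox.sh.
# It creates a temp directory, registers cleanup right away, then runs the
# given command with the directory path appended as its last argument.
with_tempdir() {
    (
        tmp=$(mktemp -d) || exit 1
        # register cleanup before any work so a failing command still cleans up
        trap 'rm -rf "$tmp"' EXIT
        "$@" "$tmp"
        # exit explicitly so the command's status is what the caller sees
        exit $?
    )
}

# Example: print the temp directory path; the directory is removed on return.
with_tempdir echo "temp dir was:"
```

Running the body in a subshell keeps the trap local to the helper, so it does not replace any EXIT trap the calling script may already have set.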


@@ -1,73 +0,0 @@
#!/bin/bash
# exit when any command fails
set -e
OS_DIR=${OS_DIR:-'ubuntu/bionic'}
SRCDIR=${SRCDIR:-'/opt'}
DESTDIR=${DESTDIR:-'.'}
echo "OS_DIR=${OS_DIR}"
echo "SRCDIR=${SRCDIR}"
echo "DESTDIR=${DESTDIR}"
riscv()
{
echo "prebuilt riscv-gnu-toolchain..."
tar -C $SRCDIR -cvjf riscv-gnu-toolchain.tar.bz2 riscv-gnu-toolchain
split -b 50M riscv-gnu-toolchain.tar.bz2 "riscv-gnu-toolchain.tar.bz2.part"
mv riscv-gnu-toolchain.tar.bz2.part* $DESTDIR/riscv-gnu-toolchain/$OS_DIR
rm riscv-gnu-toolchain.tar.bz2
}
llvm()
{
echo "prebuilt llvm-riscv..."
tar -C $SRCDIR -cvjf llvm-vortex1.tar.bz2 llvm-riscv
split -b 50M llvm-vortex1.tar.bz2 "llvm-vortex1.tar.bz2.part"
mv llvm-vortex1.tar.bz2.part* $DESTDIR/llvm-vortex/$OS_DIR
rm llvm-vortex1.tar.bz2
}
pocl()
{
echo "prebuilt pocl..."
tar -C $SRCDIR -cvjf pocl1.tar.bz2 pocl
mv pocl1.tar.bz2 $DESTDIR/pocl/$OS_DIR
}
verilator()
{
echo "prebuilt verilator..."
tar -C $SRCDIR -cvjf verilator.tar.bz2 verilator
mv verilator.tar.bz2 $DESTDIR/verilator/$OS_DIR
}
usage()
{
echo "usage: prebuilt [[-riscv] [-llvm] [-pocl] [-verilator] [-all] [-h|--help]]"
}
while [ "$1" != "" ]; do
case $1 in
-pocl ) pocl
;;
-verilator ) verilator
;;
-riscv ) riscv
;;
-llvm ) llvm
;;
-all ) riscv
llvm
pocl
verilator
;;
-h | --help ) usage
exit
;;
* ) usage
exit 1
esac
shift
done

View File

@@ -1,44 +1,106 @@
#!/bin/bash
# Copyright © 2019-2023
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# exit when any command fails
set -e
# ensure build
make -s
# clear blackbox cache
rm -f blackbox.*.cache
unittest()
{
make -C tests/unittest run
make -C hw/unittest > /dev/null
}
coverage()
isa()
{
echo "begin coverage tests..."
echo "begin isa tests..."
make -C tests/runtime run-rtlsim
make -C tests/riscv/isa run-rtlsim
make -C tests/regression run-vlsim
make -C tests/opencl run-vlsim
make -C tests/runtime run-simx
make -C tests/riscv/isa run-simx
make -C tests/regression run-simx
make -C tests/opencl run-simx
make -C tests/riscv/isa run-rtlsim
echo "coverage tests done!"
make -C sim/rtlsim clean && CONFIGS="-DDPI_DISABLE" make -C sim/rtlsim > /dev/null
make -C tests/riscv/isa run-rtlsim
make -C sim/rtlsim clean && CONFIGS="-DFPU_FPNEW" make -C sim/rtlsim > /dev/null
make -C tests/riscv/isa run-rtlsim-32f
make -C sim/rtlsim clean && CONFIGS="-DFPU_DPI" make -C sim/rtlsim > /dev/null
make -C tests/riscv/isa run-rtlsim-32f
make -C sim/rtlsim clean && CONFIGS="-DFPU_DSP" make -C sim/rtlsim > /dev/null
make -C tests/riscv/isa run-rtlsim-32f
if [ "$XLEN" == "64" ]
then
make -C sim/rtlsim clean && CONFIGS="-DFPU_FPNEW" make -C sim/rtlsim > /dev/null
make -C tests/riscv/isa run-rtlsim-64f
make -C sim/rtlsim clean && CONFIGS="-DEXT_D_ENABLE -DFPU_FPNEW" make -C sim/rtlsim > /dev/null
make -C tests/riscv/isa run-rtlsim-64d || true
make -C sim/rtlsim clean && CONFIGS="-DFPU_DPI" make -C sim/rtlsim > /dev/null
make -C tests/riscv/isa run-rtlsim-64f
make -C sim/rtlsim clean && CONFIGS="-DFPU_DSP" make -C sim/rtlsim > /dev/null
make -C tests/riscv/isa run-rtlsim-64fx
fi
# restore default prebuilt configuration
make -C sim/rtlsim clean && make -C sim/rtlsim > /dev/null
echo "isa tests done!"
}
tex()
regression()
{
echo "begin texture tests..."
echo "begin regression tests..."
CONFIGS="-DEXT_TEX_ENABLE=1" ./ci/blackbox.sh --driver=vlsim --app=tex --args="-isoccer.png -osoccer_result.png -g0"
CONFIGS="-DEXT_TEX_ENABLE=1" ./ci/blackbox.sh --driver=simx --app=tex --args="-isoccer.png -osoccer_result.png -g0"
CONFIGS="-DEXT_TEX_ENABLE=1" ./ci/blackbox.sh --driver=rtlsim --app=tex --args="-itoad.png -otoad_result.png -g1"
CONFIGS="-DEXT_TEX_ENABLE=1" ./ci/blackbox.sh --driver=simx --app=tex --args="-irainbow.png -orainbow_result.png -g2"
CONFIGS="-DEXT_TEX_ENABLE=1" ./ci/blackbox.sh --driver=rtlsim --app=tex --args="-itoad.png -otoad_result.png -g1" --perf
CONFIGS="-DEXT_TEX_ENABLE=1" ./ci/blackbox.sh --driver=simx --app=tex --args="-itoad.png -otoad_result.png -g1" --perf
make -C tests/kernel run-simx
make -C tests/kernel run-rtlsim
echo "coverage texture done!"
make -C tests/regression run-simx
make -C tests/regression run-rtlsim
# test FPU hardware implementations
CONFIGS="-DFPU_DPI" ./ci/blackbox.sh --driver=rtlsim --app=dogfood
CONFIGS="-DFPU_DSP" ./ci/blackbox.sh --driver=rtlsim --app=dogfood
CONFIGS="-DFPU_FPNEW" ./ci/blackbox.sh --driver=rtlsim --app=dogfood
# test local barrier
./ci/blackbox.sh --driver=simx --app=dogfood --args="-n1 -t19"
./ci/blackbox.sh --driver=rtlsim --app=dogfood --args="-n1 -t19"
# test global barrier
CONFIGS="-DGBAR_ENABLE" ./ci/blackbox.sh --driver=simx --app=dogfood --args="-n1 -t20" --cores=2
CONFIGS="-DGBAR_ENABLE" ./ci/blackbox.sh --driver=rtlsim --app=dogfood --args="-n1 -t20" --cores=2
# test FPU core
echo "regression tests done!"
}
opencl()
{
echo "begin opencl tests..."
make -C tests/opencl run-simx
make -C tests/opencl run-rtlsim
echo "opencl tests done!"
}
cluster()
@@ -46,23 +108,26 @@ cluster()
echo "begin clustering tests..."
# warp/threads configurations
./ci/blackbox.sh --driver=rtlsim --cores=1 --warps=2 --threads=8 --app=demo
./ci/blackbox.sh --driver=rtlsim --cores=1 --warps=8 --threads=2 --app=demo
./ci/blackbox.sh --driver=simx --cores=1 --warps=8 --threads=16 --app=demo
./ci/blackbox.sh --driver=rtlsim --cores=1 --warps=1 --threads=1 --app=diverge
./ci/blackbox.sh --driver=rtlsim --cores=1 --warps=2 --threads=2 --app=diverge
./ci/blackbox.sh --driver=rtlsim --cores=1 --warps=2 --threads=8 --app=diverge
./ci/blackbox.sh --driver=rtlsim --cores=1 --warps=8 --threads=2 --app=diverge
./ci/blackbox.sh --driver=simx --cores=1 --warps=1 --threads=1 --app=diverge
./ci/blackbox.sh --driver=simx --cores=1 --warps=8 --threads=16 --app=diverge
# cores clustering
./ci/blackbox.sh --driver=rtlsim --cores=1 --clusters=1 --app=demo --args="-n1"
./ci/blackbox.sh --driver=rtlsim --cores=4 --clusters=1 --app=demo --args="-n1"
./ci/blackbox.sh --driver=rtlsim --cores=2 --clusters=2 --app=demo --args="-n1"
./ci/blackbox.sh --driver=simx --cores=4 --clusters=1 --app=demo --args="-n1"
./ci/blackbox.sh --driver=simx --cores=4 --clusters=2 --app=demo --args="-n1"
./ci/blackbox.sh --driver=rtlsim --cores=1 --clusters=1 --app=diverge --args="-n1"
./ci/blackbox.sh --driver=rtlsim --cores=4 --clusters=1 --app=diverge --args="-n1"
./ci/blackbox.sh --driver=rtlsim --cores=2 --clusters=2 --app=diverge --args="-n1"
./ci/blackbox.sh --driver=simx --cores=4 --clusters=1 --app=diverge --args="-n1"
./ci/blackbox.sh --driver=simx --cores=4 --clusters=2 --app=diverge --args="-n1"
# L2/L3
./ci/blackbox.sh --driver=rtlsim --cores=2 --l2cache --app=demo --args="-n1"
./ci/blackbox.sh --driver=rtlsim --cores=2 --clusters=2 --l3cache --app=demo --args="-n1"
./ci/blackbox.sh --driver=rtlsim --cores=2 --l2cache --app=diverge --args="-n1"
./ci/blackbox.sh --driver=rtlsim --cores=2 --clusters=2 --l3cache --app=diverge --args="-n1"
./ci/blackbox.sh --driver=rtlsim --cores=2 --clusters=2 --l2cache --l3cache --app=io_addr --args="-n1"
./ci/blackbox.sh --driver=simx --cores=4 --clusters=2 --l2cache --app=demo --args="-n1"
./ci/blackbox.sh --driver=simx --cores=4 --clusters=4 --l2cache --l3cache --app=demo --args="-n1"
./ci/blackbox.sh --driver=simx --cores=4 --clusters=2 --l2cache --app=diverge --args="-n1"
./ci/blackbox.sh --driver=simx --cores=4 --clusters=4 --l2cache --l3cache --app=diverge --args="-n1"
echo "clustering tests done!"
}
@@ -71,11 +136,23 @@ debug()
{
echo "begin debugging tests..."
./ci/blackbox.sh --driver=vlsim --cores=2 --clusters=2 --l2cache --perf --app=demo --args="-n1"
./ci/blackbox.sh --driver=simx --cores=2 --clusters=2 --l2cache --perf --app=demo --args="-n1"
./ci/blackbox.sh --driver=vlsim --cores=2 --clusters=2 --l2cache --debug --app=demo --args="-n1"
./ci/blackbox.sh --driver=simx --cores=2 --clusters=2 --l2cache --debug --app=demo --args="-n1"
./ci/blackbox.sh --driver=vlsim --cores=1 --scope --app=basic --args="-t0 -n1"
# test CSV trace generation
make -C sim/simx clean && DEBUG=3 make -C sim/simx > /dev/null
make -C sim/rtlsim clean && DEBUG=3 CONFIGS="-DGPR_RESET" make -C sim/rtlsim > /dev/null
make -C tests/riscv/isa run-simx-32im > run_simx.log
make -C tests/riscv/isa run-rtlsim-32im > run_rtlsim.log
./ci/trace_csv.py -trtlsim run_rtlsim.log -otrace_rtlsim.csv
./ci/trace_csv.py -tsimx run_simx.log -otrace_simx.csv
diff trace_rtlsim.csv trace_simx.csv
# restore default prebuilt configuration
make -C sim/simx clean && make -C sim/simx > /dev/null
make -C sim/rtlsim clean && make -C sim/rtlsim > /dev/null
./ci/blackbox.sh --driver=opae --cores=2 --clusters=2 --l2cache --perf=1 --app=demo --args="-n1"
./ci/blackbox.sh --driver=simx --cores=2 --clusters=2 --l2cache --perf=1 --app=demo --args="-n1"
./ci/blackbox.sh --driver=opae --cores=2 --clusters=2 --l2cache --debug=1 --app=demo --args="-n1"
./ci/blackbox.sh --driver=simx --cores=2 --clusters=2 --l2cache --debug=1 --app=demo --args="-n1"
./ci/blackbox.sh --driver=opae --cores=1 --scope --app=basic --args="-t0 -n1"
echo "debugging tests done!"
}
@@ -84,51 +161,77 @@ config()
{
echo "begin configuration tests..."
# disable DPI
CONFIGS="-DDPI_DISABLE -DFPU_FPNEW" ./ci/blackbox.sh --driver=rtlsim --app=dogfood
CONFIGS="-DDPI_DISABLE -DFPU_FPNEW" ./ci/blackbox.sh --driver=opae --app=dogfood
# issue width
CONFIGS="-DISSUE_WIDTH=1" ./ci/blackbox.sh --driver=rtlsim --app=diverge
CONFIGS="-DISSUE_WIDTH=2" ./ci/blackbox.sh --driver=rtlsim --app=diverge
CONFIGS="-DISSUE_WIDTH=1" ./ci/blackbox.sh --driver=simx --app=diverge
CONFIGS="-DISSUE_WIDTH=2" ./ci/blackbox.sh --driver=simx --app=diverge
# dispatch size
CONFIGS="-DNUM_ALU_BLOCK=1 -DNUM_ALU_LANES=1" ./ci/blackbox.sh --driver=rtlsim --app=diverge
CONFIGS="-DNUM_ALU_BLOCK=2 -DNUM_ALU_LANES=2" ./ci/blackbox.sh --driver=rtlsim --app=diverge
CONFIGS="-DNUM_ALU_BLOCK=1 -DNUM_ALU_LANES=1" ./ci/blackbox.sh --driver=simx --app=diverge
CONFIGS="-DNUM_ALU_BLOCK=2 -DNUM_ALU_LANES=2" ./ci/blackbox.sh --driver=simx --app=diverge
# FPU scaling
CONFIGS="-DNUM_ALU_BLOCK=4 -DNUM_FPU_LANES=2" ./ci/blackbox.sh --driver=rtlsim --app=sgemm
CONFIGS="-DNUM_ALU_BLOCK=2 -DNUM_FPU_LANES=4" ./ci/blackbox.sh --driver=rtlsim --app=sgemm
CONFIGS="-DNUM_ALU_BLOCK=4 -DNUM_FPU_LANES=4" ./ci/blackbox.sh --driver=rtlsim --app=sgemm
# custom program startup address
make -C tests/regression/dogfood clean-all
STARTUP_ADDR=0x40000000 make -C tests/regression/dogfood
CONFIGS="-DSTARTUP_ADDR=0x40000000" ./ci/blackbox.sh --driver=simx --app=dogfood
CONFIGS="-DSTARTUP_ADDR=0x40000000" ./ci/blackbox.sh --driver=rtlsim --app=dogfood
make -C tests/regression/dogfood clean-all
make -C tests/regression/dogfood
# disabling M extension
CONFIGS=-DEXT_M_DISABLE ./ci/blackbox.sh --driver=rtlsim --cores=1 --app=no_mf_ext
CONFIGS="-DEXT_M_DISABLE" ./ci/blackbox.sh --driver=rtlsim --cores=1 --app=no_mf_ext
# disabling F extension
CONFIGS=-DEXT_F_DISABLE ./ci/blackbox.sh --driver=rtlsim --cores=1 --app=no_mf_ext
CONFIGS=-DEXT_F_DISABLE ./ci/blackbox.sh --driver=rtlsim --cores=1 --app=no_mf_ext --perf
CONFIGS=-DEXT_F_DISABLE ./ci/blackbox.sh --driver=simx --cores=1 --app=no_mf_ext --perf
CONFIGS="-DEXT_F_DISABLE" ./ci/blackbox.sh --driver=rtlsim --cores=1 --app=no_mf_ext
CONFIGS="-DEXT_F_DISABLE" ./ci/blackbox.sh --driver=rtlsim --cores=1 --app=no_mf_ext --perf=1
CONFIGS="-DEXT_F_DISABLE" ./ci/blackbox.sh --driver=simx --cores=1 --app=no_mf_ext --perf=1
# disable shared memory
CONFIGS=-DSM_ENABLE=0 ./ci/blackbox.sh --driver=rtlsim --cores=1 --app=no_smem
CONFIGS=-DSM_ENABLE=0 ./ci/blackbox.sh --driver=rtlsim --cores=1 --app=no_smem --perf
CONFIGS=-DSM_ENABLE=0 ./ci/blackbox.sh --driver=simx --cores=1 --app=no_smem --perf
CONFIGS="-DSM_DISABLE" ./ci/blackbox.sh --driver=rtlsim --cores=1 --app=no_smem
CONFIGS="-DSM_DISABLE" ./ci/blackbox.sh --driver=rtlsim --cores=1 --app=no_smem --perf=1
CONFIGS="-DSM_DISABLE" ./ci/blackbox.sh --driver=simx --cores=1 --app=no_smem --perf=1
# using Default FPU core
FPU_CORE=FPU_DEFAULT ./ci/blackbox.sh --driver=rtlsim --cores=1 --app=dogfood
# disable L1 cache
CONFIGS="-DL1_DISABLE -DSM_DISABLE" ./ci/blackbox.sh --driver=rtlsim --app=sgemm
CONFIGS="-DDCACHE_DISABLE" ./ci/blackbox.sh --driver=rtlsim --app=sgemm
# using FPNEW FPU core
FPU_CORE=FPU_FPNEW ./ci/blackbox.sh --driver=rtlsim --cores=1 --app=dogfood
# multiple L1 caches per cluster
CONFIGS="-DNUM_DCACHES=2 -DNUM_ICACHES=2" ./ci/blackbox.sh --driver=rtlsim --app=sgemm --cores=8 --warps=1 --threads=2
# using AXI bus
# test AXI bus
AXI_BUS=1 ./ci/blackbox.sh --driver=rtlsim --cores=1 --app=demo
# adjust l1 block size to match l2
CONFIGS="-DL1_BLOCK_SIZE=64" ./ci/blackbox.sh --driver=rtlsim --cores=2 --l2cache --app=io_addr --args="-n1"
CONFIGS="-DL1_LINE_SIZE=64" ./ci/blackbox.sh --driver=rtlsim --cores=2 --l2cache --app=io_addr --args="-n1"
# test cache banking
CONFIGS="-DDNUM_BANKS=1" ./ci/blackbox.sh --driver=rtlsim --cores=1 --app=io_addr
CONFIGS="-DDNUM_BANKS=2" ./ci/blackbox.sh --driver=rtlsim --cores=1 --app=io_addr
CONFIGS="-DDNUM_BANKS=2" ./ci/blackbox.sh --driver=simx --cores=1 --app=io_addr
# test cache multi-porting
CONFIGS="-DDNUM_PORTS=2" ./ci/blackbox.sh --driver=rtlsim --cores=1 --app=io_addr
CONFIGS="-DDNUM_PORTS=2" ./ci/blackbox.sh --driver=rtlsim --cores=1 --app=demo --debug --args="-n1"
CONFIGS="-DL2_NUM_PORTS=2 -DDNUM_PORTS=2" ./ci/blackbox.sh --driver=rtlsim --cores=2 --l2cache --app=io_addr
CONFIGS="-DL2_NUM_PORTS=4 -DDNUM_PORTS=4" ./ci/blackbox.sh --driver=rtlsim --cores=4 --l2cache --app=io_addr
CONFIGS="-DL2_NUM_PORTS=4 -DDNUM_PORTS=4" ./ci/blackbox.sh --driver=simx --cores=4 --l2cache --app=io_addr
CONFIGS="-DSMEM_NUM_BANKS=4 -DDCACHE_NUM_BANKS=1" ./ci/blackbox.sh --driver=rtlsim --app=sgemm
CONFIGS="-DSMEM_NUM_BANKS=2 -DDCACHE_NUM_BANKS=2" ./ci/blackbox.sh --driver=rtlsim --app=sgemm
CONFIGS="-DSMEM_NUM_BANKS=2 -DDCACHE_NUM_BANKS=2" ./ci/blackbox.sh --driver=simx --app=sgemm
CONFIGS="-DDCACHE_NUM_BANKS=1" ./ci/blackbox.sh --driver=rtlsim --cores=1 --app=sgemm
CONFIGS="-DDCACHE_NUM_BANKS=2" ./ci/blackbox.sh --driver=rtlsim --cores=1 --app=sgemm
CONFIGS="-DDCACHE_NUM_BANKS=2" ./ci/blackbox.sh --driver=simx --cores=1 --app=sgemm
# test 128-bit MEM block
CONFIGS=-DMEM_BLOCK_SIZE=16 ./ci/blackbox.sh --driver=vlsim --cores=1 --app=demo
CONFIGS="-DMEM_BLOCK_SIZE=16" ./ci/blackbox.sh --driver=opae --cores=1 --app=demo
# test single-bank DRAM
CONFIGS="-DPLATFORM_PARAM_LOCAL_MEMORY_BANKS=1" ./ci/blackbox.sh --driver=vlsim --cores=1 --app=demo
CONFIGS="-DPLATFORM_PARAM_LOCAL_MEMORY_BANKS=1" ./ci/blackbox.sh --driver=opae --cores=1 --app=demo
# test 27-bit DRAM address
CONFIGS="-DPLATFORM_PARAM_LOCAL_MEMORY_ADDR_WIDTH=27" ./ci/blackbox.sh --driver=vlsim --cores=1 --app=demo
CONFIGS="-DPLATFORM_PARAM_LOCAL_MEMORY_ADDR_WIDTH=27" ./ci/blackbox.sh --driver=opae --cores=1 --app=demo
echo "configuration tests done!"
}
@@ -138,14 +241,9 @@ stress0()
echo "begin stress0 tests..."
# test verilator reset values
CONFIGS="-DVERILATOR_RESET_VALUE=0" ./ci/blackbox.sh --driver=vlsim --cores=2 --clusters=2 --l2cache --l3cache --app=sgemm
CONFIGS="-DVERILATOR_RESET_VALUE=1" ./ci/blackbox.sh --driver=vlsim --cores=2 --clusters=2 --l2cache --l3cache --app=sgemm
FPU_CORE=FPU_DEFAULT CONFIGS="-DVERILATOR_RESET_VALUE=0" ./ci/blackbox.sh --driver=vlsim --cores=2 --clusters=2 --l2cache --l3cache --app=dogfood
FPU_CORE=FPU_DEFAULT CONFIGS="-DVERILATOR_RESET_VALUE=1" ./ci/blackbox.sh --driver=vlsim --cores=2 --clusters=2 --l2cache --l3cache --app=dogfood
CONFIGS="-DVERILATOR_RESET_VALUE=0" ./ci/blackbox.sh --driver=vlsim --cores=2 --clusters=2 --l2cache --l3cache --app=io_addr
CONFIGS="-DVERILATOR_RESET_VALUE=1" ./ci/blackbox.sh --driver=vlsim --cores=2 --clusters=2 --l2cache --l3cache --app=io_addr
CONFIGS="-DVERILATOR_RESET_VALUE=0" ./ci/blackbox.sh --driver=vlsim --app=printf
CONFIGS="-DVERILATOR_RESET_VALUE=1" ./ci/blackbox.sh --driver=vlsim --app=printf
CONFIGS="-DVERILATOR_RESET_VALUE=1" ./ci/blackbox.sh --driver=opae --cores=2 --clusters=2 --l2cache --l3cache --app=dogfood
CONFIGS="-DVERILATOR_RESET_VALUE=1" ./ci/blackbox.sh --driver=opae --cores=2 --clusters=2 --l2cache --l3cache --app=io_addr
CONFIGS="-DVERILATOR_RESET_VALUE=1" ./ci/blackbox.sh --driver=opae --app=printf
echo "stress0 tests done!"
}
@@ -154,51 +252,75 @@ stress1()
{
echo "begin stress1 tests..."
./ci/blackbox.sh --driver=rtlsim --cores=2 --l2cache --clusters=2 --l3cache --app=sgemm --args="-n256"
./ci/blackbox.sh --driver=rtlsim --app=sgemm --args="-n128" --l2cache
echo "stress1 tests done!"
}
usage()
synthesis()
{
echo "usage: regression [-unittest] [-coverage] [-tex] [-cluster] [-debug] [-config] [-stress[#n]] [-all] [-h|--help]"
echo "begin synthesis tests..."
PREFIX=build_base make -C hw/syn/yosys clean
PREFIX=build_base CONFIGS="-DDPI_DISABLE -DEXT_F_DISABLE" make -C hw/syn/yosys elaborate
echo "synthesis tests done!"
}
show_usage()
{
echo "Vortex Regression Test"
echo "Usage: $0 [--unittest] [--isa] [--regression] [--opencl] [--cluster] [--debug] [--config] [--stress[#n]] [--synthesis] [--all] [-h|--help]"
}
start=$SECONDS
while [ "$1" != "" ]; do
case $1 in
-unittest ) unittest
--unittest ) unittest
;;
-coverage ) coverage
--isa ) isa
;;
-tex ) tex
--regression ) regression
;;
-cluster ) cluster
--opencl ) opencl
;;
-debug ) debug
--cluster ) cluster
;;
-config ) config
--debug ) debug
;;
-stress0 ) stress0
--config ) config
;;
-stress1 ) stress1
--stress0 ) stress0
;;
-stress ) stress0
--stress1 ) stress1
;;
--stress ) stress0
stress1
;;
-all ) unittest
coverage
tex
--synthesis ) synthesis
;;
--all ) unittest
isa
regression
opencl
cluster
debug
config
stress0
stress1
synthesis
;;
-h | --help ) usage
-h | --help ) show_usage
exit
;;
* ) usage
* ) show_usage
exit 1
esac
shift
done
done
echo "Regression completed!"
duration=$(( SECONDS - start ))
awk -v t=$duration 'BEGIN{t=int(t*1000); printf "Elapsed Time: %d:%02d:%02d\n", t/3600000, t/60000%60, t/1000%60}'
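The elapsed-time report above converts a `$SECONDS` difference to `H:MM:SS` through awk. As a rough sketch, the same formatting can be done with shell arithmetic alone. `format_elapsed` is a hypothetical name introduced here and is not part of the regression script.

```shell
#!/bin/bash
# Sketch only: format_elapsed is a hypothetical helper, not part of regression.sh.
# It formats a whole number of seconds as H:MM:SS using integer arithmetic.
format_elapsed() {
    local t=$1
    printf "%d:%02d:%02d\n" $(( t / 3600 )) $(( t / 60 % 60 )) $(( t % 60 ))
}

start=$SECONDS
# ... the test functions would run here ...
duration=$(( SECONDS - start ))
echo "Elapsed Time: $(format_elapsed "$duration")"
```

Because `SECONDS` is an integer in bash, the millisecond scaling in the awk version is not needed for this output format.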


@@ -1,38 +0,0 @@
#!/bin/bash
# exit when any command fails
set -e
# ensure build
make -s
coverage()
{
echo "begin coverage tests..."
make -C sim/simx clean
XLEN=64 make -C sim/simx
XLEN=64 make -C tests/riscv/isa run-simx
echo "coverage tests done!"
}
usage()
{
echo "usage: regression [-coverage] [-all] [-h|--help]"
}
while [ "$1" != "" ]; do
case $1 in
-coverage ) coverage
;;
-all ) coverage
;;
-h | --help ) usage
exit
;;
* ) usage
exit 1
esac
shift
done


@@ -1,30 +0,0 @@
#!/bin/bash
# exit when any command fails
set -e
# ensure build
make -s
# clear POCL cache
rm -rf ~/.cache/pocl
# rebuild runtime
make -C runtime clean
make -C runtime
# rebuild drivers
make -C driver clean
make -C driver
# rebuild runtime tests
make -C tests/runtime clean
make -C tests/runtime
# rebuild regression tests
make -C tests/regression clean-all
make -C tests/regression
# rebuild opencl tests
make -C tests/opencl clean-all
make -C tests/opencl

30
ci/toolchain_env.sh Normal file

@@ -0,0 +1,30 @@
#!/bin/sh
# Copyright 2023 blaise
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
TOOLDIR=${TOOLDIR:=/opt}
export VERILATOR_ROOT=$TOOLDIR/verilator
export PATH=$VERILATOR_ROOT/bin:$PATH
export SV2V_PATH=$TOOLDIR/sv2v
export PATH=$SV2V_PATH/bin:$PATH
export YOSYS_PATH=$TOOLDIR/yosys
export PATH=$YOSYS_PATH/bin:$PATH
export LLVM_VORTEX=$TOOLDIR/llvm-vortex
export POCL_CC_PATH=$TOOLDIR/pocl/compiler
export POCL_RT_PATH=$TOOLDIR/pocl/runtime
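The environment file derives every tool path from `TOOLDIR`, which falls back to `/opt` when the caller has not set it. The sketch below illustrates that default-value idiom in isolation. `print_tool_env` is a hypothetical helper for demonstration, and only two of the exported variables are shown.

```shell
#!/bin/sh
# Sketch only: print_tool_env is a hypothetical helper, not part of toolchain_env.sh.
# It shows how ${VAR:=default} picks a fallback without overriding a caller's value.
print_tool_env() {
    # run in a subshell so the caller's environment is left untouched
    (
        TOOLDIR=${1:-}
        # assign /opt only when TOOLDIR is unset or empty
        : "${TOOLDIR:=/opt}"
        echo "VERILATOR_ROOT=$TOOLDIR/verilator"
        echo "LLVM_VORTEX=$TOOLDIR/llvm-vortex"
    )
}

print_tool_env              # falls back to /opt
print_tool_env /tmp/tools   # uses the supplied directory
```

In practice a file like this is sourced rather than executed, for example `TOOLDIR=/tmp/tools . ./ci/toolchain_env.sh`, so that the exports reach the calling shell.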


@@ -1,97 +1,184 @@
#!/bin/bash
# Copyright © 2019-2023
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# exit when any command fails
set -e
REPOSITORY=https://github.com/vortexgpgpu/vortex-toolchain-prebuilt/raw/master
TOOLDIR=${TOOLDIR:=/opt}
OSDIR=${OSDIR:=ubuntu/bionic}
DESTDIR="${DESTDIR:=/opt}"
OS="${OS:=ubuntu/bionic}"
riscv()
{
for x in {a..j}
case $OSDIR in
"centos/7") parts=$(eval echo {a..h}) ;;
*) parts=$(eval echo {a..j}) ;;
esac
rm -f riscv-gnu-toolchain.tar.bz2.parta*
for x in $parts
do
wget $REPOSITORY/riscv-gnu-toolchain/ubuntu/bionic/riscv-gnu-toolchain.tar.bz2.parta$x
wget $REPOSITORY/riscv-gnu-toolchain/$OSDIR/riscv-gnu-toolchain.tar.bz2.parta$x
done
cat riscv-gnu-toolchain.tar.bz2.parta* > riscv-gnu-toolchain.tar.bz2
tar -xvf riscv-gnu-toolchain.tar.bz2
rm -f riscv-gnu-toolchain.tar.bz2*
cp -r riscv-gnu-toolchain $DESTDIR
cp -r riscv-gnu-toolchain $TOOLDIR
rm -f riscv-gnu-toolchain.tar.bz2*
rm -rf riscv-gnu-toolchain
}
riscv64()
{
for x in {a..j}
case $OSDIR in
"centos/7") parts=$(eval echo {a..h}) ;;
*) parts=$(eval echo {a..j}) ;;
esac
rm -f riscv64-gnu-toolchain.tar.bz2.parta*
for x in $parts
do
wget $REPOSITORY/riscv64-gnu-toolchain/ubuntu/bionic/riscv64-gnu-toolchain.tar.bz2.parta$x
wget $REPOSITORY/riscv64-gnu-toolchain/$OSDIR/riscv64-gnu-toolchain.tar.bz2.parta$x
done
cat riscv64-gnu-toolchain.tar.bz2.parta* > riscv64-gnu-toolchain.tar.bz2
tar -xvf riscv64-gnu-toolchain.tar.bz2
rm -f riscv64-gnu-toolchain.tar.bz2*
cp -r riscv64-gnu-toolchain $DESTDIR
cp -r riscv64-gnu-toolchain $TOOLDIR
rm -f riscv64-gnu-toolchain.tar.bz2*
rm -rf riscv64-gnu-toolchain
}
llvm()
llvm-vortex()
{
for x in {a..b}
case $OSDIR in
"centos/7") parts=$(eval echo {a..b}) ;;
*) parts=$(eval echo {a..b}) ;;
esac
echo $parts
rm -f llvm-vortex.tar.bz2.parta*
for x in $parts
do
wget $REPOSITORY/llvm-vortex/ubuntu/bionic/llvm-vortex1.tar.bz2.parta$x
wget $REPOSITORY/llvm-vortex/$OSDIR/llvm-vortex.tar.bz2.parta$x
done
cat llvm-vortex1.tar.bz2.parta* > llvm-vortex1.tar.bz2
tar -xvf llvm-vortex1.tar.bz2
rm -f llvm-vortex1.tar.bz2*
cp -r llvm-riscv $DESTDIR
rm -rf llvm-riscv
cat llvm-vortex.tar.bz2.parta* > llvm-vortex.tar.bz2
tar -xvf llvm-vortex.tar.bz2
cp -r llvm-vortex $TOOLDIR
rm -f llvm-vortex.tar.bz2*
rm -rf llvm-vortex
}
llvm-pocl()
{
case $OSDIR in
"centos/7") parts=$(eval echo {a..b}) ;;
*) parts=$(eval echo {a..b}) ;;
esac
echo $parts
rm -f llvm-pocl.tar.bz2.parta*
for x in $parts
do
wget $REPOSITORY/llvm-pocl/$OSDIR/llvm-pocl.tar.bz2.parta$x
done
cat llvm-pocl.tar.bz2.parta* > llvm-pocl.tar.bz2
tar -xvf llvm-pocl.tar.bz2
cp -r llvm-pocl $TOOLDIR
rm -f llvm-pocl.tar.bz2*
rm -rf llvm-pocl
}
pocl()
{
wget $REPOSITORY/pocl/ubuntu/bionic/pocl1.tar.bz2
tar -xvf pocl1.tar.bz2
rm -f pocl1.tar.bz2
cp -r pocl $DESTDIR
wget $REPOSITORY/pocl/$OSDIR/pocl.tar.bz2
tar -xvf pocl.tar.bz2
rm -f pocl.tar.bz2
cp -r pocl $TOOLDIR
rm -rf pocl
}
verilator()
{
wget $REPOSITORY/verilator/ubuntu/bionic/verilator.tar.bz2
wget $REPOSITORY/verilator/$OSDIR/verilator.tar.bz2
tar -xvf verilator.tar.bz2
rm -f verilator.tar.bz2
cp -r verilator $DESTDIR
cp -r verilator $TOOLDIR
rm -f verilator.tar.bz2
rm -rf verilator
}
usage()
sv2v()
{
echo "usage: toolchain_install [[-riscv] [-riscv64] [-llvm] [-pocl] [-verilator] [-all] [-h|--help]]"
wget $REPOSITORY/sv2v/$OSDIR/sv2v.tar.bz2
tar -xvf sv2v.tar.bz2
rm -f sv2v.tar.bz2
cp -r sv2v $TOOLDIR
rm -rf sv2v
}
yosys()
{
case $OSDIR in
"centos/7") parts=$(eval echo {a..c}) ;;
*) parts=$(eval echo {a..c}) ;;
esac
echo $parts
rm -f yosys.tar.bz2.parta*
for x in $parts
do
wget $REPOSITORY/yosys/$OSDIR/yosys.tar.bz2.parta$x
done
cat yosys.tar.bz2.parta* > yosys.tar.bz2
tar -xvf yosys.tar.bz2
cp -r yosys $TOOLDIR
rm -f yosys.tar.bz2*
rm -rf yosys
}
show_usage()
{
echo "Install Pre-built Vortex Toolchain"
echo "Usage: $0 [[--riscv] [--riscv64] [--llvm-vortex] [--llvm-pocl] [--pocl] [--verilator] [--sv2v] [--yosys] [--all] [-h|--help]]"
}
while [ "$1" != "" ]; do
case $1 in
-pocl ) pocl
--pocl ) pocl
;;
-verilator ) verilator
;;
-riscv ) riscv
;;
-riscv64 ) riscv64
;;
-llvm ) llvm
--verilator ) verilator
;;
-all ) riscv
riscv64
llvm
pocl
verilator
;;
-h | --help ) usage
exit
;;
* ) usage
exit 1
--riscv ) riscv
;;
--riscv64 ) riscv64
;;
--llvm-vortex ) llvm-vortex
;;
--llvm-pocl ) llvm-pocl
;;
--sv2v ) sv2v
;;
--yosys ) yosys
;;
--all ) pocl
verilator
sv2v
yosys
llvm-vortex
riscv
riscv64
;;
-h | --help ) show_usage
exit
;;
* ) show_usage
exit 1
esac
shift
done
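The install functions above fetch parts named `.partaa`, `.partab`, and so on, then join them with `cat`. That naming matches the default two-letter suffixes produced by `split`. The following sketch demonstrates the round trip on a small stand-in file. `split_roundtrip` and the file names are hypothetical, and the sizes are chosen only to keep the demo fast.

```shell
#!/bin/bash
# Sketch only: split_roundtrip is a hypothetical demo, not part of the toolchain scripts.
# It splits a payload into fixed-size parts, rejoins them, and checks the result.
split_roundtrip() {
    local size=$1 chunk=$2 workdir
    workdir=$(mktemp -d) || return 1
    (
        cd "$workdir" || exit 1
        # stand-in payload of $size bytes (a real run would use the .tar.bz2 archive)
        head -c "$size" /dev/zero > archive.bin
        # split names the parts <prefix>aa, <prefix>ab, ... so a glob sorts them in order
        split -b "$chunk" archive.bin archive.bin.part
        printf '%s\n' archive.bin.part*
        cat archive.bin.part* > rebuilt.bin
        cmp -s archive.bin rebuilt.bin && echo "match"
    )
    local rc=$?
    rm -rf "$workdir"
    return $rc
}

split_roundtrip 2500 1000
```

With a 2500-byte payload and 1000-byte chunks this yields three parts, the last one shorter than the others, and the rejoined file compares equal to the original.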

128
ci/toolchain_prebuilt.sh Executable file

@@ -0,0 +1,128 @@
#!/bin/bash
# Copyright © 2019-2023
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# exit when any command fails
set -e
TOOLDIR=${TOOLDIR:=/opt}
OSDIR=${OSDIR:=ubuntu/bionic}
riscv()
{
echo "prebuilt riscv-gnu-toolchain..."
tar -C $TOOLDIR -cvjf riscv-gnu-toolchain.tar.bz2 riscv-gnu-toolchain
split -b 50M riscv-gnu-toolchain.tar.bz2 "riscv-gnu-toolchain.tar.bz2.part"
mv riscv-gnu-toolchain.tar.bz2.part* ./riscv-gnu-toolchain/$OSDIR
rm riscv-gnu-toolchain.tar.bz2
}
riscv64()
{
echo "prebuilt riscv64-gnu-toolchain..."
tar -C $TOOLDIR -cvjf riscv64-gnu-toolchain.tar.bz2 riscv64-gnu-toolchain
split -b 50M riscv64-gnu-toolchain.tar.bz2 "riscv64-gnu-toolchain.tar.bz2.part"
mv riscv64-gnu-toolchain.tar.bz2.part* ./riscv64-gnu-toolchain/$OSDIR
rm riscv64-gnu-toolchain.tar.bz2
}
llvm-vortex()
{
echo "prebuilt llvm-vortex..."
tar -C $TOOLDIR -cvjf llvm-vortex.tar.bz2 llvm-vortex
split -b 50M llvm-vortex.tar.bz2 "llvm-vortex.tar.bz2.part"
mv llvm-vortex.tar.bz2.part* ./llvm-vortex/$OSDIR
rm llvm-vortex.tar.bz2
}
llvm-pocl()
{
echo "prebuilt llvm-pocl..."
tar -C $TOOLDIR -cvjf llvm-pocl.tar.bz2 llvm-pocl
split -b 50M llvm-pocl.tar.bz2 "llvm-pocl.tar.bz2.part"
mv llvm-pocl.tar.bz2.part* ./llvm-pocl/$OSDIR
rm llvm-pocl.tar.bz2
}
pocl()
{
echo "prebuilt pocl..."
tar -C $TOOLDIR -cvjf pocl.tar.bz2 pocl
mv pocl.tar.bz2 ./pocl/$OSDIR
}
verilator()
{
echo "prebuilt verilator..."
tar -C $TOOLDIR -cvjf verilator.tar.bz2 verilator
mv verilator.tar.bz2 ./verilator/$OSDIR
}
sv2v()
{
echo "prebuilt sv2v..."
tar -C $TOOLDIR -cvjf sv2v.tar.bz2 sv2v
mv sv2v.tar.bz2 ./sv2v/$OSDIR
}
yosys()
{
echo "prebuilt yosys..."
tar -C $TOOLDIR -cvjf yosys.tar.bz2 yosys
split -b 50M yosys.tar.bz2 "yosys.tar.bz2.part"
mv yosys.tar.bz2.part* ./yosys/$OSDIR
rm yosys.tar.bz2
}
show_usage()
{
echo "Setup Pre-built Vortex Toolchain"
echo "Usage: $0 [[--riscv] [--riscv64] [--llvm-vortex] [--llvm-pocl] [--pocl] [--verilator] [--sv2v] [--yosys] [--all] [-h|--help]]"
}
while [ "$1" != "" ]; do
case $1 in
--pocl ) pocl
;;
--verilator ) verilator
;;
--riscv ) riscv
;;
--riscv64 ) riscv64
;;
--llvm-vortex ) llvm-vortex
;;
--llvm-pocl ) llvm-pocl
;;
--sv2v ) sv2v
;;
--yosys ) yosys
;;
--all ) riscv
riscv64
llvm-vortex
llvm-pocl
pocl
verilator
sv2v
yosys
;;
-h | --help ) show_usage
exit
;;
* ) show_usage
exit 1
esac
shift
done
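The option loop above processes each argument in order and stops with a non-zero status on the first unknown flag. A reduced sketch of the same `while`/`case`/`shift` pattern follows. The component names (`build_alpha`, `build_beta`) and the `dispatch` wrapper are hypothetical and stand in for the real packaging functions.

```shell
#!/bin/bash
# Sketch only: the names below are hypothetical stand-ins for the packaging functions.
build_alpha() { echo "packaging alpha"; }
build_beta()  { echo "packaging beta"; }

show_usage() { echo "Usage: $0 [--alpha] [--beta] [--all] [-h|--help]"; }

# Handle each flag in the order given; fail on the first unknown flag.
dispatch() {
    while [ "$1" != "" ]; do
        case $1 in
            --alpha ) build_alpha ;;
            --beta )  build_beta ;;
            --all )   build_alpha
                      build_beta ;;
            -h | --help ) show_usage
                          return 0 ;;
            * ) show_usage
                return 1 ;;
        esac
        shift
    done
}

dispatch --beta --alpha
```

Wrapping the loop in a function and using `return` instead of `exit` is a choice made for this sketch so that the pattern can be exercised without terminating the calling shell.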

245
ci/trace_csv.py Executable file

@@ -0,0 +1,245 @@
#!/usr/bin/env python3
# Copyright © 2019-2023
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
import argparse
import csv
import re
def parse_args():
parser = argparse.ArgumentParser(description='CPU trace log to CSV format converter.')
parser.add_argument('-t', '--type', default='simx', help='log type (rtlsim or simx)')
parser.add_argument('-o', '--csv', default='trace.csv', help='Output CSV file')
parser.add_argument('log', help='Input log file')
return parser.parse_args()
def parse_simx(log_filename):
    pc_pattern = r"PC=(0x[0-9a-fA-F]+)"
    instr_pattern = r"Instr (0x[0-9a-fA-F]+):"
    opcode_pattern = r"Instr 0x[0-9a-fA-F]+: ([0-9a-zA-Z_\.]+)"
    core_id_pattern = r"cid=(\d+)"
    warp_id_pattern = r"wid=(\d+)"
    tmask_pattern = r"tmask=(\d+)"
    operands_pattern = r"Src\d+ Reg: (.+)"
    destination_pattern = r"Dest Reg: (.+)"
    uuid_pattern = r"#(\d+)"
    entries = []
    with open(log_filename, 'r') as log_file:
        instr_data = None
        for lineno, line in enumerate(log_file, start=1):
            if line.startswith("DEBUG Fetch:"):
                if instr_data:
                    entries.append(instr_data)
                instr_data = {}
                instr_data["lineno"] = lineno
                instr_data["PC"] = re.search(pc_pattern, line).group(1)
                instr_data["core_id"] = re.search(core_id_pattern, line).group(1)
                instr_data["warp_id"] = re.search(warp_id_pattern, line).group(1)
                instr_data["tmask"] = re.search(tmask_pattern, line).group(1)
                instr_data["uuid"] = re.search(uuid_pattern, line).group(1)
            elif line.startswith("DEBUG Instr"):
                instr_data["instr"] = re.search(instr_pattern, line).group(1)
                instr_data["opcode"] = re.search(opcode_pattern, line).group(1)
            elif line.startswith("DEBUG Src"):
                src_reg = re.search(operands_pattern, line).group(1)
                instr_data["operands"] = (instr_data["operands"] + ', ' + src_reg) if 'operands' in instr_data else src_reg
            elif line.startswith("DEBUG Dest"):
                instr_data["destination"] = re.search(destination_pattern, line).group(1)
        if instr_data:
            entries.append(instr_data)
    return entries
def reverse_binary(bin_str):
    return bin_str[::-1]

def bin_to_array(bin_str):
    return [int(bit) for bit in bin_str]

def append_reg(text, value, sep):
    if sep:
        text += ", "
    ivalue = int(value)
    if (ivalue >= 32):
        text += "f" + str(ivalue % 32)
    else:
        text += "x" + value
    sep = True
    return text, sep

def append_imm(text, value, sep):
    if sep:
        text += ", "
    text += value
    sep = True
    return text, sep

def append_value(text, reg, value, tmask_arr, sep):
    text, sep = append_reg(text, reg, sep)
    text += "={"
    for i in range(len(tmask_arr)):
        if i != 0:
            text += ", "
        if tmask_arr[i]:
            text += value[i]
        else:
            text += "-"
    text += "}"
    return text, sep
def parse_rtlsim(log_filename):
    line_pattern = r"\d+: core(\d+)-(decode|issue|commit)"
    pc_pattern = r"PC=(0x[0-9a-fA-F]+)"
    instr_pattern = r"instr=(0x[0-9a-fA-F]+)"
    ex_pattern = r"ex=([a-zA-Z]+)"
    op_pattern = r"op=([\?0-9a-zA-Z_\.]+)"
    warp_id_pattern = r"wid=(\d+)"
    tmask_pattern = r"tmask=(\d+)"
    wb_pattern = r"wb=(\d)"
    opds_pattern = r"opds=(\d+)"
    use_imm_pattern = r"use_imm=(\d)"
    imm_pattern = r"imm=(0x[0-9a-fA-F]+)"
    rd_pattern = r"rd=(\d+)"
    rs1_pattern = r"rs1=(\d+)"
    rs2_pattern = r"rs2=(\d+)"
    rs3_pattern = r"rs3=(\d+)"
    rs1_data_pattern = r"rs1_data=\{(.+?)\}"
    rs2_data_pattern = r"rs2_data=\{(.+?)\}"
    rs3_data_pattern = r"rs3_data=\{(.+?)\}"
    rd_data_pattern = r"data=\{(.+?)\}"
    eop_pattern = r"eop=(\d)"
    uuid_pattern = r"#(\d+)"
    entries = []
    with open(log_filename, 'r') as log_file:
        instr_data = {}
        for lineno, line in enumerate(log_file, start=1):
            line_match = re.search(line_pattern, line)
            if line_match:
                PC = re.search(pc_pattern, line).group(1)
                warp_id = re.search(warp_id_pattern, line).group(1)
                tmask = re.search(tmask_pattern, line).group(1)
                uuid = re.search(uuid_pattern, line).group(1)
                core_id = line_match.group(1)
                stage = line_match.group(2)
                if stage == "decode":
                    trace = {}
                    trace["uuid"] = uuid
                    trace["PC"] = PC
                    trace["core_id"] = core_id
                    trace["warp_id"] = warp_id
                    trace["tmask"] = reverse_binary(tmask)
                    trace["instr"] = re.search(instr_pattern, line).group(1)
                    trace["opcode"] = re.search(op_pattern, line).group(1)
                    trace["opds"] = bin_to_array(re.search(opds_pattern, line).group(1))
                    trace["rd"] = re.search(rd_pattern, line).group(1)
                    trace["rs1"] = re.search(rs1_pattern, line).group(1)
                    trace["rs2"] = re.search(rs2_pattern, line).group(1)
                    trace["rs3"] = re.search(rs3_pattern, line).group(1)
                    trace["use_imm"] = re.search(use_imm_pattern, line).group(1) == "1"
                    trace["imm"] = re.search(imm_pattern, line).group(1)
                    instr_data[uuid] = trace
                elif stage == "issue":
                    if uuid in instr_data:
                        trace = instr_data[uuid]
                        trace["lineno"] = lineno
                        opds = trace["opds"]
                        if opds[1]:
                            trace["rs1_data"] = re.search(rs1_data_pattern, line).group(1).split(', ')[::-1]
                        if opds[2]:
                            trace["rs2_data"] = re.search(rs2_data_pattern, line).group(1).split(', ')[::-1]
                        if opds[3]:
                            trace["rs3_data"] = re.search(rs3_data_pattern, line).group(1).split(', ')[::-1]
                        trace["issued"] = True
                        instr_data[uuid] = trace
                elif stage == "commit":
                    if uuid in instr_data:
                        trace = instr_data[uuid]
                        if "issued" in trace:
                            opds = trace["opds"]
                            dst_tmask_arr = bin_to_array(tmask)[::-1]
                            wb = re.search(wb_pattern, line).group(1) == "1"
                            if wb:
                                rd_data = re.search(rd_data_pattern, line).group(1).split(', ')[::-1]
                                if 'rd_data' in trace:
                                    merged_rd_data = trace['rd_data']
                                    for i in range(len(dst_tmask_arr)):
                                        if dst_tmask_arr[i] == 1:
                                            merged_rd_data[i] = rd_data[i]
                                    trace['rd_data'] = merged_rd_data
                                else:
                                    trace['rd_data'] = rd_data
                                instr_data[uuid] = trace
                            eop = re.search(eop_pattern, line).group(1) == "1"
                            if eop:
                                tmask_arr = bin_to_array(trace["tmask"])
                                destination = ''
                                if wb:
                                    destination, sep = append_value(destination, trace["rd"], trace['rd_data'], tmask_arr, False)
                                    del trace['rd_data']
                                trace["destination"] = destination
                                operands = ''
                                sep = False
                                if opds[1]:
                                    operands, sep = append_value(operands, trace["rs1"], trace["rs1_data"], tmask_arr, sep)
                                    del trace["rs1_data"]
                                if opds[2]:
                                    operands, sep = append_value(operands, trace["rs2"], trace["rs2_data"], tmask_arr, sep)
                                    del trace["rs2_data"]
                                if opds[3]:
                                    operands, sep = append_value(operands, trace["rs3"], trace["rs3_data"], tmask_arr, sep)
                                    del trace["rs3_data"]
                                trace["operands"] = operands
                                del trace["opds"]
                                del trace["rd"]
                                del trace["rs1"]
                                del trace["rs2"]
                                del trace["rs3"]
                                del trace["use_imm"]
                                del trace["imm"]
                                del trace["issued"]
                                del instr_data[uuid]
                                entries.append(trace)
    return entries
def write_csv(log_filename, csv_filename, log_type):
    entries = None
    # parse log file
    if log_type == "rtlsim":
        entries = parse_rtlsim(log_filename)
    elif log_type == "simx":
        entries = parse_simx(log_filename)
    else:
        print('Error: invalid log type')
        sys.exit(1)  # non-zero exit status so callers can detect the failure
    # sort entries by core, then warp, then position in the log
    entries.sort(key=lambda x: (int(x['core_id']), int(x['warp_id']), int(x['lineno'])))
    for entry in entries:
        del entry['lineno']
    # write to CSV
    with open(csv_filename, 'w', newline='') as csv_file:
        fieldnames = ["uuid", "PC", "opcode", "instr", "core_id", "warp_id", "tmask", "operands", "destination"]
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        writer.writeheader()
        for entry in entries:
            writer.writerow(entry)

def main():
    args = parse_args()
    write_csv(args.log, args.csv, args.type)

if __name__ == "__main__":
    main()

View File

@@ -1,4 +1,18 @@
#!/usr/bin/env python
# Copyright 2019-2023
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
import time
import threading
@@ -7,7 +21,7 @@ import subprocess
# This script executes a long-running command while outputting "still running ..." periodically
# to notify Travis build system that the program has not hung
PING_INTERVAL=15
PING_INTERVAL=300 # 5 minutes
def monitor(stop):
wait_time = 0
@@ -20,11 +34,11 @@ def monitor(stop):
break
def execute(command):
process = subprocess.Popen(command, stdout=subprocess.PIPE)
process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
while True:
output = process.stdout.readline()
if output:
line = output.decode('ascii').rstrip()
line = output.decode('utf-8').rstrip()
print(">>> " + line)
process.stdout.flush()
ret = process.poll()

View File

@@ -2,69 +2,26 @@
The Vortex Cache Sub-system has the following main properties:
- High-bandwidth with bank parallelism
- Snoop protocol to flush data for CPU access
- Generic design: Dcache, Icache, Shared Memory, L2 cache, L3 cache
- High-bandwidth transfer with Multi-bank parallelism
- Non-blocking pipelined architecture with local MSHR
- Configurable design: Dcache, Icache, L2 cache, L3 cache
### Cache Hierarchy
### Cache Microarchitecture
![Image of Cache Hierarchy](./assets/img/cache_hierarchy.png)
![Image of Cache Hierarchy](./assets/img/cache_microarchitecture.png)
- Cache can be configured to be any level in the hierarchy
- Caches communicate via snooping
- Cache flush from AFU is passed down the hierarchy
The Vortex cache comprises multiple parallel banks and is built from the following modules:
- **Bank request dispatch crossbar**: assigns a bank to each incoming request and resolves collisions using stalls.
- **Bank response merge crossbar**: merges results from the banks and forwards them to the core response port.
- **Memory request multiplexer**: arbitrates bank memory requests.
- **Memory response demultiplexer**: forwards each memory response to the corresponding bank.
- **Flush Unit**: performs tag memory initialization.
### VX_cache.v (Top Module)
Incoming requests entering the cache are sent to a dispatch crossbar that selects the corresponding bank for each request, resolving bank collisions with stalls. The output of each bank is merged back into the outgoing response port via a merge crossbar. Each bank integrates a non-blocking pipeline with a local Miss Status Holding Register (MSHR) to reduce the miss penalty. The bank pipeline consists of the following stages:
VX.cache.v is the top module of the cache verilog code located in the `/hw/rtl/cache` directory.
- **Schedule**: Selects the next request into the pipeline from the incoming core request, memory fill, or the MSHR entry, with priority given to the latter.
- **Tag Access**: A single-port read/write access to the tag store.
- **Data Access**: Single-port read/write access to the data store.
- **Response Handling**: Core response back to the core.
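As a rough illustration of the dispatch step described above, the following sketch models one cycle of requests being assigned to banks, with colliding requests stalled for a later cycle. The line size, the bank count, the modulo mapping, and the function names are assumptions made for illustration; they are not taken from the RTL.

```python
# Toy model of bank request dispatch: at most one request per bank per cycle.
# LINE_SIZE, NUM_BANKS and the address-to-bank mapping are illustrative assumptions.
LINE_SIZE = 64   # bytes per cache line (assumed)
NUM_BANKS = 4    # number of parallel banks (assumed)

def bank_of(addr):
    """Map a byte address to a bank by interleaving cache lines across banks."""
    return (addr // LINE_SIZE) % NUM_BANKS

def dispatch(addrs):
    """Split one cycle of requests into (granted, stalled).

    granted maps bank index -> address accepted this cycle;
    stalled lists the addresses that collided and must retry.
    """
    granted, stalled = {}, []
    for addr in addrs:
        bank = bank_of(addr)
        if bank in granted:
            stalled.append(addr)   # bank already taken this cycle: stall
        else:
            granted[bank] = addr
    return granted, stalled

granted, stalled = dispatch([0x000, 0x040, 0x100, 0x0C0])
print(granted, stalled)
```

In this example the third request maps to the same bank as the first, so it is the only one stalled.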
![Image of Vortex Cache](./assets/img/vortex_cache_top_module.png)
- Configurable (Cache size, number of banks, bank line size, etc.)
- I/O signals
- Core Request
- Core Rsp
- DRAM Req
- DRAM Rsp
- Snoop Rsp
- Snoop Rsp
- Snoop Forwarding Out
- Snoop Forwarding In
- Bank Select
- Assigns valid and ready signals for each bank
- Snoop Forwarder
- DRAM Request Arbiter
- Prepares cache response for communication with DRAM
- Snoop Response Arbiter
- Sends snoop response
- Core Response Merge
- Cache accesses one line at a time. As a result, each request may not come back in the same response. This module tries to recombine the responses by thread ID.
### VX_bank.v
VX_bank.v is the verilog code that handles cache bank functionality and is located in the `/hw/rtl/cache` directory.
![Image of Vortex Cache Bank](./assets/img/vortex_bank.png)
- Allows for high throughput
- Each bank contains queues to hold requests to the cache
- I/O signals
- Core request
- Core Response
- DRAM Fill Requests
- DRAM Fill Response
- DRAM WB Requests
- Snp Request
- Snp Response
- Request Priority: DRAM fill, miss reserve, core request, snoop request
- Snoop Request Queue
- DRAM Fill Queue
- Core Req Arbiter
- Requests to be processed by the bank
- Tag Data Store
- Registers for valid, dirty, dirtyb, tag, and data
- Length of registers determined by lines in the bank
- Tag Data Access:
- I/O: stall, snoop info, force request miss
- Writes to cache or sends read response; hit or miss determined here
- A missed request goes to the miss reserve if it is not a snoop request or DRAM fill
Deadlocks inside the cache can occur when the MSHR is full and a new request is already in the pipeline. They can also occur when the memory request queue is full and there is an incoming memory response. The cache mitigates MSHR deadlocks by asserting an early full signal before a new request is issued, and similarly mitigates memory deadlocks by ensuring that its request queue never fills up.
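The early full signal can be pictured with a small model. This is a minimal sketch under stated assumptions: the class name, the `pipeline_depth` margin, and the exact threshold are hypothetical and only illustrate the idea of reserving MSHR entries for requests that are already in flight.

```python
class ToyMSHR:
    """Toy miss status holding register file with an early-full check (illustrative only)."""

    def __init__(self, capacity, pipeline_depth):
        self.capacity = capacity              # total MSHR entries
        self.pipeline_depth = pipeline_depth  # requests that may already be past the scheduler (assumed margin)
        self.used = 0

    def early_full(self):
        # Report "full" early so that every request already in the pipeline
        # can still allocate an entry if it misses.
        return self.used + self.pipeline_depth >= self.capacity

    def can_issue(self):
        """The scheduler accepts a new request only while early_full is low."""
        return not self.early_full()

    def allocate(self):
        if self.used >= self.capacity:
            raise RuntimeError("MSHR overflow: a miss has nowhere to go")
        self.used += 1

    def release(self):
        self.used -= 1

mshr = ToyMSHR(capacity=4, pipeline_depth=2)
mshr.allocate()
mshr.allocate()
print(mshr.can_issue())
```

With four entries and a margin of two, the scheduler stops issuing once two entries are used, which leaves room for the two requests that may still be in the pipeline.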

View File

@@ -3,38 +3,39 @@
The directory/file layout of the Vortex codebase is as follows:
- `hw`:
- `rtl`: hardware rtl sources
- `cache`: cache subsystem code
- `fp_cores`: floating point unit code
- `rtl`: hardware rtl sources
- `core`: core pipeline
- `cache`: cache subsystem
- `mem`: memory subsystem
- `fpu`: floating point unit
- `interfaces`: interfaces for inter-module communication
- `libs`: general-purpose RTL modules
- `libs`: general-purpose RTL modules
- `syn`: synthesis directory
- `opae`: OPAE synthesis scripts
- `quartus`: Quartus synthesis scripts
- `altera`: Altera synthesis scripts
- `xilinx`: Xilinx synthesis scripts
- `synopsys`: Synopsys synthesis scripts
- `modelsim`: Modelsim synthesis scripts
- `yosys`: Yosys synthesis scripts
- `unit_tests`: unit tests for some hardware components
- `driver`: host drivers repository
- `runtime`: host runtime software APIs
- `include`: Vortex driver public headers
- `stub`: Vortex stub driver library
- `fpga`: software driver that uses Intel OPAE FPGA
- `asesim`: software driver that uses Intel ASE simulator
- `vlsim`: software driver that uses vlsim simulator
- `opae`: software driver that uses Intel OPAE API with device targets=fpga|asesim|opaesim
- `xrt`: software driver that uses Xilinx XRT API with device targets=hw|hw_emu|sw_emu
- `rtlsim`: software driver that uses rtlsim simulator
- `simx`: software driver that uses simX simulator
- `runtime`: kernel runtime software
- `kernel`: GPU kernel software APIs
- `include`: Vortex runtime public headers
- `linker`: linker file for compiling kernels
- `src`: runtime implementation
- `sim`:
- `vlsim`: AFU RTL simulator
- `opaesim`: Intel OPAE AFU RTL simulator
- `rtlsim`: processor RTL simulator
- `simX`: cycle approximate simulator for vortex
- `tests`: tests repository.
- `runtime`: runtime tests
- `regression`: regression tests
- `riscv`: RISC-V standard tests
- `riscv`: RISC-V conformance tests
- `kernel`: kernel tests
- `regression`: regression tests
- `opencl`: opencl benchmarks and tests
- `ci`: continuous integration scripts
- `miscs`: miscellaneous resources.

View File

@@ -0,0 +1,36 @@
# Continuous Integration
- Each time you push to the repo, the Continuous Integration pipeline will run
- This pipeline consists of creating the correct development environment, building your code, and running all tests
- This is an extensive pipeline so it might take some time to complete
## Protecting Master Branch
Navigate to your Repository:
Open your repository on GitHub.
Click on "Settings":
In the upper-right corner of your repository page, click on the "Settings" tab.
Select "Branches" in the left sidebar:
On the left sidebar, look for the "Branches" option and click on it.
Choose the Branch:
Under "Branch protection rules," select the branch you want to protect. In this case, choose the main branch.
Enable Branch Protection:
Check the box that says "Protect this branch."
Configure Protection Settings:
You can configure various protection settings. Some common settings include:
Require pull request reviews before merging: This ensures that changes are reviewed before being merged.
Require status checks to pass before merging: This ensures that automated tests and checks are passing.
Require signed commits: This enforces that commits are signed with a verified signature.
Restrict Who Can Push:
You can further restrict who can push directly to the branch. You might want to limit this privilege to specific people or teams.
Save Changes:
Once you've configured the protection settings, scroll down and click on the "Save changes" button.
Now, your main branch is protected, and certain criteria must be met before changes can be pushed directly to it. Contributors will need to create pull requests, have their changes reviewed, and meet other specified criteria before the changes can be merged into the main branch.
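The same settings can also be applied programmatically. The sketch below only builds the request for GitHub's REST "update branch protection" endpoint and does not send it. The field names follow that API as commonly documented and should be checked against the current GitHub docs; the function name and the status check name `ci` are hypothetical placeholders.

```python
import json

def branch_protection_request(owner, repo, branch, status_checks):
    """Build (method, path, payload) for a branch protection update; nothing is sent."""
    path = f"/repos/{owner}/{repo}/branches/{branch}/protection"
    payload = {
        # Require the listed status checks to pass before merging.
        "required_status_checks": {"strict": True, "contexts": list(status_checks)},
        # Apply the rules to administrators as well.
        "enforce_admins": True,
        # Require at least one approving pull request review.
        "required_pull_request_reviews": {"required_approving_review_count": 1},
        # None means no extra restriction on who can push.
        "restrictions": None,
    }
    return "PUT", path, payload

method, path, payload = branch_protection_request("vortexgpgpu", "vortex", "master", ["ci"])
print(method, path)
print(json.dumps(payload, indent=2))
```

Sending the request requires an authenticated client with admin rights on the repository, which is deliberately left out of this sketch.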

18
docs/contributing.md Normal file
View File

@@ -0,0 +1,18 @@
# Contributing to Vortex on Github
## Github Details
- There are two main repos, `vortex` (public, this one) and `vortex-dev` (private)
- todo: Most current development is on `vortex`
- If you have a legacy version of `vortex`, you can use the releases branch or tags to access the repo at that point in time
## Contribution Process
- You should create a new branch from develop that is clearly named with the feature that you want to add
- Avoid pushing directly to the `master` branch; instead, open a Pull Request (PR)
- There should be protections in place that prevent pushing directly to the main branch, but don't rely on it
- When you make a PR it will be tested against the continuous integration (ci) pipeline (see `continuous_integration.md`)
- It is not sufficient to just write some tests, they need to be incorporated into the ci pipeline to make sure they are run
- During a PR, you might receive feedback regarding your changes and you might need to make further commits to your branch
## Creating and Adding Tests
see `testing.md`

View File

@@ -1,29 +1,37 @@
# Debugging Vortex Hardware
# Debugging Vortex GPU
## Testing changes to the RTL or simulator GPU driver.
The Blackbox utility script will not pick up your changes if the h/w configuration is the same as during the last run.
To force the utility to rebuild the driver, you need to pass the --rebuild=1 option when running tests.
Using --rebuild=0 will prevent the rebuild even if the h/w configuration is different from the last run.
$ ./ci/blackbox.sh --driver=simx --app=demo --rebuild=1
## SimX Debugging
SimX cycle-approximate simulator allows faster debugging of Vortex kernels' execution.
The recommended method to enable debugging is to pass the `--debug` flag to `blackbox` tool when running a program.
The recommended method to enable debugging is to pass the `--debug=<level>` flag to `blackbox` tool when running a program.
// Running demo program on SimX in debug mode
$ ./ci/blackbox.sh --driver=simx --app=demo --debug
$ ./ci/blackbox.sh --driver=simx --app=demo --debug=1
A debug trace `run.log` is generated in the current directory during the program execution. The trace includes important states of the simulated processor (decoded instruction, register states, pipeline states, etc..). You can increase the verbosity level of the trace by changing the `DEBUG_LEVEL` variable to a value [1-5] (default is 3).
A debug trace `run.log` is generated in the current directory during the program execution. The trace includes important states of the simulated processor (decoded instruction, register states, pipeline states, etc.). You can increase the verbosity of the trace by changing the debug level.
// Using SimX in debug mode with verbose level 4
$ CONFIGS=-DDEBUG_LEVEL=4 ./ci/blackbox.sh --driver=simx --app=demo --debug
// Using SimX in debug mode with verbose level 3
$ ./ci/blackbox.sh --driver=simx --app=demo --debug=3
## RTL Debugging
To debug the processor RTL, you need to use VLSIM or RTLSIM driver. VLSIM simulates the full processor including the AFU command processor (using `/rtl/afu/vortex_afu.sv` as top module). RTLSIM simulates the Vortex processor only (using `/rtl/Vortex.v` as top module).
To debug the processor RTL, you need to use VLSIM or RTLSIM driver. VLSIM simulates the full processor including the AFU command processor (using `/rtl/afu/opae/vortex_afu.sv` as top module). RTLSIM simulates the Vortex processor only (using `/rtl/Vortex.v` as top module).
The recommended method to enable debugging is to pass the `--debug` flag to `blackbox` tool when running a program.
// Running demo program on vlsim in debug mode
$ ./ci/blackbox.sh --driver=vlsim --app=demo --debug
// Running demo program on the opae simulator in debug mode
$ TARGET=opaesim ./ci/blackbox.sh --driver=opae --app=demo --debug=1
// Running demo program on rtlsim in debug mode
$ ./ci/blackbox.sh --driver=rtlsim --app=demo --debug
$ ./ci/blackbox.sh --driver=rtlsim --app=demo --debug=1
A debug trace `run.log` is generated in the current directory during the program execution. The trace includes important states of the simulated processor (memory, caches, pipeline, stalls, etc.). A waveform trace `trace.vcd` is also generated in the current directory during the program execution. You can visualize the waveform trace using any tool that can open VCD files (Modelsim, Quartus, Vivado, etc.). [GTKwave](http://gtkwave.sourceforge.net) is a great open-source scope analyzer that also works with VCD files.
@@ -32,7 +40,7 @@ A debug trace `run.log` is generated in the current directory during the program
Debugging the FPGA directly may be necessary to investigate runtime bugs that the RTL simulation cannot catch. We have implemented an in-house scope analyzer for Vortex that works when the FPGA is running. To enable the FPGA scope analyzer, the FPGA bitstream should be built using `SCOPE=1` flag
$ cd /hw/syn/opae
$ CONFIGS=-DSCOPE=1 make fpga-4c
$ CONFIGS="-DSCOPE=1" TARGET=fpga make
When running the program on the FPGA, you need to pass the `--scope` flag to the `blackbox` tool.
@@ -40,4 +48,18 @@ When running the program on the FPGA, you need to pass the `--scope` flag to the
$ ./ci/blackbox.sh --driver=fpga --app=demo --scope
A waveform trace `trace.vcd` will be generated in the current directory during the program execution. This trace includes a limited set of signals that are defined in `/hw/scripts/scope.json`. You can expand your signals' selection by updating the json file.
A waveform trace `trace.vcd` will be generated in the current directory during the program execution. This trace includes a limited set of signals that are defined in `/hw/scripts/scope.json`. You can expand your signals' selection by updating the json file.
## Analyzing Vortex trace log
When debugging Vortex RTL or SimX Simulator, reading the trace run.log file can be overwhelming when the trace gets really large.
We provide a trace sanitizer tool, ./ci/trace_csv.py, that converts a large trace into a CSV file containing all the executed instructions with their source and destination operands.
$ ./ci/blackbox.sh --driver=rtlsim --app=demo --debug=3 --log=run_rtlsim.log
$ ./ci/trace_csv.py -trtlsim run_rtlsim.log -otrace_rtlsim.csv
$ ./ci/blackbox.sh --driver=simx --app=demo --debug=3 --log=run_simx.log
$ ./ci/trace_csv.py -tsimx run_simx.log -otrace_simx.csv
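For rtlsim traces, `trace_csv.py` writes the `operands` and `destination` columns as register groups of the form `x5={v0, v1, ...}`, with `-` marking an inactive thread. A minimal sketch for splitting such a cell back into per-thread values follows; the function name and the sample values are made up for illustration, and SimX cells carry the raw log text instead, so this parser is not meant for them.

```python
import re

# One "<reg>={v0, v1, ...}" group, as produced by append_value() in trace_csv.py.
REG_GROUP = re.compile(r"([xf]\d+)=\{([^}]*)\}")

def parse_operands(cell):
    """Return {register: [per-thread values]}; '-' marks an inactive thread."""
    return {reg: [v.strip() for v in body.split(",")]
            for reg, body in REG_GROUP.findall(cell)}

# Sample cell with made-up values for a 4-thread warp.
cell = "x5={0x1, -, 0x3, 0x4}, f2={0x0, -, 0x0, 0x0}"
print(parse_operands(cell))
```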
The first column in the CSV trace is the UUID (universally unique identifier) of the instruction; rows are sorted by core ID, warp ID, and order of appearance in the log. You can use the UUID to trace the same instruction running on either the RTL hardware or the SimX simulator.
This can be very effective if you want to use SimX to debug your RTL hardware by comparing CSV traces.
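A comparison of two CSV traces can be sketched as follows. The column names come from the `fieldnames` list in `trace_csv.py`; the helper names, the choice of fields to compare, and the inline sample rows are assumptions for illustration. The sketch also assumes each UUID appears once per trace.

```python
import csv
import io

def load_trace(csv_file):
    """Index trace rows by UUID from an open CSV file."""
    return {row["uuid"]: row for row in csv.DictReader(csv_file)}

def diff_traces(rtl, simx, fields=("PC", "opcode", "instr")):
    """Return (uuid, field, rtl_value, simx_value) for each mismatch on shared UUIDs."""
    mismatches = []
    for uuid in sorted(rtl.keys() & simx.keys(), key=int):
        for field in fields:
            if rtl[uuid][field] != simx[uuid][field]:
                mismatches.append((uuid, field, rtl[uuid][field], simx[uuid][field]))
    return mismatches

# Made-up sample data standing in for trace_rtlsim.csv and trace_simx.csv.
rtl_csv = "uuid,PC,opcode,instr\n1,0x80000000,auipc,0x00000297\n2,0x80000004,addi,0x00028293\n"
simx_csv = "uuid,PC,opcode,instr\n1,0x80000000,auipc,0x00000297\n2,0x80000008,addi,0x00028293\n"

mismatches = diff_traces(load_trace(io.StringIO(rtl_csv)), load_trace(io.StringIO(simx_csv)))
print(mismatches)
```

With real traces, pass files opened with `open("trace_rtlsim.csv", newline="")` to `load_trace` instead of the in-memory samples.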

45
docs/environment_setup.md Normal file
View File

@@ -0,0 +1,45 @@
# Environment Setup
These instructions apply to the development vortex repo using the updated toolchain. The updated toolchain is considered to be any commit of `master` pulled from July 2, 2023 onwards. The toolchain update in question can be viewed in this [commit](https://github.com/vortexgpgpu/vortex-dev/commit/0048496ba28d7b9a209a0e569d52d60f2b68fc04). Therefore, if you are unsure whether you are using the new toolchain or not, then you should check the `ci` folder for the existence of the `toolchain_prebuilt.sh` script. Furthermore, you should notice that the `toolchain_install.sh` script has the legacy `llvm()` split into `llvm-vortex()` and `llvm-pocl()`.
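The check described above can be automated. This is a minimal sketch: the function name is hypothetical, and it assumes the install script declares its functions with the literal text `llvm-vortex()` and `llvm-pocl()`.

```python
from pathlib import Path

def uses_updated_toolchain(repo_root):
    """Heuristic check for the updated toolchain layout in a Vortex checkout."""
    ci = Path(repo_root) / "ci"
    # The prebuilt-toolchain script only exists after the update.
    has_prebuilt = (ci / "toolchain_prebuilt.sh").is_file()
    # The legacy llvm() function is split into two functions after the update.
    install = ci / "toolchain_install.sh"
    text = install.read_text() if install.is_file() else ""
    has_split_llvm = "llvm-vortex()" in text and "llvm-pocl()" in text
    return has_prebuilt and has_split_llvm
```

Calling `uses_updated_toolchain(".")` from the repository root returns `True` only when both signs of the updated toolchain are present.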
## Set Up on Your Own System
The toolchain binaries provided with Vortex are built on Ubuntu-based systems. To install Vortex on your own system, [follow these instructions](install_vortex.md).
## Servers for Georgia Tech Students and Collaborators
### Volvo
Volvo is a 64-core server provided by HPArch. You need valid credentials to access it. If you don't already have access, you can get in contact with your mentor to ask about setting your account up.
Setup on Volvo:
1. Connect to Georgia Tech's VPN or ssh into another machine on campus
2. `ssh volvo.cc.gatech.edu`
3. Clone Vortex to your home directory: `git clone --recursive https://github.com/vortexgpgpu/vortex.git`
4. `source /nethome/software/set_vortex_env.sh` to set up the necessary environment variables.
5. `make -s` in the `vortex` root directory
6. Run a test program: `./ci/blackbox.sh --cores=2 --app=dogfood`
### Nio
Nio is a 20-core desktop server provided by HPArch. If you have access to Volvo, you also have access to Nio.
Setup on Nio:
1. Connect to Georgia Tech's VPN or ssh into another machine on campus
2. `ssh nio.cc.gatech.edu`
3. Clone Vortex to your home directory: `git clone --recursive https://github.com/vortexgpgpu/vortex.git`
4. `source /opt/set_vortex_env_dev.sh` to set up the necessary environment variables.
5. `make -s` in the `vortex` root directory
6. Run a test program: `./ci/blackbox.sh --cores=2 --app=dogfood`
## Docker (Experimental)
Docker allows for isolated pre-built environments to be created, shared and used. The emulation mode required for ARM-based processors will incur a decrease in performance. Currently, the dockerfile is not included with the official vortex repository and is not actively maintained or supported.
### Setup with Docker
1. Clone repo recursively onto your local machine: `git clone --recursive https://github.com/vortexgpgpu/vortex.git`
2. Download the dockerfile from [here](https://github.gatech.edu/gist/usubramanya3/f1bf3e953faa38a6372e1292ffd0b65c) and place it in the root of the repo.
3. Build the Dockerfile into an image: `docker build --platform=linux/amd64 -t vortex -f dockerfile .`
4. Run a container based on the image: `docker run --rm -v ./:/root/vortex/ -it --name vtx-dev --privileged=true --platform=linux/amd64 vortex`
5. Install the toolchain `./ci/toolchain_install.sh --all` (once per container)
6. `make -s` in `vortex` root directory
7. Run a test program: `./ci/blackbox.sh --cores=2 --app=dogfood`
You can exit a container and later resume it, or start a second terminal session in a running container with `docker exec -it <container-name> bash`.

View File

@@ -1,128 +0,0 @@
# Execute OpenCL on Vortex backend
## Requirements
- [Vortex](https://github.com/vortexgpgpu/vortex)
- [POCL for Vortex](https://github.com/vortexgpgpu/pocl)
- [riscv-toolchain](https://github.com/riscv-collab/riscv-gnu-toolchain)
- [llvm-riscv](https://github.com/llvm-mirror/llvm)
For installation, please see [Build Instructions](../README.md) for more details.
**For Ubuntu18.04 users, you can directly download pre-build toolchains with [toolchain_install.sh](https://github.com/vortexgpgpu/vortex/blob/master/ci/toolchain_install.sh) script.**
```bash
# please modify the DESTDIR variable in the script before execution
bash toolchain_install.sh -all
```
Assuming we have installed all dependencies in `/opt` path, we can get the following environment:
```bash
tree -L 2 /opt
'''
/opt/
├── llvm-riscv
│ ├── bin
│ ├── include
│ ├── lib
│ ├── libexec
│ └── share
├── pocl
│ ├── compiler
│ └── runtime
├── riscv-gnu-toolchain
│ ├── bin
│ ├── drops
│ ├── include
│ ├── lib
│ ├── libexec
│ ├── riscv32-unknown-elf
│ ├── share
│ └── var
└── verilator
├── bin
├── examples
├── include
├── verilator-config.cmake
└── verilator-config-version.cmake
'''
```
## Execute OpenCL on Vortex
In this tutorial, we show the example of executing a vecadd programs on SIMX backend.
To execute a OpenCL program on Vortex, we have the following steps:
- Compile the [OpenCL kernels](https://github.com/vortexgpgpu/vortex/blob/master/tests/opencl/vecadd/kernel.cl) into risc-v binary by POCL compiler.
- Compile the [OpenCL host](https://github.com/vortexgpgpu/vortex/blob/master/tests/opencl/vecadd/main.cc) and link with Vortex driver(```-lvortex```).
- Execute the compiled host programs on a backend.
Thus, we can write a Makefile as following:
```Makefile
LLVM_PREFIX ?= /opt/llvm-riscv
RISCV_TOOLCHAIN_PATH ?= /opt/riscv-gnu-toolchain
SYSROOT ?= $(RISCV_TOOLCHAIN_PATH)/riscv32-unknown-elf
POCL_CC_PATH ?= /opt/pocl/compiler
POCL_RT_PATH ?= /opt/pocl/runtime
OPTS ?= -n64
# please edit these two variable to your environment
VORTEX_DRV_PATH ?= $(realpath ../../../driver)
VORTEX_RT_PATH ?= $(realpath ../../../runtime)
K_LLCFLAGS += "-O3 -march=riscv32 -target-abi=ilp32f -mcpu=generic-rv32 -mattr=+m,+f -mattr=+vortex -float-abi=hard -code-model=small"
K_CFLAGS += "-v -O3 --sysroot=$(SYSROOT) --gcc-toolchain=$(RISCV_TOOLCHAIN_PATH) -march=rv32imf -mabi=ilp32f -Xclang -target-feature -Xclang +vortex -I$(VORTEX_RT_PATH)/include -fno-rtti -fno-exceptions -ffreestanding -nostartfiles -fdata-sections -ffunction-sections"
K_LDFLAGS += "-Wl,-Bstatic,-T$(VORTEX_RT_PATH)/linker/vx_link.ld -Wl,--gc-sections $(VORTEX_RT_PATH)/libvortexrt.a -lm"
CXXFLAGS += -std=c++11 -O2 -Wall -Wextra -Wfatal-errors
CXXFLAGS += -Wno-deprecated-declarations -Wno-unused-parameter
CXXFLAGS += -I$(POCL_RT_PATH)/include
LDFLAGS += -L$(POCL_RT_PATH)/lib -L$(VORTEX_DRV_PATH)/stub -lOpenCL -lvortex
PROJECT = vecadd
SRCS = main.cc
all: $(PROJECT) kernel.pocl
kernel.pocl: kernel.cl
LLVM_PREFIX=$(LLVM_PREFIX) POCL_DEBUG=all LD_LIBRARY_PATH=$(LLVM_PREFIX)/lib:$(POCL_CC_PATH)/lib $(POCL_CC_PATH)/bin/poclcc -LLCFLAGS $(K_LLCFLAGS) -CFLAGS $(K_CFLAGS) -LDFLAGS $(K_LDFLAGS) -o kernel.pocl kernel.cl
$(PROJECT): $(SRCS)
$(CXX) $(CXXFLAGS) $^ $(LDFLAGS) -o $@
run-fpga: $(PROJECT) kernel.pocl
LD_LIBRARY_PATH=$(POCL_RT_PATH)/lib:$(VORTEX_DRV_PATH)/fpga:$(LD_LIBRARY_PATH) ./$(PROJECT) $(OPTS)
run-asesim: $(PROJECT) kernel.pocl
LD_LIBRARY_PATH=$(POCL_RT_PATH)/lib:$(VORTEX_DRV_PATH)/asesim:$(LD_LIBRARY_PATH) ./$(PROJECT) $(OPTS)
run-vlsim: $(PROJECT) kernel.pocl
LD_LIBRARY_PATH=$(POCL_RT_PATH)/lib:$(VORTEX_DRV_PATH)/vlsim:$(LD_LIBRARY_PATH) ./$(PROJECT) $(OPTS)
run-simx: $(PROJECT) kernel.pocl
LD_LIBRARY_PATH=$(POCL_RT_PATH)/lib:$(VORTEX_DRV_PATH)/simx:$(LD_LIBRARY_PATH) ./$(PROJECT) $(OPTS)
run-rtlsim: $(PROJECT) kernel.pocl
LD_LIBRARY_PATH=$(POCL_RT_PATH)/lib:$(VORTEX_DRV_PATH)/rtlsim:$(LD_LIBRARY_PATH) ./$(PROJECT) $(OPTS)
.depend: $(SRCS)
$(CXX) $(CXXFLAGS) -MM $^ > .depend;
clean:
rm -rf $(PROJECT) *.o .depend
clean-all: clean
rm -rf *.pocl *.dump
ifneq ($(MAKECMDGOALS),clean)
-include .depend
endif
```
First, build the host program.
```bash
make all
```
If we want to execute on SIMX, we can execute the command below.
```bash
make run-simx
```

View File

@@ -9,30 +9,21 @@ OPAE Environment Setup
$ export C_INCLUDE_PATH=$OPAE_HOME/include:$C_INCLUDE_PATH
$ export LIBRARY_PATH=$OPAE_HOME/lib:$LIBRARY_PATH
$ export LD_LIBRARY_PATH=$OPAE_HOME/lib:$LD_LIBRARY_PATH
$ export RISCV_TOOLCHAIN_PATH=/opt/riscv-gnu-toolchain
$ export PATH=:/opt/verilator/bin:$PATH
$ export VERILATOR_ROOT=/opt/verilator
OPAE Build
------------------
The FPGA has the following configuration options:
- 1 core fpga (fpga-1c)
- 2 cores fpga (fpga-2c)
- 4 cores fpga (fpga-4c)
- 8 cores fpga (fpga-8c)
- 16 cores fpga (fpga-16c)
- 32 cores fpga (fpga-32c)
- 64 cores fpga (fpga-64c)
- DEVICE_FAMILY=arria10 | stratix10
- NUM_CORES=#n
Command line:
$ cd hw/syn/opae
$ make fpga-<num-of-cores>c
$ cd hw/syn/altera/opae
$ PREFIX=test1 TARGET=fpga NUM_CORES=4 make
Example: `make fpga-4c`
A new folder (ex: `build_fpga_4c`) will be created and the build will start and take ~30-480 min to complete.
A new folder (ex: `test1_xxx_4c`) will be created and the build will start and take ~30-480 min to complete.
Setting TARGET=ase will build the project for simulation using Intel ASE.
OPAE Build Configuration
@@ -45,35 +36,32 @@ The hardware configuration file `/hw/rtl/VX_config.vh` defines all the hardware
You configure the synthesis build from the command line:
$ CONFIGS="-DPERF_ENABLE -DNUM_THREADS=8" make fpga-4c
$ CONFIGS="-DPERF_ENABLE -DNUM_THREADS=8" make
OPAE Build Progress
-------------------
You could check the last 10 lines in the build log for possible errors until build completion.
$ tail -n 10 ./build_fpga_<num-of-cores>c/build.log
$ tail -n 10 <build_dir>/build.log
Check if the build is still running by looking for quartus_sh, quartus_syn, or quartus_fit programs.
$ ps -u <username>
If the build fails and you need to restart it, clean up the build folder using the following command:
$ make clean-fpga-<num-of-cores>c
Example: `make clean-fpga-4c`
$ make clean
The file `vortex_afu.gbs` should exist when the build is done:
$ ls -lsa ./build_fpga_<num-of-cores>c/vortex_afu.gbs
$ ls -lsa <build_dir>/vortex_afu.gbs
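The three checks above (bitstream present, Quartus still running, tail of the log) can be combined into one helper. This is a minimal sketch, not part of the Vortex scripts: `check_build` is a hypothetical name, and it assumes `pgrep` is available on the build host. The file names `build.log` and `vortex_afu.gbs` are the ones given in the text.

```bash
# Sketch: report the state of an OPAE synthesis build directory.
# check_build is a hypothetical helper, not a Vortex script.
check_build() {
  build_dir="$1"
  if [ -f "$build_dir/vortex_afu.gbs" ]; then
    # The bitstream exists, so the build has completed.
    echo "done"
  elif pgrep -u "$(id -un)" -f 'quartus_(sh|syn|fit)' > /dev/null 2>&1; then
    # A Quartus process owned by the current user is still alive.
    echo "running"
  else
    echo "stopped"
    # Show the end of the log, if there is one, to look for errors.
    if [ -f "$build_dir/build.log" ]; then
      tail -n 10 "$build_dir/build.log"
    fi
  fi
}
```

Usage: `check_build <build_dir>`. Note that the process check is not tied to a particular build directory, so it can report "running" for an unrelated Quartus job started by the same user.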
Signing the bitstream and Programming the FPGA
----------------------------------------------
$ cd ./build_fpga_<num-of-cores>c
$ cd <build_dir>
$ PACSign PR -t UPDATE -H openssl_manager -i vortex_afu.gbs -o vortex_afu_unsigned_ssl.gbs
$ fpgasupdate vortex_afu_unsigned_ssl.gbs


@@ -11,10 +11,10 @@
- [Debugging](debugging.md)
- [Useful Links](references.md)
## Installation
- Refer to the build instructions in [README](../README.md).
- For the different environments Vortex supports, [read this document](environment_setup.md).
- To install on your own system, [follow this document](install_vortex.md).
## Quick Start Scenarios
@@ -22,9 +22,11 @@ Running Vortex simulators with different configurations:
- Run basic driver test with rtlsim driver and Vortex config of 2 clusters, 2 cores, 2 warps, 4 threads
$ ./ci/blackbox.sh --driver=rtlsim --clusters=2 --cores=2 --warps=2 --threads=4 --app=basic
- Run demo driver test with vlsim driver and Vortex config of 1 cluster, 4 cores, 4 warps, 2 threads
$ ./ci/blackbox.sh --driver=vlsim --clusters=1 --cores=4 --warps=4 --threads=2 --app=demo
- Run demo driver test with opae driver and Vortex config of 1 cluster, 4 cores, 4 warps, 2 threads
$ ./ci/blackbox.sh --driver=opae --clusters=1 --cores=4 --warps=4 --threads=2 --app=demo
- Run dogfood driver test with simx driver and Vortex config of 4 clusters, 4 cores, 8 warps, 6 threads
$ ./ci/blackbox.sh --driver=simx --clusters=4 --cores=4 --warps=8 --threads=6 --app=dogfood
$ ./ci/blackbox.sh --driver=simx --clusters=4 --cores=4 --warps=8 --threads=6 --app=dogfood

docs/install_vortex.md Normal file

@@ -0,0 +1,124 @@
# Installing and Setting Up the Vortex Environment
## Ubuntu 18.04, 20.04
1. Install the following dependencies:
```
sudo apt-get install build-essential zlib1g-dev libtinfo-dev libncurses5 uuid-dev libboost-serialization-dev libpng-dev libhwloc-dev
```
2. Upgrade gcc to 11:
```
sudo apt-get install gcc-11 g++-11
```
Multiple gcc versions on Ubuntu can be managed with update-alternatives, e.g.:
```
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 9
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-9 9
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-11 11
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-11 11
```
3. Download the Vortex codebase:
```
git clone --recursive https://github.com/vortexgpgpu/vortex.git
```
4. Install Vortex's prebuilt toolchain:
```
cd vortex
sudo ./ci/toolchain_install.sh -all
# By default, the toolchain installs to the /opt folder. This is recommended, but you can install the toolchain to a different directory by setting DESTDIR.
DESTDIR=$TOOLDIR ./ci/toolchain_install.sh -all
```
5. Set up environment:
```
export VORTEX_HOME=$TOOLDIR/vortex
export LLVM_VORTEX=$TOOLDIR/llvm-vortex
export LLVM_POCL=$TOOLDIR/llvm-pocl
export POCL_CC_PATH=$TOOLDIR/pocl/compiler
export POCL_RT_PATH=$TOOLDIR/pocl/runtime
export RISCV_TOOLCHAIN_PATH=$TOOLDIR/riscv-gnu-toolchain
export VERILATOR_ROOT=$TOOLDIR/verilator
export SV2V_PATH=$TOOLDIR/sv2v
export YOSYS_PATH=$TOOLDIR/yosys
export PATH=$YOSYS_PATH/bin:$SV2V_PATH/bin:$VERILATOR_ROOT/bin:$PATH
```
6. Build Vortex
```
make
```
## RHEL 8
Note: depending on the system, some of the toolchain may need to be recompiled for non-Ubuntu Linux. The source for the tools can be found [here](https://github.com/vortexgpgpu/).
1. Install the following dependencies:
```
sudo yum install libpng-devel boost boost-devel boost-serialization libuuid-devel opencl-headers hwloc hwloc-devel gmp-devel compat-hwloc1
```
2. Upgrade gcc to 11:
```
sudo yum install gcc-toolset-11
```
Multiple gcc versions on Red Hat can be managed with scl
3. Install MPFR 4.2.0:
Download [the source](https://ftp.gnu.org/gnu/mpfr/) and follow [the installation documentation](https://www.mpfr.org/mpfr-current/mpfr.html#How-to-Install).
4. Download the Vortex codebase:
```
git clone --recursive https://github.com/vortexgpgpu/vortex.git
```
5. Install Vortex's prebuilt toolchain:
```
cd vortex
sudo ./ci/toolchain_install.sh -all
# By default, the toolchain installs to the /opt folder. This is recommended, but you can install the toolchain to a different directory by setting DESTDIR.
DESTDIR=$TOOLDIR ./ci/toolchain_install.sh -all
```
6. Set up environment:
```
export VORTEX_HOME=$TOOLDIR/vortex
export LLVM_VORTEX=$TOOLDIR/llvm-vortex
export LLVM_POCL=$TOOLDIR/llvm-pocl
export POCL_CC_PATH=$TOOLDIR/pocl/compiler
export POCL_RT_PATH=$TOOLDIR/pocl/runtime
export RISCV_TOOLCHAIN_PATH=$TOOLDIR/riscv-gnu-toolchain
export VERILATOR_ROOT=$TOOLDIR/verilator
export SV2V_PATH=$TOOLDIR/sv2v
export YOSYS_PATH=$TOOLDIR/yosys
export PATH=$YOSYS_PATH/bin:$SV2V_PATH/bin:$VERILATOR_ROOT/bin:$PATH
export LD_LIBRARY_PATH=<path to mpfr>/src/.libs:$LD_LIBRARY_PATH
```
7. Build Vortex
```
make
```
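Before running `make`, it can help to confirm that the variables from the "Set up environment" step are exported and point at real directories. The following is a minimal sketch under that assumption; `missing_vars` is a hypothetical helper and not part of the Vortex repository. The variable names are the ones listed in the setup step.

```bash
# Sketch: print every toolchain variable that is unset, empty, or does not
# name an existing directory. missing_vars is a hypothetical helper.
missing_vars() {
  for name in VORTEX_HOME LLVM_VORTEX LLVM_POCL POCL_CC_PATH POCL_RT_PATH \
              RISCV_TOOLCHAIN_PATH VERILATOR_ROOT SV2V_PATH YOSYS_PATH; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ] || [ ! -d "$value" ]; then
      echo "$name"
    fi
  done
}
```

Calling `missing_vars` prints one variable name per line and prints nothing when every variable is usable.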


@@ -24,71 +24,57 @@ Vortex uses the SIMT (Single Instruction, Multiple Threads) execution model with
- Control the number of warps to activate during execution
- `WSPAWN` *count, addr*: activate count warps and jump to addr location
- **Control-Flow Divergence**
- Control threads to activate when a branch diverges
- `SPLIT` *predicate*: apply 'taken' predicate thread mask and save 'not-taken' into IPDOM stack
- `JOIN`: restore 'not-taken' thread mask
- Control thread activation when a branch diverges
- `SPLIT` *taken, predicate*: apply predicate thread mask and save current state into IPDOM stack
- `JOIN`: pop IPDOM stack to restore thread mask
- `PRED` *predicate, restore_mask*: thread predicate instruction
- **Warp Synchronization**
- `BAR` *id, count*: stall warps entering barrier *id* until count is reached
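The interaction between `SPLIT`, `JOIN`, and the IPDOM stack can be modeled with thread masks stored as integers. The sketch below is a simplified illustration, not the hardware behavior in full: it assumes that a divergent `SPLIT` pushes the original mask and then the not-taken mask, it ignores program counters, and it does nothing for a non-divergent branch. The names `split`, `join`, `tmask`, and `ipdom_stack` are hypothetical.

```bash
# Simplified model of control-flow divergence using thread masks.
ipdom_stack=""   # space-separated saved masks, leftmost entry is the top
tmask=15         # four active threads (binary 1111)

split() {        # $1: per-thread predicate bits
  taken=$(( tmask & $1 ))
  not_taken=$(( tmask & ~$1 ))
  if [ "$taken" -ne 0 ] && [ "$not_taken" -ne 0 ]; then
    # Divergent branch: save the original mask for reconvergence, then the
    # not-taken mask on top of it, and continue with the taken threads.
    ipdom_stack="$not_taken $tmask $ipdom_stack"
    tmask=$taken
  fi
}

join() {
  # Pop the top of the stack into the active thread mask.
  [ -n "$ipdom_stack" ] || return 0
  tmask=${ipdom_stack%% *}
  ipdom_stack=${ipdom_stack#* }
}
```

With four active threads, `split 3` leaves threads 0 and 1 active (`tmask=3`). The first `join` switches to the not-taken threads (`tmask=12`), and the second `join` restores the full mask (`tmask=15`).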
### Vortex Pipeline/Datapath
![Image of Vortex Microarchitecture](./assets/img/vortex_microarchitecture_v2.png)
![Image of Vortex Microarchitecture](./assets/img/vortex_microarchitecture.png)
Vortex has a 5-stage pipeline: FI | ID | Issue | EX | WB.
Vortex has a 6-stage pipeline:
- **Schedule**
- Warp Scheduler
- Schedule the next PC into the pipeline
- Track stalled, active warps
- IPDOM Stack
- Save split/join states for divergent threads
- Inflight Tracker
- Track in-flight instructions
- **Fetch**
- Warp Scheduler
- Track stalled & active warps, resolve branches and barriers, maintain split/join IPDOM stack
- Instruction Cache
- Retrieve instruction from cache, issue I-cache requests/responses
- Retrieve instructions from memory
- Handle I-cache requests/responses
- **Decode**
- Decode fetched instructions, notify warp scheduler when the following instructions are decoded:
- Branch, tmc, split/join, wspawn
- Precompute used_regs mask (needed for Issue stage)
- Decode fetched instructions
- Notify warp scheduler on control instructions
- **Issue**
- Scheduling
- In-order issue (operands/execute unit ready), out-of-order commit
- IBuffer
- Store fetched instructions, separate queues per-warp, selects next warp through round-robin scheduling
- Store decoded instructions in separate per-warp queues
- Scoreboard
- Track in-use registers
- GPRs (General-Purpose Registers) stage
- Fetch issued instruction operands and send operands to execute unit
- Check register use for decoded instructions
- Operands Collector
- Fetch the operands for issued instructions from the register file
- **Execute**
- ALU Unit
- Single-cycle operations (+,-,>>,<<,&,|,^), Branch instructions (Share ALU resources)
- MULDIV Unit
- Multiplier - done in 2 cycles
- Divider - division and remainder, done in 32 cycles
- Implements serial algorithm (stalls the pipeline)
- Handle arithmetic and branch operations
- FPU Unit
- Multi-cycle operations, uses `FPnew` Library on ASIC, uses hard DSPs on FPGA
- CSR Unit
- Store constant status registers - device caps, FPU status flags, performance counters
- Handle external CSR requests (requests from host CPU)
- Handle floating-point operations
- LSU Unit
- Handle load/store operations, issue D-cache requests, handle D-cache responses
- Commit load responses - saves storage, Scoreboard tracks completion
- GPGPU Unit
- Handle GPGPU instructions
- TMC, WSPAWN, SPLIT, BAR
- JOIN is handled by Warp Scheduler (upon SPLIT response)
- Handle load/store operations
- SFU Unit
- Handle warp control operations
- Handle Control and Status Register (CSR) operations
- **Commit**
- Commit
- Update CSR flags, update performance counters
- Writeback
- Write result back to GPRs, notify Scoreboard (release in-use register), select candidate instruction (ALU unit has highest priority)
- **Clustering**
- Group multiple cores into clusters (optionally share L2 cache)
- Group multiple clusters (optionally share L3 cache)
- Configurable at build time
- Default configuration:
- #Clusters = 1
- #Cores = 4
- #Warps = 4
- #Threads = 4
- **FPGA AFU Interface**
- Manage CPU-GPU communication
- Query devices caps, load kernel instructions and resource buffers, start kernel execution, read destination buffers
- Local Memory - GPU access to local DRAM
- Reserved I/O addresses - redirect to host CPU, console output
- Write result back to the register file and update the Scoreboard.
### Vortex clustering architecture
- Sockets
- Grouping multiple cores sharing L1 cache
- Clusters
- Grouping of sockets sharing L2 cache


@@ -10,7 +10,7 @@ SimX is a C++ cycle-level in-house simulator developed for Vortex. The relevant
### FPGA Simulation
The current target FPGA for simulation is the Arria10 Intel Accelerator Card v1.0. The guide to build the fpga with specific configurations is located [here.](https://github.com/vortexgpgpu/vortex-dev/blob/master/doc/FPGA_Startup_Guide.md)
The current target FPGA for simulation is the Arria10 Intel Accelerator Card v1.0. The guide to build the fpga with specific configurations is located [here.](fpga_setup.md)
### How to Test
@@ -22,15 +22,15 @@ Running tests under specific drivers (rtlsim,simx,fpga) is done using the script
- *Threads* - used to specify the number of threads (smallest unit of computation) within a configuration.
- *L2cache* - used to enable the shared l2cache among the Vortex cores.
- *L3cache* - used to enable the shared l3cache among the Vortex clusters.
- *Driver* - used to specify which driver to run the Vortex simulation (either rtlsim, vlsim, fpga, or simx).
- *Driver* - used to specify which driver to run the Vortex simulation (either rtlsim, opae, xrt, simx).
- *Debug* - used to enable debug mode for the Vortex simulation.
- *Perf* - used to enable the detailed performance counters within the Vortex simulation.
- *App* - used to specify which test/benchmark to run in the Vortex simulation. The main choices are vecadd, sgemm, basic, demo, and dogfood. Other tests/benchmarks are located in the `/benchmarks/opencl` folder, though not all of them work with the current version of Vortex.
- *Args* - used to pass additional arguments to the application.
Example use of command line arguments: Run the sgemm benchmark using the vlsim driver with a Vortex configuration of 1 cluster, 4 cores, 4 warps, and 4 threads.
Example use of command line arguments: Run the sgemm benchmark using the opae driver with a Vortex configuration of 1 cluster, 4 cores, 4 warps, and 4 threads.
$ ./ci/blackbox.sh --clusters=1 --cores=4 --warps=4 --threads=4 --driver=vlsim --app=sgemm
$ ./ci/blackbox.sh --clusters=1 --cores=4 --warps=4 --threads=4 --driver=opae --app=sgemm
Output from terminal:
```

docs/testing.md Normal file

@@ -0,0 +1,47 @@
# Testing
## Running a Vortex application
The framework provides a utility script: blackbox.sh under the /ci/ folder for executing applications in the tests tree.
You can query the commandline options of the tool using:
$ ./ci/blackbox.sh --help
To execute the sgemm test program on the simx driver, passing "-n10" as an argument to sgemm:
$ ./ci/blackbox.sh --driver=simx --app=sgemm --args="-n10"
You can execute the same application on a GPU architecture with 2 cores:
$ ./ci/blackbox.sh --cores=2 --driver=simx --app=sgemm --args="-n10"
When executing, Blackbox needs to recompile the driver if the desired architecture changes.
It tracks the latest configuration in the file blackbox.<driver>.cache under the current directory.
To avoid rebuilding the driver on every run, Blackbox checks whether the latest cached configuration matches the current one.
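The caching idea can be sketched as follows. This is an illustration of the technique only and is not the code in `ci/blackbox.sh`: `needs_rebuild` is a hypothetical helper, and storing the configuration as a single string of flags is an assumption. The cache file name follows the `blackbox.<driver>.cache` pattern described above.

```bash
# Sketch: decide whether the driver must be rebuilt by comparing the
# requested configuration with the one cached from the previous run.
# needs_rebuild is a hypothetical helper, not part of ci/blackbox.sh.
needs_rebuild() {
  driver="$1"
  configs="$2"
  cache_file="blackbox.${driver}.cache"
  if [ -f "$cache_file" ] && [ "$(cat "$cache_file")" = "$configs" ]; then
    return 1   # cached configuration matches: reuse the existing build
  fi
  # First run or changed configuration: record it and request a rebuild.
  printf '%s\n' "$configs" > "$cache_file"
  return 0
}
```

A caller would then write something like `if needs_rebuild simx "-DNUM_CORES=2"; then make -C <driver_dir> clean; fi` before building, where `<driver_dir>` stands for wherever the driver sources live.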
## Running Benchmarks
The Vortex test suite is located under the /tests/ folder.
You can execute the default regression suite by running the following commands at the root folder.
$ make -C tests/regression run-simx
$ make -C tests/regression run-rtlsim
You can execute the default OpenCL suite by running the following commands at the root folder.
$ make -C tests/opencl run-simx
$ make -C tests/opencl run-rtlsim
## Creating Your Own Regression Tests
- Inside `tests/` you will find a series of folders named after what they test
- You can browse them to find tests similar to the ones you want to create
- Once you have found a similar baseline, you can copy the folder and rename it to what you are planning to test
- `testcases.h` contains each of the test case templates
- `main.cpp` contains the implementation of each of the test cases and builds a test suite of all the test cases you want
Compile the test case: `make -C tests/regression/<testcase-name>/ clean-all && make -C tests/regression/<testcase-name>/`
Run the test case: `./ci/blackbox.sh --driver=simx --cores=4 --app=<testcase-name> --debug`
## Adding Your Tests to the CI Pipeline
See `continuous_integration.md`.


@@ -1,29 +0,0 @@
all: stub rtlsim simx vlsim
stub:
$(MAKE) -C stub
fpga:
$(MAKE) -C fpga
asesim:
$(MAKE) -C asesim
vlsim:
$(MAKE) -C vlsim
rtlsim:
$(MAKE) -C rtlsim
simx:
$(MAKE) -C simx
clean:
$(MAKE) clean -C stub
$(MAKE) clean -C fpga
$(MAKE) clean -C asesim
$(MAKE) clean -C vlsim
$(MAKE) clean -C rtlsim
$(MAKE) clean -C simx
.PHONY: all stub fpga asesim vlsim rtlsim simx clean


@@ -1,73 +0,0 @@
OPAE_HOME ?= /tools/opae/1.4.0
RTL_DIR=../../hw/rtl
SCRIPT_DIR=../../hw/scripts
OPAE_SYN_DIR=../../hw/syn/opae
CXXFLAGS += -std=c++11 -Wall -Wextra -pedantic -Wfatal-errors
CXXFLAGS += -I. -I../include -I../../hw -I$(OPAE_HOME)/include -I$(OPAE_SYN_DIR)
LDFLAGS += -L$(OPAE_HOME)/lib -luuid -lopae-c-ase
# stack execution protection
LDFLAGS +=-z noexecstack
# data relocation and protection
LDFLAGS +=-z relro -z now
# stack buffer overrun detection
CXXFLAGS +=-fstack-protector
# Position independent code
CXXFLAGS += -fPIC
# Add external configuration
CXXFLAGS += $(CONFIGS)
# Dump perf stats
CXXFLAGS += -DDUMP_PERF_STATS
LDFLAGS += -shared
PROJECT = libvortex.so
SRCS = ../common/opae.cpp ../common/vx_utils.cpp
# Debugging
ifdef DEBUG
CXXFLAGS += -g -O0
else
CXXFLAGS += -O2 -DNDEBUG
endif
# Enable scope analyzer
ifdef SCOPE
CXXFLAGS += -DSCOPE
SRCS += ../common/vx_scope.cpp
SCOPE_H = scope-defs.h
endif
# Enable perf counters
ifdef PERF
CXXFLAGS += -DPERF_ENABLE
endif
all: $(PROJECT)
$(OPAE_SYN_DIR)/vortex_afu.h:
$(MAKE) -C $(OPAE_SYN_DIR) vortex_afu.h
scope-defs.h: $(SCRIPT_DIR)/scope.json
$(SCRIPT_DIR)/scope.py $(CONFIGS) -cc scope-defs.h -vl $(RTL_DIR)/scope-defs.vh $(SCRIPT_DIR)/scope.json
# generate scope data
scope: scope-defs.h
$(PROJECT): $(SRCS) $(OPAE_SYN_DIR)/vortex_afu.h $(SCOPE_H)
$(CXX) $(CXXFLAGS) -DUSE_ASE $(SRCS) $(LDFLAGS) -o $(PROJECT)
clean:
rm -rf $(PROJECT) *.o scope-defs.h


@@ -1,535 +0,0 @@
#include <stdint.h>
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <cstdlib>
#include <unistd.h>
#include <assert.h>
#include <cmath>
#include <sstream>
#include <unordered_map>
#include <list>
#if defined(USE_FPGA) || defined(USE_ASE)
#include <opae/fpga.h>
#include <uuid/uuid.h>
#elif defined(USE_VLSIM)
#include <fpga.h>
#endif
#include "vx_utils.h"
#include "vx_malloc.h"
#include <vortex.h>
#include <VX_config.h>
#include "vortex_afu.h"
#ifdef SCOPE
#include "vx_scope.h"
#endif
#define CHECK_RES(_expr) \
do { \
fpga_result res = _expr; \
if (res == FPGA_OK) \
break; \
printf("[VXDRV] Error: '%s' returned %d, %s!\n", \
#_expr, (int)res, fpgaErrStr(res)); \
return -1; \
} while (false)
///////////////////////////////////////////////////////////////////////////////
#define CMD_MEM_READ AFU_IMAGE_CMD_MEM_READ
#define CMD_MEM_WRITE AFU_IMAGE_CMD_MEM_WRITE
#define CMD_RUN AFU_IMAGE_CMD_RUN
#define MMIO_CMD_TYPE (AFU_IMAGE_MMIO_CMD_TYPE * 4)
#define MMIO_IO_ADDR (AFU_IMAGE_MMIO_IO_ADDR * 4)
#define MMIO_MEM_ADDR (AFU_IMAGE_MMIO_MEM_ADDR * 4)
#define MMIO_DATA_SIZE (AFU_IMAGE_MMIO_DATA_SIZE * 4)
#define MMIO_DEV_CAPS (AFU_IMAGE_MMIO_DEV_CAPS * 4)
#define MMIO_STATUS (AFU_IMAGE_MMIO_STATUS * 4)
#define STATUS_STATE_BITS 8
///////////////////////////////////////////////////////////////////////////////
class vx_device {
public:
vx_device()
: mem_allocator(
ALLOC_BASE_ADDR,
ALLOC_BASE_ADDR + LOCAL_MEM_SIZE,
4096,
CACHE_BLOCK_SIZE)
{}
~vx_device() {}
fpga_handle fpga;
vortex::MemoryAllocator mem_allocator;
unsigned version;
unsigned num_cores;
unsigned num_warps;
unsigned num_threads;
};
typedef struct vx_buffer_ {
uint64_t wsid;
void* host_ptr;
uint64_t io_addr;
vx_device_h hdevice;
uint64_t size;
} vx_buffer_t;
///////////////////////////////////////////////////////////////////////////////
#ifdef DUMP_PERF_STATS
class AutoPerfDump {
private:
std::list<vx_device_h> devices_;
public:
AutoPerfDump() {}
~AutoPerfDump() {
for (auto device : devices_) {
vx_dump_perf(device, stdout);
}
}
void add_device(vx_device_h device) {
devices_.push_back(device);
}
void remove_device(vx_device_h device) {
devices_.remove(device);
}
};
AutoPerfDump gAutoPerfDump;
#endif
///////////////////////////////////////////////////////////////////////////////
extern int vx_dev_caps(vx_device_h hdevice, uint32_t caps_id, uint64_t *value) {
if (nullptr == hdevice)
return -1;
vx_device *device = ((vx_device*)hdevice);
switch (caps_id) {
case VX_CAPS_VERSION:
*value = device->version;
break;
case VX_CAPS_MAX_CORES:
*value = device->num_cores;
break;
case VX_CAPS_MAX_WARPS:
*value = device->num_warps;
break;
case VX_CAPS_MAX_THREADS:
*value = device->num_threads;
break;
case VX_CAPS_CACHE_LINE_SIZE:
*value = CACHE_BLOCK_SIZE;
break;
case VX_CAPS_LOCAL_MEM_SIZE:
*value = LOCAL_MEM_SIZE;
break;
case VX_CAPS_ALLOC_BASE_ADDR:
*value = ALLOC_BASE_ADDR;
break;
case VX_CAPS_KERNEL_BASE_ADDR:
*value = STARTUP_ADDR;
break;
default:
fprintf(stderr, "[VXDRV] Error: invalid caps id: %d\n", caps_id);
std::abort();
return -1;
}
return 0;
}
extern int vx_dev_open(vx_device_h* hdevice) {
if (nullptr == hdevice)
return -1;
fpga_handle accel_handle;
vx_device* device;
#ifndef USE_VLSIM
fpga_result res;
fpga_token accel_token;
fpga_properties filter = nullptr;
fpga_guid guid;
uint32_t num_matches;
// Set up a filter that will search for an accelerator
CHECK_RES(fpgaGetProperties(nullptr, &filter));
res = fpgaPropertiesSetObjectType(filter, FPGA_ACCELERATOR);
if (res != FPGA_OK) {
fprintf(stderr, "[VXDRV] Error: fpgaPropertiesSetObjectType() returned %d, %s!\n", (int)res, fpgaErrStr(res));
fpgaDestroyProperties(&filter);
return -1;
}
// Add the desired UUID to the filter
uuid_parse(AFU_ACCEL_UUID, guid);
res = fpgaPropertiesSetGUID(filter, guid);
if (res != FPGA_OK) {
fprintf(stderr, "[VXDRV] Error: fpgaPropertiesSetGUID() returned %d, %s!\n", (int)res, fpgaErrStr(res));
fpgaDestroyProperties(&filter);
return -1;
}
// Do the search across the available FPGA contexts
num_matches = 1;
res = fpgaEnumerate(&filter, 1, &accel_token, 1, &num_matches);
if (res != FPGA_OK) {
fprintf(stderr, "[VXDRV] Error: fpgaEnumerate() returned %d, %s!\n", (int)res, fpgaErrStr(res));
fpgaDestroyProperties(&filter);
return -1;
}
// Not needed anymore
fpgaDestroyProperties(&filter);
if (num_matches < 1) {
fprintf(stderr, "[VXDRV] Error: accelerator %s not found!\n", AFU_ACCEL_UUID);
fpgaDestroyToken(&accel_token);
return -1;
}
// Open accelerator
res = fpgaOpen(accel_token, &accel_handle, 0);
if (res != FPGA_OK) {
fprintf(stderr, "[VXDRV] Error: fpgaOpen() returned %d, %s!\n", (int)res, fpgaErrStr(res));
fpgaDestroyToken(&accel_token);
return -1;
}
// Done with token
fpgaDestroyToken(&accel_token);
#else
// Open accelerator
CHECK_RES(fpgaOpen(NULL, &accel_handle, 0));
#endif
// allocate device object
device = new vx_device();
if (nullptr == device) {
fpgaClose(accel_handle);
return -1;
}
device->fpga = accel_handle;
{
// Load device CAPS
uint64_t dev_caps;
int ret = fpgaReadMMIO64(device->fpga, 0, MMIO_DEV_CAPS, &dev_caps);
if (ret != FPGA_OK) {
fpgaClose(accel_handle);
return ret;
}
device->version = (dev_caps >> 0) & 0xffff;
device->num_cores = (dev_caps >> 16) & 0xffff;
device->num_warps = (dev_caps >> 32) & 0xffff;
device->num_threads = (dev_caps >> 48) & 0xffff;
#ifndef NDEBUG
fprintf(stdout, "[VXDRV] DEVCAPS: version=%d, num_cores=%d, num_warps=%d, num_threads=%d\n",
device->version, device->num_cores, device->num_warps, device->num_threads);
#endif
}
#ifdef SCOPE
{
int ret = vx_scope_start(accel_handle, 0, -1);
if (ret != 0) {
fpgaClose(accel_handle);
return ret;
}
}
#endif
*hdevice = device;
#ifdef DUMP_PERF_STATS
gAutoPerfDump.add_device(*hdevice);
#endif
return 0;
}
extern int vx_dev_close(vx_device_h hdevice) {
if (nullptr == hdevice)
return -1;
vx_device *device = ((vx_device*)hdevice);
#ifdef SCOPE
vx_scope_stop(device->fpga);
#endif
#ifdef DUMP_PERF_STATS
gAutoPerfDump.remove_device(hdevice);
vx_dump_perf(hdevice, stdout);
#endif
fpgaClose(device->fpga);
delete device;
return 0;
}
extern int vx_mem_alloc(vx_device_h hdevice, uint64_t size, uint64_t* dev_maddr) {
if (nullptr == hdevice
|| nullptr == dev_maddr
|| 0 >= size)
return -1;
vx_device *device = ((vx_device*)hdevice);
return device->mem_allocator.allocate(size, dev_maddr);
}
extern int vx_mem_free(vx_device_h hdevice, uint64_t dev_maddr) {
if (nullptr == hdevice)
return -1;
vx_device *device = ((vx_device*)hdevice);
return device->mem_allocator.release(dev_maddr);
}
extern int vx_buf_alloc(vx_device_h hdevice, uint64_t size, vx_buffer_h* hbuffer) {
fpga_result res;
void* host_ptr;
uint64_t wsid;
uint64_t io_addr;
vx_buffer_t* buffer;
if (nullptr == hdevice
|| 0 >= size
|| nullptr == hbuffer)
return -1;
vx_device *device = ((vx_device*)hdevice);
size_t asize = aligned_size(size, CACHE_BLOCK_SIZE);
res = fpgaPrepareBuffer(device->fpga, asize, &host_ptr, &wsid, 0);
if (FPGA_OK != res) {
return -1;
}
// Get the physical address of the buffer in the accelerator
res = fpgaGetIOAddress(device->fpga, wsid, &io_addr);
if (FPGA_OK != res) {
fpgaReleaseBuffer(device->fpga, wsid);
return -1;
}
// allocate buffer object
buffer = (vx_buffer_t*)malloc(sizeof(vx_buffer_t));
if (nullptr == buffer) {
fpgaReleaseBuffer(device->fpga, wsid);
return -1;
}
buffer->wsid = wsid;
buffer->host_ptr = host_ptr;
buffer->io_addr = io_addr;
buffer->hdevice = hdevice;
buffer->size = asize;
*hbuffer = buffer;
return 0;
}
extern void* vx_host_ptr(vx_buffer_h hbuffer) {
if (nullptr == hbuffer)
return nullptr;
vx_buffer_t* buffer = ((vx_buffer_t*)hbuffer);
return buffer->host_ptr;
}
extern int vx_buf_free(vx_buffer_h hbuffer) {
if (nullptr == hbuffer)
return -1;
vx_buffer_t* buffer = ((vx_buffer_t*)hbuffer);
vx_device *device = ((vx_device*)buffer->hdevice);
fpgaReleaseBuffer(device->fpga, buffer->wsid);
free(buffer);
return 0;
}
extern int vx_ready_wait(vx_device_h hdevice, uint64_t timeout) {
if (nullptr == hdevice)
return -1;
std::unordered_map<uint32_t, std::stringstream> print_bufs;
vx_device *device = ((vx_device*)hdevice);
struct timespec sleep_time;
#if defined(USE_ASE)
sleep_time.tv_sec = 1;
sleep_time.tv_nsec = 0;
#else
sleep_time.tv_sec = 0;
sleep_time.tv_nsec = 1000000;
#endif
// to milliseconds
uint64_t sleep_time_ms = (sleep_time.tv_sec * 1000) + (sleep_time.tv_nsec / 1000000);
for (;;) {
uint64_t status;
CHECK_RES(fpgaReadMMIO64(device->fpga, 0, MMIO_STATUS, &status));
// check for console data
uint32_t cout_data = status >> STATUS_STATE_BITS;
if (cout_data & 0x1) {
// retrieve console data
do {
char cout_char = (cout_data >> 1) & 0xff;
uint32_t cout_tid = (cout_data >> 9) & 0xff;
auto& ss_buf = print_bufs[cout_tid];
ss_buf << cout_char;
if (cout_char == '\n') {
std::cout << std::dec << "#" << cout_tid << ": " << ss_buf.str() << std::flush;
ss_buf.str("");
}
CHECK_RES(fpgaReadMMIO64(device->fpga, 0, MMIO_STATUS, &status));
cout_data = status >> STATUS_STATE_BITS;
} while (cout_data & 0x1);
}
uint32_t state = status & ((1 << STATUS_STATE_BITS)-1);
if (0 == state || 0 == timeout) {
for (auto& buf : print_bufs) {
auto str = buf.second.str();
if (!str.empty()) {
std::cout << "#" << buf.first << ": " << str << std::endl;
}
}
if (state != 0) {
fprintf(stdout, "[VXDRV] ready-wait timed out: state=%d\n", state);
}
break;
}
nanosleep(&sleep_time, nullptr);
timeout -= sleep_time_ms;
};
return 0;
}
extern int vx_copy_to_dev(vx_buffer_h hbuffer, uint64_t dev_maddr, uint64_t size, uint64_t src_offset) {
if (nullptr == hbuffer
|| 0 >= size)
return -1;
vx_buffer_t *buffer = ((vx_buffer_t*)hbuffer);
vx_device *device = ((vx_device*)buffer->hdevice);
uint64_t dev_mem_size = LOCAL_MEM_SIZE;
uint64_t asize = aligned_size(size, CACHE_BLOCK_SIZE);
// check alignment
if (!is_aligned(dev_maddr, CACHE_BLOCK_SIZE))
return -1;
if (!is_aligned(buffer->io_addr + src_offset, CACHE_BLOCK_SIZE))
return -1;
// bound checking
if (src_offset + asize > buffer->size)
return -1;
if (dev_maddr + asize > dev_mem_size)
return -1;
// Ensure ready for new command
if (vx_ready_wait(buffer->hdevice, MAX_TIMEOUT) != 0)
return -1;
auto ls_shift = (int)std::log2(CACHE_BLOCK_SIZE);
CHECK_RES(fpgaWriteMMIO64(device->fpga, 0, MMIO_IO_ADDR, (buffer->io_addr + src_offset) >> ls_shift));
CHECK_RES(fpgaWriteMMIO64(device->fpga, 0, MMIO_MEM_ADDR, dev_maddr >> ls_shift));
CHECK_RES(fpgaWriteMMIO64(device->fpga, 0, MMIO_DATA_SIZE, asize >> ls_shift));
CHECK_RES(fpgaWriteMMIO64(device->fpga, 0, MMIO_CMD_TYPE, CMD_MEM_WRITE));
// Wait for the write operation to finish
if (vx_ready_wait(buffer->hdevice, MAX_TIMEOUT) != 0)
return -1;
return 0;
}
extern int vx_copy_from_dev(vx_buffer_h hbuffer, uint64_t dev_maddr, uint64_t size, uint64_t dest_offset) {
if (nullptr == hbuffer
|| 0 >= size)
return -1;
vx_buffer_t *buffer = ((vx_buffer_t*)hbuffer);
vx_device *device = ((vx_device*)buffer->hdevice);
uint64_t dev_mem_size = LOCAL_MEM_SIZE;
uint64_t asize = aligned_size(size, CACHE_BLOCK_SIZE);
// check alignment
if (!is_aligned(dev_maddr, CACHE_BLOCK_SIZE))
return -1;
if (!is_aligned(buffer->io_addr + dest_offset, CACHE_BLOCK_SIZE))
return -1;
// bound checking
if (dest_offset + asize > buffer->size)
return -1;
if (dev_maddr + asize > dev_mem_size)
return -1;
// Ensure ready for new command
if (vx_ready_wait(buffer->hdevice, MAX_TIMEOUT) != 0)
return -1;
auto ls_shift = (int)std::log2(CACHE_BLOCK_SIZE);
CHECK_RES(fpgaWriteMMIO64(device->fpga, 0, MMIO_IO_ADDR, (buffer->io_addr + dest_offset) >> ls_shift));
CHECK_RES(fpgaWriteMMIO64(device->fpga, 0, MMIO_MEM_ADDR, dev_maddr >> ls_shift));
CHECK_RES(fpgaWriteMMIO64(device->fpga, 0, MMIO_DATA_SIZE, asize >> ls_shift));
CHECK_RES(fpgaWriteMMIO64(device->fpga, 0, MMIO_CMD_TYPE, CMD_MEM_READ));
// Wait for the read operation to finish
if (vx_ready_wait(buffer->hdevice, MAX_TIMEOUT) != 0)
return -1;
return 0;
}
extern int vx_start(vx_device_h hdevice) {
if (nullptr == hdevice)
return -1;
vx_device *device = ((vx_device*)hdevice);
// Ensure ready for new command
if (vx_ready_wait(hdevice, MAX_TIMEOUT) != 0)
return -1;
// start execution
CHECK_RES(fpgaWriteMMIO64(device->fpga, 0, MMIO_CMD_TYPE, CMD_RUN));
return 0;
}


@@ -1,250 +0,0 @@
#include "vx_scope.h"
#include <iostream>
#include <fstream>
#include <thread>
#include <chrono>
#include <vector>
#include <assert.h>
#include <chrono>
#include <thread>
#include <mutex>
#include <VX_config.h>
#include <vortex_afu.h>
#include <scope-defs.h>
#define FRAME_FLUSH_SIZE 100
#define CHECK_RES(_expr) \
do { \
fpga_result res = _expr; \
if (res == FPGA_OK) \
break; \
printf("OPAE Error: '%s' returned %d, %s!\n", \
#_expr, (int)res, fpgaErrStr(res)); \
return -1; \
} while (false)
#define MMIO_SCOPE_READ (AFU_IMAGE_MMIO_SCOPE_READ * 4)
#define MMIO_SCOPE_WRITE (AFU_IMAGE_MMIO_SCOPE_WRITE * 4)
#define CMD_GET_VALID 0
#define CMD_GET_DATA 1
#define CMD_GET_WIDTH 2
#define CMD_GET_COUNT 3
#define CMD_SET_START 4
#define CMD_SET_STOP 5
#define CMD_GET_OFFSET 6
static constexpr int num_modules = sizeof(scope_modules) / sizeof(scope_module_t);
static constexpr int num_taps = sizeof(scope_taps) / sizeof(scope_tap_t);
constexpr int calcFrameWidth(int index = 0) {
return (index < num_taps) ? (scope_taps[index].width + calcFrameWidth(index + 1)) : 0;
}
static constexpr int fwidth = calcFrameWidth();
#ifdef HANG_TIMEOUT
static std::thread g_timeout_thread;
static std::mutex g_timeout_mutex;
static void timeout_callback(fpga_handle fpga) {
std::this_thread::sleep_for(std::chrono::seconds{HANG_TIMEOUT});
vx_scope_stop(fpga);
fpgaClose(fpga);
exit(0);
}
#endif
uint64_t print_clock(std::ofstream& ofs, uint64_t delta, uint64_t timestamp) {
while (delta != 0) {
ofs << '#' << timestamp++ << std::endl;
ofs << "b0 0" << std::endl;
ofs << '#' << timestamp++ << std::endl;
ofs << "b1 0" << std::endl;
--delta;
}
return timestamp;
}
void dump_taps(std::ofstream& ofs, int module) {
for (int i = 0; i < num_taps; ++i) {
auto& tap = scope_taps[i];
if (tap.module != module)
continue;
ofs << "$var reg " << tap.width << " " << (i + 1) << " " << tap.name << " $end" << std::endl;
}
}
void dump_module(std::ofstream& ofs, int parent) {
for (auto& module : scope_modules) {
if (module.parent != parent)
continue;
if (module.name[0] == '*') {
ofs << "$var reg 1 0 clk $end" << std::endl;
} else {
ofs << "$scope module " << module.name << " $end" << std::endl;
}
dump_module(ofs, module.index);
dump_taps(ofs, module.index);
if (module.name[0] != '*') {
ofs << "$upscope $end" << std::endl;
}
}
}
int vx_scope_start(fpga_handle hfpga, uint64_t start_time, uint64_t stop_time) {
if (nullptr == hfpga)
return -1;
if (stop_time != uint64_t(-1)) {
// set stop time
uint64_t cmd_stop = ((stop_time << 3) | CMD_SET_STOP);
CHECK_RES(fpgaWriteMMIO64(hfpga, 0, MMIO_SCOPE_WRITE, cmd_stop));
std::cout << "scope stop time: " << std::dec << stop_time << "s" << std::endl;
}
// start recording
uint64_t cmd_delay = ((start_time << 3) | CMD_SET_START);
CHECK_RES(fpgaWriteMMIO64(hfpga, 0, MMIO_SCOPE_WRITE, cmd_delay));
std::cout << "scope start time: " << std::dec << start_time << "s" << std::endl;
#ifdef HANG_TIMEOUT
g_timeout_thread = std::thread(timeout_callback, hfpga);
g_timeout_thread.detach();
#endif
return 0;
}
int vx_scope_stop(fpga_handle hfpga) {
#ifdef HANG_TIMEOUT
if (!g_timeout_mutex.try_lock())
return 0;
#endif
if (nullptr == hfpga)
return -1;
// forced stop
uint64_t cmd_stop = ((0 << 3) | CMD_SET_STOP);
CHECK_RES(fpgaWriteMMIO64(hfpga, 0, MMIO_SCOPE_WRITE, cmd_stop));
std::cout << "scope trace dump begin..." << std::endl;
std::ofstream ofs("trace.vcd");
ofs << "$version Generated by Vortex Scope $end" << std::endl;
ofs << "$timescale 1 ns $end" << std::endl;
ofs << "$scope module TOP $end" << std::endl;
dump_module(ofs, -1);
dump_taps(ofs, -1);
ofs << "$upscope $end" << std::endl;
ofs << "enddefinitions $end" << std::endl;
uint64_t frame_width, max_frames, data_valid, offset, delta;
uint64_t timestamp = 0;
uint64_t frame_offset = 0;
uint64_t frame_no = 0;
int signal_id = 0;
int signal_offset = 0;
// wait for recording to terminate
CHECK_RES(fpgaWriteMMIO64(hfpga, 0, MMIO_SCOPE_WRITE, CMD_GET_VALID));
do {
CHECK_RES(fpgaReadMMIO64(hfpga, 0, MMIO_SCOPE_READ, &data_valid));
if (data_valid)
break;
std::this_thread::sleep_for(std::chrono::seconds(1));
} while (true);
// get frame width
CHECK_RES(fpgaWriteMMIO64(hfpga, 0, MMIO_SCOPE_WRITE, CMD_GET_WIDTH));
CHECK_RES(fpgaReadMMIO64(hfpga, 0, MMIO_SCOPE_READ, &frame_width));
std::cout << "scope::frame_width=" << std::dec << frame_width << std::endl;
if (fwidth != (int)frame_width) {
std::cerr << "invalid frame_width: expecting " << std::dec << fwidth << "!" << std::endl;
std::abort();
}
// get max frames
CHECK_RES(fpgaWriteMMIO64(hfpga, 0, MMIO_SCOPE_WRITE, CMD_GET_COUNT));
CHECK_RES(fpgaReadMMIO64(hfpga, 0, MMIO_SCOPE_READ, &max_frames));
std::cout << "scope::max_frames=" << std::dec << max_frames << std::endl;
// get offset
CHECK_RES(fpgaWriteMMIO64(hfpga, 0, MMIO_SCOPE_WRITE, CMD_GET_OFFSET));
CHECK_RES(fpgaReadMMIO64(hfpga, 0, MMIO_SCOPE_READ, &offset));
// get data
CHECK_RES(fpgaWriteMMIO64(hfpga, 0, MMIO_SCOPE_WRITE, CMD_GET_DATA));
// print clock header
CHECK_RES(fpgaReadMMIO64(hfpga, 0, MMIO_SCOPE_READ, &delta));
timestamp = print_clock(ofs, offset + delta + 2, timestamp);
signal_id = num_taps;
std::vector<char> signal_data(frame_width+1);
do {
if (frame_no == (max_frames-1)) {
// verify last frame is valid
CHECK_RES(fpgaWriteMMIO64(hfpga, 0, MMIO_SCOPE_WRITE, CMD_GET_VALID));
CHECK_RES(fpgaReadMMIO64(hfpga, 0, MMIO_SCOPE_READ, &data_valid));
assert(data_valid == 1);
CHECK_RES(fpgaWriteMMIO64(hfpga, 0, MMIO_SCOPE_WRITE, CMD_GET_DATA));
}
// read next data words
uint64_t word;
CHECK_RES(fpgaReadMMIO64(hfpga, 0, MMIO_SCOPE_READ, &word));
do {
int signal_width = scope_taps[signal_id-1].width;
int word_offset = frame_offset % 64;
signal_data[signal_width - signal_offset - 1] = ((word >> word_offset) & 0x1) ? '1' : '0';
++signal_offset;
++frame_offset;
if (signal_offset == signal_width) {
signal_data[signal_width] = 0; // string null termination
ofs << 'b' << signal_data.data() << ' ' << signal_id << std::endl;
signal_offset = 0;
--signal_id;
}
if (frame_offset == frame_width) {
assert(0 == signal_offset);
frame_offset = 0;
++frame_no;
if (frame_no != max_frames) {
// print clock header
CHECK_RES(fpgaReadMMIO64(hfpga, 0, MMIO_SCOPE_READ, &delta));
timestamp = print_clock(ofs, delta + 1, timestamp);
signal_id = num_taps;
if (0 == (frame_no % FRAME_FLUSH_SIZE)) {
ofs << std::flush;
std::cout << "*** " << frame_no << "/" << max_frames << " frames" << std::endl;
}
}
}
} while ((frame_offset % 64) != 0);
} while (frame_no != max_frames);
std::cout << "scope trace dump done! - " << (timestamp/2) << " cycles" << std::endl;
// verify data not valid
CHECK_RES(fpgaWriteMMIO64(hfpga, 0, MMIO_SCOPE_WRITE, CMD_GET_VALID));
CHECK_RES(fpgaReadMMIO64(hfpga, 0, MMIO_SCOPE_READ, &data_valid));
assert(data_valid == 0);
return 0;
}


@@ -1,19 +0,0 @@
#pragma once
#include <stdint.h>
#ifdef USE_VLSIM
#include <fpga.h>
#else
#include <opae/fpga.h>
#endif
#if defined(USE_FPGA)
#define HANG_TIMEOUT 60
#else
#define HANG_TIMEOUT (30*60)
#endif
int vx_scope_start(fpga_handle hfpga, uint64_t start_time = 0, uint64_t stop_time = -1);
int vx_scope_stop(fpga_handle hfpga);


@@ -1,356 +0,0 @@
#include "vx_utils.h"
#include <iostream>
#include <fstream>
#include <cstring>
#include <vortex.h>
#include <VX_config.h>
#include <assert.h>
uint64_t aligned_size(uint64_t size, uint64_t alignment) {
assert(0 == (alignment & (alignment - 1)));
return (size + alignment - 1) & ~(alignment - 1);
}
bool is_aligned(uint64_t addr, uint64_t alignment) {
assert(0 == (alignment & (alignment - 1)));
return 0 == (addr & (alignment - 1));
}
extern int vx_upload_kernel_bytes(vx_device_h device, const void* content, uint64_t size) {
int err = 0;
if (NULL == content || 0 == size)
return -1;
uint32_t buffer_transfer_size = 65536; // 64 KB
uint64_t kernel_base_addr;
err = vx_dev_caps(device, VX_CAPS_KERNEL_BASE_ADDR, &kernel_base_addr);
if (err != 0)
return -1;
// allocate device buffer
vx_buffer_h buffer;
err = vx_buf_alloc(device, buffer_transfer_size, &buffer);
if (err != 0)
return -1;
// get buffer address
auto buf_ptr = (uint8_t*)vx_host_ptr(buffer);
//
// upload content
//
uint64_t offset = 0;
while (offset < size) {
auto chunk_size = std::min<uint64_t>(buffer_transfer_size, size - offset);
std::memcpy(buf_ptr, (uint8_t*)content + offset, chunk_size);
/*printf("*** Upload Kernel to 0x%0x: data=", kernel_base_addr + offset);
for (int i = 0, n = ((chunk_size+7)/8); i < n; ++i) {
printf("%08x", ((uint64_t*)((uint8_t*)content + offset))[n-1-i]);
}
printf("\n");*/
err = vx_copy_to_dev(buffer, kernel_base_addr + offset, chunk_size, 0);
if (err != 0) {
vx_buf_free(buffer);
return err;
}
offset += chunk_size;
}
vx_buf_free(buffer);
return 0;
}
extern int vx_upload_kernel_file(vx_device_h device, const char* filename) {
std::ifstream ifs(filename);
if (!ifs) {
std::cout << "error: " << filename << " not found" << std::endl;
return -1;
}
// read file content
ifs.seekg(0, ifs.end);
auto size = ifs.tellg();
auto content = new char [size];
ifs.seekg(0, ifs.beg);
ifs.read(content, size);
// upload
int err = vx_upload_kernel_bytes(device, content, size);
// release buffer
delete[] content;
return err;
}
/*static uint32_t get_csr_32(const uint32_t* buffer, int addr) {
uint32_t value_lo = buffer[addr - CSR_MPM_BASE];
return value_lo;
}*/
static uint64_t get_csr_64(const uint32_t* buffer, int addr) {
uint32_t value_lo = buffer[addr - CSR_MPM_BASE];
uint32_t value_hi = buffer[addr - CSR_MPM_BASE + 32];
return (uint64_t(value_hi) << 32) | value_lo;
}
extern int vx_dump_perf(vx_device_h device, FILE* stream) {
int ret = 0;
uint64_t instrs = 0;
uint64_t cycles = 0;
#ifdef PERF_ENABLE
// PERF: pipeline stalls
uint64_t ibuffer_stalls = 0;
uint64_t scoreboard_stalls = 0;
uint64_t lsu_stalls = 0;
uint64_t fpu_stalls = 0;
uint64_t csr_stalls = 0;
uint64_t alu_stalls = 0;
uint64_t gpu_stalls = 0;
// PERF: decode
uint64_t loads = 0;
uint64_t stores = 0;
uint64_t branches = 0;
// PERF: Icache
uint64_t icache_reads = 0;
uint64_t icache_read_misses = 0;
// PERF: Dcache
uint64_t dcache_reads = 0;
uint64_t dcache_writes = 0;
uint64_t dcache_read_misses = 0;
uint64_t dcache_write_misses = 0;
uint64_t dcache_bank_stalls = 0;
uint64_t dcache_mshr_stalls = 0;
// PERF: shared memory
uint64_t smem_reads = 0;
uint64_t smem_writes = 0;
uint64_t smem_bank_stalls = 0;
// PERF: memory
uint64_t mem_reads = 0;
uint64_t mem_writes = 0;
uint64_t mem_lat = 0;
#ifdef EXT_TEX_ENABLE
// PERF: texunit
uint64_t tex_mem_reads = 0;
uint64_t tex_mem_lat = 0;
#endif
#endif
uint64_t num_cores;
ret = vx_dev_caps(device, VX_CAPS_MAX_CORES, &num_cores);
if (ret != 0)
return ret;
vx_buffer_h staging_buf;
ret = vx_buf_alloc(device, 64 * sizeof(uint32_t), &staging_buf);
if (ret != 0)
return ret;
auto staging_ptr = (uint32_t*)vx_host_ptr(staging_buf);
for (unsigned core_id = 0; core_id < num_cores; ++core_id) {
ret = vx_copy_from_dev(staging_buf, IO_CSR_ADDR + 64 * sizeof(uint32_t) * core_id, 64 * sizeof(uint32_t), 0);
if (ret != 0) {
vx_buf_free(staging_buf);
return ret;
}
uint64_t instrs_per_core = get_csr_64(staging_ptr, CSR_MINSTRET);
uint64_t cycles_per_core = get_csr_64(staging_ptr, CSR_MCYCLE);
float IPC = (float)(double(instrs_per_core) / double(cycles_per_core));
if (num_cores > 1) fprintf(stream, "PERF: core%d: instrs=%ld, cycles=%ld, IPC=%f\n", core_id, instrs_per_core, cycles_per_core, IPC);
instrs += instrs_per_core;
cycles = std::max<uint64_t>(cycles_per_core, cycles);
#ifdef PERF_ENABLE
// PERF: pipeline
// ibuffer_stall
uint64_t ibuffer_stalls_per_core = get_csr_64(staging_ptr, CSR_MPM_IBUF_ST);
if (num_cores > 1) fprintf(stream, "PERF: core%d: ibuffer stalls=%ld\n", core_id, ibuffer_stalls_per_core);
ibuffer_stalls += ibuffer_stalls_per_core;
// scoreboard_stall
uint64_t scoreboard_stalls_per_core = get_csr_64(staging_ptr, CSR_MPM_SCRB_ST);
if (num_cores > 1) fprintf(stream, "PERF: core%d: scoreboard stalls=%ld\n", core_id, scoreboard_stalls_per_core);
scoreboard_stalls += scoreboard_stalls_per_core;
// alu_stall
uint64_t alu_stalls_per_core = get_csr_64(staging_ptr, CSR_MPM_ALU_ST);
if (num_cores > 1) fprintf(stream, "PERF: core%d: alu unit stalls=%ld\n", core_id, alu_stalls_per_core);
alu_stalls += alu_stalls_per_core;
// lsu_stall
uint64_t lsu_stalls_per_core = get_csr_64(staging_ptr, CSR_MPM_LSU_ST);
if (num_cores > 1) fprintf(stream, "PERF: core%d: lsu unit stalls=%ld\n", core_id, lsu_stalls_per_core);
lsu_stalls += lsu_stalls_per_core;
// csr_stall
uint64_t csr_stalls_per_core = get_csr_64(staging_ptr, CSR_MPM_CSR_ST);
if (num_cores > 1) fprintf(stream, "PERF: core%d: csr unit stalls=%ld\n", core_id, csr_stalls_per_core);
csr_stalls += csr_stalls_per_core;
// fpu_stall
uint64_t fpu_stalls_per_core = get_csr_64(staging_ptr, CSR_MPM_FPU_ST);
if (num_cores > 1) fprintf(stream, "PERF: core%d: fpu unit stalls=%ld\n", core_id, fpu_stalls_per_core);
fpu_stalls += fpu_stalls_per_core;
// gpu_stall
uint64_t gpu_stalls_per_core = get_csr_64(staging_ptr, CSR_MPM_GPU_ST);
if (num_cores > 1) fprintf(stream, "PERF: core%d: gpu unit stalls=%ld\n", core_id, gpu_stalls_per_core);
gpu_stalls += gpu_stalls_per_core;
// PERF: decode
// loads
uint64_t loads_per_core = get_csr_64(staging_ptr, CSR_MPM_LOADS);
if (num_cores > 1) fprintf(stream, "PERF: core%d: loads=%ld\n", core_id, loads_per_core);
loads += loads_per_core;
// stores
uint64_t stores_per_core = get_csr_64(staging_ptr, CSR_MPM_STORES);
if (num_cores > 1) fprintf(stream, "PERF: core%d: stores=%ld\n", core_id, stores_per_core);
stores += stores_per_core;
// branches
uint64_t branches_per_core = get_csr_64(staging_ptr, CSR_MPM_BRANCHES);
if (num_cores > 1) fprintf(stream, "PERF: core%d: branches=%ld\n", core_id, branches_per_core);
branches += branches_per_core;
// PERF: Icache
// total reads
uint64_t icache_reads_per_core = get_csr_64(staging_ptr, CSR_MPM_ICACHE_READS);
if (num_cores > 1) fprintf(stream, "PERF: core%d: icache reads=%ld\n", core_id, icache_reads_per_core);
icache_reads += icache_reads_per_core;
// read misses
uint64_t icache_miss_r_per_core = get_csr_64(staging_ptr, CSR_MPM_ICACHE_MISS_R);
int icache_read_hit_ratio = (int)((1.0 - (double(icache_miss_r_per_core) / double(icache_reads_per_core))) * 100);
if (num_cores > 1) fprintf(stream, "PERF: core%d: icache misses=%ld (hit ratio=%d%%)\n", core_id, icache_miss_r_per_core, icache_read_hit_ratio);
icache_read_misses += icache_miss_r_per_core;
// PERF: Dcache
// total reads
uint64_t dcache_reads_per_core = get_csr_64(staging_ptr, CSR_MPM_DCACHE_READS);
if (num_cores > 1) fprintf(stream, "PERF: core%d: dcache reads=%ld\n", core_id, dcache_reads_per_core);
dcache_reads += dcache_reads_per_core;
// total writes
uint64_t dcache_writes_per_core = get_csr_64(staging_ptr, CSR_MPM_DCACHE_WRITES);
if (num_cores > 1) fprintf(stream, "PERF: core%d: dcache writes=%ld\n", core_id, dcache_writes_per_core);
dcache_writes += dcache_writes_per_core;
// read misses
uint64_t dcache_miss_r_per_core = get_csr_64(staging_ptr, CSR_MPM_DCACHE_MISS_R);
int dcache_read_hit_ratio = (int)((1.0 - (double(dcache_miss_r_per_core) / double(dcache_reads_per_core))) * 100);
if (num_cores > 1) fprintf(stream, "PERF: core%d: dcache read misses=%ld (hit ratio=%d%%)\n", core_id, dcache_miss_r_per_core, dcache_read_hit_ratio);
dcache_read_misses += dcache_miss_r_per_core;
// write misses
uint64_t dcache_miss_w_per_core = get_csr_64(staging_ptr, CSR_MPM_DCACHE_MISS_W);
int dcache_write_hit_ratio = (int)((1.0 - (double(dcache_miss_w_per_core) / double(dcache_writes_per_core))) * 100);
if (num_cores > 1) fprintf(stream, "PERF: core%d: dcache write misses=%ld (hit ratio=%d%%)\n", core_id, dcache_miss_w_per_core, dcache_write_hit_ratio);
dcache_write_misses += dcache_miss_w_per_core;
// bank_stalls
uint64_t dcache_bank_st_per_core = get_csr_64(staging_ptr, CSR_MPM_DCACHE_BANK_ST);
int dcache_bank_utilization = (int)((double(dcache_reads_per_core + dcache_writes_per_core) / double(dcache_reads_per_core + dcache_writes_per_core + dcache_bank_st_per_core)) * 100);
if (num_cores > 1) fprintf(stream, "PERF: core%d: dcache bank stalls=%ld (utilization=%d%%)\n", core_id, dcache_bank_st_per_core, dcache_bank_utilization);
dcache_bank_stalls += dcache_bank_st_per_core;
// mshr_stalls
uint64_t dcache_mshr_st_per_core = get_csr_64(staging_ptr, CSR_MPM_DCACHE_MSHR_ST);
if (num_cores > 1) fprintf(stream, "PERF: core%d: dcache mshr stalls=%ld\n", core_id, dcache_mshr_st_per_core);
dcache_mshr_stalls += dcache_mshr_st_per_core;
// PERF: SMEM
// total reads
uint64_t smem_reads_per_core = get_csr_64(staging_ptr, CSR_MPM_SMEM_READS);
if (num_cores > 1) fprintf(stream, "PERF: core%d: smem reads=%ld\n", core_id, smem_reads_per_core);
smem_reads += smem_reads_per_core;
// total writes
uint64_t smem_writes_per_core = get_csr_64(staging_ptr, CSR_MPM_SMEM_WRITES);
if (num_cores > 1) fprintf(stream, "PERF: core%d: smem writes=%ld\n", core_id, smem_writes_per_core);
smem_writes += smem_writes_per_core;
// bank_stalls
uint64_t smem_bank_st_per_core = get_csr_64(staging_ptr, CSR_MPM_SMEM_BANK_ST);
int smem_bank_utilization = (int)((double(smem_reads_per_core + smem_writes_per_core) / double(smem_reads_per_core + smem_writes_per_core + smem_bank_st_per_core)) * 100);
if (num_cores > 1) fprintf(stream, "PERF: core%d: smem bank stalls=%ld (utilization=%d%%)\n", core_id, smem_bank_st_per_core, smem_bank_utilization);
smem_bank_stalls += smem_bank_st_per_core;
// PERF: memory
uint64_t mem_reads_per_core = get_csr_64(staging_ptr, CSR_MPM_MEM_READS);
uint64_t mem_writes_per_core = get_csr_64(staging_ptr, CSR_MPM_MEM_WRITES);
uint64_t mem_lat_per_core = get_csr_64(staging_ptr, CSR_MPM_MEM_LAT);
int mem_avg_lat = (int)(double(mem_lat_per_core) / double(mem_reads_per_core));
if (num_cores > 1) fprintf(stream, "PERF: core%d: memory requests=%ld (reads=%ld, writes=%ld)\n", core_id, (mem_reads_per_core + mem_writes_per_core), mem_reads_per_core, mem_writes_per_core);
if (num_cores > 1) fprintf(stream, "PERF: core%d: memory latency=%d cycles\n", core_id, mem_avg_lat);
mem_reads += mem_reads_per_core;
mem_writes += mem_writes_per_core;
mem_lat += mem_lat_per_core;
#ifdef EXT_TEX_ENABLE
// total reads
uint64_t tex_reads_per_core = get_csr_64(staging_ptr, CSR_MPM_TEX_READS);
if (num_cores > 1) fprintf(stream, "PERF: core%d: tex memory reads=%ld\n", core_id, tex_reads_per_core);
tex_mem_reads += tex_reads_per_core;
// read latency
uint64_t tex_lat_per_core = get_csr_64(staging_ptr, CSR_MPM_TEX_LAT);
int tex_avg_lat = (int)(double(tex_lat_per_core) / double(tex_reads_per_core));
if (num_cores > 1) fprintf(stream, "PERF: core%d: tex memory latency=%d cycles\n", core_id, tex_avg_lat);
tex_mem_lat += tex_lat_per_core;
#endif
#endif
}
float IPC = (float)(double(instrs) / double(cycles));
fprintf(stream, "PERF: instrs=%ld, cycles=%ld, IPC=%f\n", instrs, cycles, IPC);
#ifdef PERF_ENABLE
int icache_read_hit_ratio = (int)((1.0 - (double(icache_read_misses) / double(icache_reads))) * 100);
int dcache_read_hit_ratio = (int)((1.0 - (double(dcache_read_misses) / double(dcache_reads))) * 100);
int dcache_write_hit_ratio = (int)((1.0 - (double(dcache_write_misses) / double(dcache_writes))) * 100);
int dcache_bank_utilization = (int)((double(dcache_reads + dcache_writes) / double(dcache_reads + dcache_writes + dcache_bank_stalls)) * 100);
int smem_bank_utilization = (int)((double(smem_reads + smem_writes) / double(smem_reads + smem_writes + smem_bank_stalls)) * 100);
int mem_avg_lat = (int)(double(mem_lat) / double(mem_reads));
fprintf(stream, "PERF: ibuffer stalls=%ld\n", ibuffer_stalls);
fprintf(stream, "PERF: scoreboard stalls=%ld\n", scoreboard_stalls);
fprintf(stream, "PERF: alu unit stalls=%ld\n", alu_stalls);
fprintf(stream, "PERF: lsu unit stalls=%ld\n", lsu_stalls);
fprintf(stream, "PERF: csr unit stalls=%ld\n", csr_stalls);
fprintf(stream, "PERF: fpu unit stalls=%ld\n", fpu_stalls);
fprintf(stream, "PERF: gpu unit stalls=%ld\n", gpu_stalls);
fprintf(stream, "PERF: loads=%ld\n", loads);
fprintf(stream, "PERF: stores=%ld\n", stores);
fprintf(stream, "PERF: branches=%ld\n", branches);
fprintf(stream, "PERF: icache reads=%ld\n", icache_reads);
fprintf(stream, "PERF: icache read misses=%ld (hit ratio=%d%%)\n", icache_read_misses, icache_read_hit_ratio);
fprintf(stream, "PERF: dcache reads=%ld\n", dcache_reads);
fprintf(stream, "PERF: dcache writes=%ld\n", dcache_writes);
fprintf(stream, "PERF: dcache read misses=%ld (hit ratio=%d%%)\n", dcache_read_misses, dcache_read_hit_ratio);
fprintf(stream, "PERF: dcache write misses=%ld (hit ratio=%d%%)\n", dcache_write_misses, dcache_write_hit_ratio);
fprintf(stream, "PERF: dcache bank stalls=%ld (utilization=%d%%)\n", dcache_bank_stalls, dcache_bank_utilization);
fprintf(stream, "PERF: dcache mshr stalls=%ld\n", dcache_mshr_stalls);
fprintf(stream, "PERF: smem reads=%ld\n", smem_reads);
fprintf(stream, "PERF: smem writes=%ld\n", smem_writes);
fprintf(stream, "PERF: smem bank stalls=%ld (utilization=%d%%)\n", smem_bank_stalls, smem_bank_utilization);
fprintf(stream, "PERF: memory requests=%ld (reads=%ld, writes=%ld)\n", (mem_reads + mem_writes), mem_reads, mem_writes);
fprintf(stream, "PERF: memory average latency=%d cycles\n", mem_avg_lat);
#ifdef EXT_TEX_ENABLE
int tex_avg_lat = (int)(double(tex_mem_lat) / double(tex_mem_reads));
fprintf(stream, "PERF: tex memory reads=%ld\n", tex_mem_reads);
fprintf(stream, "PERF: tex memory latency=%d cycles\n", tex_avg_lat);
#endif
#endif
// release allocated resources
vx_buf_free(staging_buf);
return ret;
}
// Deprecated API functions
extern int vx_alloc_shared_mem(vx_device_h hdevice, uint64_t size, vx_buffer_h* hbuffer) {
return vx_buf_alloc(hdevice, size, hbuffer);
}
extern int vx_buf_release(vx_buffer_h hbuffer) {
return vx_buf_free(hbuffer);
}
extern int vx_alloc_dev_mem(vx_device_h hdevice, uint64_t size, uint64_t* dev_maddr) {
return vx_mem_alloc(hdevice, size, dev_maddr);
}


@@ -1,11 +0,0 @@
#pragma once
#include <cstdint>
uint64_t aligned_size(uint64_t size, uint64_t alignment);
bool is_aligned(uint64_t addr, uint64_t alignment);
#define CACHE_BLOCK_SIZE 64
#define ALLOC_BASE_ADDR 0x00000000
#define LOCAL_MEM_SIZE 4294967296 // 4 GB


@@ -1,75 +0,0 @@
OPAE_HOME ?= /tools/opae/1.4.0
RTL_DIR=../../hw/rtl
SCRIPT_DIR=../../hw/scripts
OPAE_SYN_DIR=../../hw/syn/opae
CXXFLAGS += -std=c++11 -Wall -Wextra -pedantic -Wfatal-errors
CXXFLAGS += -I. -I../include -I../../hw -I$(OPAE_HOME)/include -I$(OPAE_SYN_DIR)
LDFLAGS += -L$(OPAE_HOME)/lib -luuid -lopae-c
#SCOPE=1
# stack execution protection
LDFLAGS +=-z noexecstack
# data relocation and protection
LDFLAGS +=-z relro -z now
# stack buffer overrun detection
CXXFLAGS +=-fstack-protector
# Position independent code
CXXFLAGS += -fPIC
# Add external configuration
CXXFLAGS += $(CONFIGS)
# Dump perf stats
CXXFLAGS += -DDUMP_PERF_STATS
LDFLAGS += -shared
PROJECT = libvortex.so
SRCS = ../common/opae.cpp ../common/vx_utils.cpp
# Debugging
ifdef DEBUG
CXXFLAGS += -g -O0
else
CXXFLAGS += -O2 -DNDEBUG
endif
# Enable scope analyzer
ifdef SCOPE
CXXFLAGS += -DSCOPE
SRCS += ../common/vx_scope.cpp
SCOPE_H = scope-defs.h
endif
# Enable perf counters
ifdef PERF
CXXFLAGS += -DPERF_ENABLE
endif
all: $(PROJECT)
$(OPAE_SYN_DIR)/vortex_afu.h:
$(MAKE) -C $(OPAE_SYN_DIR) vortex_afu.h
scope-defs.h: $(SCRIPT_DIR)/scope.json
$(SCRIPT_DIR)/scope.py $(CONFIGS) -cc scope-defs.h -vl $(RTL_DIR)/scope-defs.vh $(SCRIPT_DIR)/scope.json
# generate scope data
scope: scope-defs.h
$(PROJECT): $(SRCS) $(OPAE_SYN_DIR)/vortex_afu.h $(SCOPE_H)
$(CXX) $(CXXFLAGS) -DUSE_FPGA $^ $(LDFLAGS) -o $(PROJECT)
clean:
rm -rf $(PROJECT) *.o scope-defs.h


@@ -1,84 +0,0 @@
#ifndef __VX_DRIVER_H__
#define __VX_DRIVER_H__
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#ifdef __cplusplus
extern "C" {
#endif
typedef void* vx_device_h;
typedef void* vx_buffer_h;
// device caps ids
#define VX_CAPS_VERSION 0x0
#define VX_CAPS_MAX_CORES 0x1
#define VX_CAPS_MAX_WARPS 0x2
#define VX_CAPS_MAX_THREADS 0x3
#define VX_CAPS_CACHE_LINE_SIZE 0x4
#define VX_CAPS_LOCAL_MEM_SIZE 0x5
#define VX_CAPS_ALLOC_BASE_ADDR 0x6
#define VX_CAPS_KERNEL_BASE_ADDR 0x7
#define MAX_TIMEOUT (60*60*1000) // 1hr
// open the device and connect to it
int vx_dev_open(vx_device_h* hdevice);
// Close the device when all the operations are done
int vx_dev_close(vx_device_h hdevice);
// return device configurations
int vx_dev_caps(vx_device_h hdevice, uint32_t caps_id, uint64_t *value);
// Allocate shared buffer with device
int vx_buf_alloc(vx_device_h hdevice, uint64_t size, vx_buffer_h* hbuffer);
// release buffer
int vx_buf_free(vx_buffer_h hbuffer);
// Get host pointer address
void* vx_host_ptr(vx_buffer_h hbuffer);
// allocate device memory and return address
int vx_mem_alloc(vx_device_h hdevice, uint64_t size, uint64_t* dev_maddr);
// release device memory
int vx_mem_free(vx_device_h hdevice, uint64_t dev_maddr);
// Copy bytes from buffer to device local memory
int vx_copy_to_dev(vx_buffer_h hbuffer, uint64_t dev_maddr, uint64_t size, uint64_t src_offset);
// Copy bytes from device local memory to buffer
int vx_copy_from_dev(vx_buffer_h hbuffer, uint64_t dev_maddr, uint64_t size, uint64_t dst_offset);
// Start device execution
int vx_start(vx_device_h hdevice);
// Wait for device ready with milliseconds timeout
int vx_ready_wait(vx_device_h hdevice, uint64_t timeout);
////////////////////////////// UTILITY FUNCTIONS //////////////////////////////
// upload kernel bytes to device
int vx_upload_kernel_bytes(vx_device_h device, const void* content, uint64_t size);
// upload kernel file to device
int vx_upload_kernel_file(vx_device_h device, const char* filename);
// dump performance counters
int vx_dump_perf(vx_device_h device, FILE* stream);
//////////////////////////// DEPRECATED FUNCTIONS /////////////////////////////
int vx_alloc_dev_mem(vx_device_h hdevice, uint64_t size, uint64_t* dev_maddr);
int vx_alloc_shared_mem(vx_device_h hdevice, uint64_t size, vx_buffer_h* hbuffer);
int vx_buf_release(vx_buffer_h hbuffer);
#ifdef __cplusplus
}
#endif
#endif // __VX_DRIVER_H__


@@ -1,2 +0,0 @@
obj_dir
*.so


@@ -1,43 +0,0 @@
RTLSIM_DIR = ../../sim/rtlsim
CXXFLAGS += -std=c++11 -Wall -Wextra -pedantic -Wfatal-errors
CXXFLAGS += -I../include -I../common -I../../hw -I$(RTLSIM_DIR) -I$(RTLSIM_DIR)/../common
# Position independent code
CXXFLAGS += -fPIC
# Add external configuration
CXXFLAGS += $(CONFIGS)
# Dump perf stats
CXXFLAGS += -DDUMP_PERF_STATS
LDFLAGS += -shared -pthread
LDFLAGS += -L. -lrtlsim
SRCS = vortex.cpp ../common/vx_utils.cpp
# Debugging
ifdef DEBUG
CXXFLAGS += -g -O0
else
CXXFLAGS += -O2 -DNDEBUG
endif
# Enable perf counters
ifdef PERF
CXXFLAGS += -DPERF_ENABLE
endif
PROJECT = libvortex.so
all: $(PROJECT)
$(PROJECT): $(SRCS)
DESTDIR=../../driver/rtlsim $(MAKE) -C $(RTLSIM_DIR) ../../driver/rtlsim/librtlsim.so
$(CXX) $(CXXFLAGS) $(SRCS) $(LDFLAGS) -o $(PROJECT)
clean:
DESTDIR=../../driver/rtlsim $(MAKE) -C $(RTLSIM_DIR) clean
rm -rf $(PROJECT) *.o


@@ -1,355 +0,0 @@
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <iostream>
#include <future>
#include <list>
#include <chrono>
#include <vortex.h>
#include <vx_malloc.h>
#include <vx_utils.h>
#include <VX_config.h>
#include <mem.h>
#include <util.h>
#include <processor.h>
#define RAM_PAGE_SIZE 4096
using namespace vortex;
///////////////////////////////////////////////////////////////////////////////
class vx_device;
class vx_buffer {
public:
vx_buffer(uint64_t size, vx_device* device)
: size_(size)
, device_(device) {
auto aligned_asize = aligned_size(size, CACHE_BLOCK_SIZE);
data_ = malloc(aligned_asize);
}
~vx_buffer() {
if (data_) {
free(data_);
}
}
void* data() const {
return data_;
}
uint64_t size() const {
return size_;
}
vx_device* device() const {
return device_;
}
private:
uint64_t size_;
vx_device* device_;
void* data_;
};
///////////////////////////////////////////////////////////////////////////////
class vx_device {
public:
vx_device()
: ram_(RAM_PAGE_SIZE)
, mem_allocator_(
ALLOC_BASE_ADDR,
ALLOC_BASE_ADDR + LOCAL_MEM_SIZE,
RAM_PAGE_SIZE,
CACHE_BLOCK_SIZE)
{
processor_.attach_ram(&ram_);
}
~vx_device() {
if (future_.valid()) {
future_.wait();
}
}
int alloc_local_mem(uint64_t size, uint64_t* dev_maddr) {
return mem_allocator_.allocate(size, dev_maddr);
}
int free_local_mem(uint64_t dev_maddr) {
return mem_allocator_.release(dev_maddr);
}
int upload(const void* src, uint64_t dest_addr, uint64_t size, uint64_t src_offset) {
uint64_t asize = aligned_size(size, CACHE_BLOCK_SIZE);
if (dest_addr + asize > LOCAL_MEM_SIZE)
return -1;
/*printf("VXDRV: upload %ld bytes from 0x%lx:", size, uintptr_t((uint8_t*)src + src_offset));
for (int i = 0; i < (asize / CACHE_BLOCK_SIZE); ++i) {
printf("\n0x%08lx=", dest_addr + i * CACHE_BLOCK_SIZE);
for (int j = 0; j < CACHE_BLOCK_SIZE; ++j) {
printf("%02x", *((uint8_t*)src + src_offset + i * CACHE_BLOCK_SIZE + CACHE_BLOCK_SIZE - 1 - j));
}
}
printf("\n");*/
ram_.write((const uint8_t*)src + src_offset, dest_addr, asize);
return 0;
}
int download(void* dest, uint64_t src_addr, uint64_t size, uint64_t dest_offset) {
uint64_t asize = aligned_size(size, CACHE_BLOCK_SIZE);
if (src_addr + asize > LOCAL_MEM_SIZE)
return -1;
ram_.read((uint8_t*)dest + dest_offset, src_addr, asize);
/*printf("VXDRV: download %ld bytes to 0x%lx:", size, uintptr_t((uint8_t*)dest + dest_offset));
for (int i = 0; i < (asize / CACHE_BLOCK_SIZE); ++i) {
printf("\n0x%08lx=", src_addr + i * CACHE_BLOCK_SIZE);
for (int j = 0; j < CACHE_BLOCK_SIZE; ++j) {
printf("%02x", *((uint8_t*)dest + dest_offset + i * CACHE_BLOCK_SIZE + CACHE_BLOCK_SIZE - 1 - j));
}
}
printf("\n");*/
return 0;
}
int start() {
// ensure prior run completed
if (future_.valid()) {
future_.wait();
}
// start new run
future_ = std::async(std::launch::async, [&]{
processor_.run();
});
return 0;
}
int wait(uint64_t timeout) {
if (!future_.valid())
return 0;
uint64_t timeout_sec = timeout / 1000;
std::chrono::seconds wait_time(1);
for (;;) {
// wait for 1 sec and check status
auto status = future_.wait_for(wait_time);
if (status == std::future_status::ready
|| 0 == timeout_sec--)
break;
}
return 0;
}
private:
RAM ram_;
Processor processor_;
MemoryAllocator mem_allocator_;
std::future<void> future_;
};
///////////////////////////////////////////////////////////////////////////////
#ifdef DUMP_PERF_STATS
class AutoPerfDump {
private:
std::list<vx_device_h> devices_;
public:
AutoPerfDump() {}
~AutoPerfDump() {
for (auto device : devices_) {
vx_dump_perf(device, stdout);
}
}
void add_device(vx_device_h device) {
devices_.push_back(device);
}
void remove_device(vx_device_h device) {
devices_.remove(device);
}
};
AutoPerfDump gAutoPerfDump;
#endif
///////////////////////////////////////////////////////////////////////////////
extern int vx_dev_caps(vx_device_h hdevice, uint32_t caps_id, uint64_t *value) {
if (nullptr == hdevice)
return -1;
switch (caps_id) {
case VX_CAPS_VERSION:
*value = IMPLEMENTATION_ID;
break;
case VX_CAPS_MAX_CORES:
*value = NUM_CORES * NUM_CLUSTERS;
break;
case VX_CAPS_MAX_WARPS:
*value = NUM_WARPS;
break;
case VX_CAPS_MAX_THREADS:
*value = NUM_THREADS;
break;
case VX_CAPS_CACHE_LINE_SIZE:
*value = CACHE_BLOCK_SIZE;
break;
case VX_CAPS_LOCAL_MEM_SIZE:
*value = LOCAL_MEM_SIZE;
break;
case VX_CAPS_ALLOC_BASE_ADDR:
*value = ALLOC_BASE_ADDR;
break;
case VX_CAPS_KERNEL_BASE_ADDR:
*value = STARTUP_ADDR;
break;
default:
std::cout << "invalid caps id: " << caps_id << std::endl;
std::abort();
return -1;
}
return 0;
}
extern int vx_dev_open(vx_device_h* hdevice) {
if (nullptr == hdevice)
return -1;
*hdevice = new vx_device();
#ifdef DUMP_PERF_STATS
gAutoPerfDump.add_device(*hdevice);
#endif
return 0;
}
extern int vx_dev_close(vx_device_h hdevice) {
if (nullptr == hdevice)
return -1;
vx_device *device = ((vx_device*)hdevice);
#ifdef DUMP_PERF_STATS
gAutoPerfDump.remove_device(hdevice);
vx_dump_perf(hdevice, stdout);
#endif
delete device;
return 0;
}
extern int vx_mem_alloc(vx_device_h hdevice, uint64_t size, uint64_t* dev_maddr) {
if (nullptr == hdevice
|| nullptr == dev_maddr
|| 0 >= size)
return -1;
vx_device *device = ((vx_device*)hdevice);
return device->alloc_local_mem(size, dev_maddr);
}
extern int vx_mem_free(vx_device_h hdevice, uint64_t dev_maddr) {
if (nullptr == hdevice)
return -1;
vx_device *device = ((vx_device*)hdevice);
return device->free_local_mem(dev_maddr);
}
extern int vx_buf_alloc(vx_device_h hdevice, uint64_t size, vx_buffer_h* hbuffer) {
if (nullptr == hdevice
|| 0 >= size
|| nullptr == hbuffer)
return -1;
vx_device *device = ((vx_device*)hdevice);
auto buffer = new vx_buffer(size, device);
if (nullptr == buffer->data()) {
delete buffer;
return -1;
}
*hbuffer = buffer;
return 0;
}
extern void* vx_host_ptr(vx_buffer_h hbuffer) {
if (nullptr == hbuffer)
return nullptr;
vx_buffer* buffer = ((vx_buffer*)hbuffer);
return buffer->data();
}
extern int vx_buf_free(vx_buffer_h hbuffer) {
if (nullptr == hbuffer)
return -1;
vx_buffer* buffer = ((vx_buffer*)hbuffer);
delete buffer;
return 0;
}
extern int vx_copy_to_dev(vx_buffer_h hbuffer, uint64_t dev_maddr, uint64_t size, uint64_t src_offset) {
if (nullptr == hbuffer
|| 0 >= size)
return -1;
auto buffer = (vx_buffer*)hbuffer;
if (size + src_offset > buffer->size())
return -1;
return buffer->device()->upload(buffer->data(), dev_maddr, size, src_offset);
}
extern int vx_copy_from_dev(vx_buffer_h hbuffer, uint64_t dev_maddr, uint64_t size, uint64_t dest_offset) {
if (nullptr == hbuffer
|| 0 >= size)
return -1;
auto buffer = (vx_buffer*)hbuffer;
if (size + dest_offset > buffer->size())
return -1;
return buffer->device()->download(buffer->data(), dev_maddr, size, dest_offset);
}
extern int vx_start(vx_device_h hdevice) {
if (nullptr == hdevice)
return -1;
vx_device *device = ((vx_device*)hdevice);
return device->start();
}
extern int vx_ready_wait(vx_device_h hdevice, uint64_t timeout) {
if (nullptr == hdevice)
return -1;
vx_device *device = ((vx_device*)hdevice);
return device->wait(timeout);
}


@@ -1,32 +0,0 @@
SIMX_DIR = ../../sim/simx
CXXFLAGS += -std=c++11 -Wall -Wextra -Wfatal-errors
CXXFLAGS += -fPIC -Wno-maybe-uninitialized
CXXFLAGS += -I../include -I../common -I../../hw -I$(SIMX_DIR) -I$(SIMX_DIR)/../common
CXXFLAGS += $(CONFIGS)
CXXFLAGS += -DDUMP_PERF_STATS
LDFLAGS += -shared -pthread
LDFLAGS += -L. -lsimx
SRCS = vortex.cpp ../common/vx_utils.cpp
# Debugging
ifdef DEBUG
CXXFLAGS += -g -O0
else
CXXFLAGS += -O2 -DNDEBUG
endif
PROJECT = libvortex.so
all: $(PROJECT)
$(PROJECT): $(SRCS)
DESTDIR=../../driver/simx $(MAKE) -C $(SIMX_DIR) ../../driver/simx/libsimx.so
$(CXX) $(CXXFLAGS) $^ $(LDFLAGS) -o $@
clean:
DESTDIR=../../driver/simx $(MAKE) -C $(SIMX_DIR) clean
rm -rf libsimx.so $(PROJECT) *.o


@@ -1,357 +0,0 @@
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <iostream>
#include <future>
#include <chrono>
#include <list>
#include <vortex.h>
#include <vx_utils.h>
#include <vx_malloc.h>
#include <VX_config.h>
#include <util.h>
#include <processor.h>
#include <archdef.h>
#include <mem.h>
#include <constants.h>
using namespace vortex;
///////////////////////////////////////////////////////////////////////////////
class vx_device;
class vx_buffer {
public:
vx_buffer(uint64_t size, vx_device* device)
: size_(size)
, device_(device) {
uint64_t aligned_asize = aligned_size(size, CACHE_BLOCK_SIZE);
data_ = malloc(aligned_asize);
}
~vx_buffer() {
if (data_) {
free(data_);
}
}
void* data() const {
return data_;
}
uint64_t size() const {
return size_;
}
vx_device* device() const {
return device_;
}
private:
uint64_t size_;
vx_device* device_;
void* data_;
};
///////////////////////////////////////////////////////////////////////////////
class vx_device {
public:
vx_device()
: arch_(NUM_CORES * NUM_CLUSTERS, NUM_WARPS, NUM_THREADS)
, ram_(RAM_PAGE_SIZE)
, processor_(arch_)
, mem_allocator_(
ALLOC_BASE_ADDR,
ALLOC_BASE_ADDR + LOCAL_MEM_SIZE,
RAM_PAGE_SIZE,
CACHE_BLOCK_SIZE)
{
// attach memory module
processor_.attach_ram(&ram_);
}
~vx_device() {
if (future_.valid()) {
future_.wait();
}
}
int alloc_local_mem(uint64_t size, uint64_t* dev_maddr) {
return mem_allocator_.allocate(size, dev_maddr);
}
int free_local_mem(uint64_t dev_maddr) {
return mem_allocator_.release(dev_maddr);
}
int upload(const void* src, uint64_t dest_addr, uint64_t size, uint64_t src_offset) {
uint64_t asize = aligned_size(size, CACHE_BLOCK_SIZE);
if (dest_addr + asize > LOCAL_MEM_SIZE)
return -1;
ram_.write((const uint8_t*)src + src_offset, dest_addr, asize);
/*printf("VXDRV: upload %d bytes to 0x%x\n", size, dest_addr);
for (int i = 0; i < size; i += 4) {
printf("mem-write: 0x%x <- 0x%x\n", dest_addr + i, *(uint32_t*)((uint8_t*)src + src_offset + i));
}*/
return 0;
}
int download(void* dest, uint64_t src_addr, uint64_t size, uint64_t dest_offset) {
uint64_t asize = aligned_size(size, CACHE_BLOCK_SIZE);
if (src_addr + asize > LOCAL_MEM_SIZE)
return -1;
ram_.read((uint8_t*)dest + dest_offset, src_addr, asize);
/*printf("VXDRV: download %d bytes from 0x%x\n", size, src_addr);
for (int i = 0; i < size; i += 4) {
printf("mem-read: 0x%x -> 0x%x\n", src_addr + i, *(uint32_t*)((uint8_t*)dest + dest_offset + i));
}*/
return 0;
}
int start() {
// ensure prior run completed
if (future_.valid()) {
future_.wait();
}
// start new run
future_ = std::async(std::launch::async, [&]{
processor_.run();
});
return 0;
}
int wait(uint64_t timeout) {
if (!future_.valid())
return 0;
uint64_t timeout_sec = timeout / 1000;
std::chrono::seconds wait_time(1);
for (;;) {
// wait for 1 sec and check status
auto status = future_.wait_for(wait_time);
if (status == std::future_status::ready
|| 0 == timeout_sec--)
break;
}
return 0;
}
private:
ArchDef arch_;
RAM ram_;
Processor processor_;
MemoryAllocator mem_allocator_;
std::future<void> future_;
};
///////////////////////////////////////////////////////////////////////////////
#ifdef DUMP_PERF_STATS
class AutoPerfDump {
private:
std::list<vx_device_h> devices_;
public:
AutoPerfDump() {}
~AutoPerfDump() {
for (auto device : devices_) {
vx_dump_perf(device, stdout);
}
}
void add_device(vx_device_h device) {
devices_.push_back(device);
}
void remove_device(vx_device_h device) {
devices_.remove(device);
}
};
AutoPerfDump gAutoPerfDump;
#endif
///////////////////////////////////////////////////////////////////////////////
extern int vx_dev_open(vx_device_h* hdevice) {
if (nullptr == hdevice)
return -1;
*hdevice = new vx_device();
#ifdef DUMP_PERF_STATS
gAutoPerfDump.add_device(*hdevice);
#endif
return 0;
}
extern int vx_dev_close(vx_device_h hdevice) {
if (nullptr == hdevice)
return -1;
vx_device *device = ((vx_device*)hdevice);
#ifdef DUMP_PERF_STATS
gAutoPerfDump.remove_device(hdevice);
vx_dump_perf(hdevice, stdout);
#endif
delete device;
return 0;
}
extern int vx_dev_caps(vx_device_h hdevice, uint32_t caps_id, uint64_t *value) {
if (nullptr == hdevice)
return -1;
switch (caps_id) {
case VX_CAPS_VERSION:
*value = IMPLEMENTATION_ID;
break;
case VX_CAPS_MAX_CORES:
*value = NUM_CORES * NUM_CLUSTERS;
break;
case VX_CAPS_MAX_WARPS:
*value = NUM_WARPS;
break;
case VX_CAPS_MAX_THREADS:
*value = NUM_THREADS;
break;
case VX_CAPS_CACHE_LINE_SIZE:
*value = CACHE_BLOCK_SIZE;
break;
case VX_CAPS_LOCAL_MEM_SIZE:
*value = LOCAL_MEM_SIZE;
break;
case VX_CAPS_ALLOC_BASE_ADDR:
*value = ALLOC_BASE_ADDR;
break;
case VX_CAPS_KERNEL_BASE_ADDR:
*value = STARTUP_ADDR;
break;
default:
std::cout << "invalid caps id: " << caps_id << std::endl;
std::abort();
return -1;
}
return 0;
}
extern int vx_mem_alloc(vx_device_h hdevice, uint64_t size, uint64_t* dev_maddr) {
if (nullptr == hdevice
|| nullptr == dev_maddr
|| 0 >= size)
return -1;
vx_device *device = ((vx_device*)hdevice);
return device->alloc_local_mem(size, dev_maddr);
}
extern int vx_mem_free(vx_device_h hdevice, uint64_t dev_maddr) {
if (nullptr == hdevice)
return -1;
vx_device *device = ((vx_device*)hdevice);
return device->free_local_mem(dev_maddr);
}
extern int vx_buf_alloc(vx_device_h hdevice, uint64_t size, vx_buffer_h* hbuffer) {
if (nullptr == hdevice
|| 0 >= size
|| nullptr == hbuffer)
return -1;
vx_device *device = ((vx_device*)hdevice);
auto buffer = new vx_buffer(size, device);
if (nullptr == buffer->data()) {
delete buffer;
return -1;
}
*hbuffer = buffer;
return 0;
}
extern void* vx_host_ptr(vx_buffer_h hbuffer) {
if (nullptr == hbuffer)
return nullptr;
vx_buffer* buffer = ((vx_buffer*)hbuffer);
return buffer->data();
}
extern int vx_buf_free(vx_buffer_h hbuffer) {
if (nullptr == hbuffer)
return -1;
vx_buffer* buffer = ((vx_buffer*)hbuffer);
delete buffer;
return 0;
}
extern int vx_copy_to_dev(vx_buffer_h hbuffer, uint64_t dev_maddr, uint64_t size, uint64_t src_offset) {
if (nullptr == hbuffer
|| 0 >= size)
return -1;
auto buffer = (vx_buffer*)hbuffer;
if (size + src_offset > buffer->size())
return -1;
return buffer->device()->upload(buffer->data(), dev_maddr, size, src_offset);
}
extern int vx_copy_from_dev(vx_buffer_h hbuffer, uint64_t dev_maddr, uint64_t size, uint64_t dest_offset) {
if (nullptr == hbuffer
|| 0 >= size)
return -1;
auto buffer = (vx_buffer*)hbuffer;
if (size + dest_offset > buffer->size())
return -1;
return buffer->device()->download(buffer->data(), dev_maddr, size, dest_offset);
}
extern int vx_start(vx_device_h hdevice) {
if (nullptr == hdevice)
return -1;
vx_device *device = ((vx_device*)hdevice);
return device->start();
}
extern int vx_ready_wait(vx_device_h hdevice, uint64_t timeout) {
if (nullptr == hdevice)
return -1;
vx_device *device = ((vx_device*)hdevice);
return device->wait(timeout);
}
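
The removed `vx_device::start()` / `wait()` pair above runs the simulated processor on a background task and polls the resulting `std::future` in one-second slices. The following is a minimal standalone sketch of that polling pattern, not the driver's code: `run_job` is a hypothetical stand-in for `processor_.run()`, and unlike the original `wait()` (which always returns 0) this sketch reports whether the timeout expired.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <future>
#include <thread>

// Hypothetical stand-in for processor_.run(): just sleeps for `work`.
static void run_job(std::chrono::milliseconds work) {
    std::this_thread::sleep_for(work);
}

// Polls `fut` in slices of at most one second until it becomes ready or
// `timeout_ms` has elapsed. Returns true if the job finished in time.
static bool wait_ready(std::future<void>& fut, uint64_t timeout_ms) {
    using namespace std::chrono;
    if (!fut.valid())
        return true; // nothing was started, so there is nothing to wait for
    const auto deadline = steady_clock::now() + milliseconds(timeout_ms);
    for (;;) {
        const auto now = steady_clock::now();
        const milliseconds remaining = (deadline > now)
            ? duration_cast<milliseconds>(deadline - now)
            : milliseconds(0);
        const milliseconds slice = std::min(remaining, milliseconds(1000));
        if (fut.wait_for(slice) == std::future_status::ready)
            return true;
        if (remaining.count() == 0)
            return false; // deadline passed and the job is still running
    }
}
```

A caller would launch the job with `std::async(std::launch::async, run_job, ...)` and pass the returned future to `wait_ready`. Tracking a deadline instead of decrementing a seconds counter keeps sub-second timeouts meaningful, which the original millisecond-to-second division discards.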

@@ -1,49 +0,0 @@
#include <vortex.h>
extern int vx_dev_open(vx_device_h* /*hdevice*/) {
return -1;
}
extern int vx_dev_close(vx_device_h /*hdevice*/) {
return -1;
}
extern int vx_dev_caps(vx_device_h /*hdevice*/, uint32_t /*caps_id*/, uint64_t* /*value*/) {
return -1;
}
extern int vx_mem_alloc(vx_device_h /*hdevice*/, uint64_t /*size*/, uint64_t* /*dev_maddr*/) {
return -1;
}
int vx_mem_free(vx_device_h /*hdevice*/, uint64_t /*dev_maddr*/) {
return -1;
}
extern int vx_buf_alloc(vx_device_h /*hdevice*/, uint64_t /*size*/, vx_buffer_h* /*hbuffer*/) {
return -1;
}
extern void* vx_host_ptr(vx_buffer_h /*hbuffer*/) {
return nullptr;
}
extern int vx_buf_free(vx_buffer_h /*hbuffer*/) {
return -1;
}
extern int vx_copy_to_dev(vx_buffer_h /*hbuffer*/, uint64_t /*dev_maddr*/, uint64_t /*size*/, uint64_t /*src_offset*/) {
return -1;
}
extern int vx_copy_from_dev(vx_buffer_h /*hbuffer*/, uint64_t /*dev_maddr*/, uint64_t /*size*/, uint64_t /*dest_offset*/) {
return -1;
}
extern int vx_start(vx_device_h /*hdevice*/) {
return -1;
}
extern int vx_ready_wait(vx_device_h /*hdevice*/, uint64_t /*timeout*/) {
return -1;
}

@@ -1,60 +0,0 @@
VLSIM_DIR = ../../sim/vlsim
RTL_DIR=../../hw/rtl
SCRIPT_DIR=../../hw/scripts
CXXFLAGS += -std=c++11 -Wall -Wextra -pedantic -Wfatal-errors
CXXFLAGS += -I. -I../include -I../../hw -I$(VLSIM_DIR)
# Position independent code
CXXFLAGS += -fPIC
# Add external configuration
CXXFLAGS += $(CONFIGS)
# Dump perf stats
CXXFLAGS += -DDUMP_PERF_STATS
LDFLAGS += -shared -pthread
LDFLAGS += -L. -lopae-c-vlsim
SRCS = ../common/opae.cpp ../common/vx_utils.cpp
# Debugging
ifdef DEBUG
CXXFLAGS += -g -O0
else
CXXFLAGS += -O2 -DNDEBUG
endif
# Enable scope analyzer
ifdef SCOPE
CXXFLAGS += -DSCOPE
SRCS += ../common/vx_scope.cpp
SCOPE_H = scope-defs.h
endif
# Enable perf counters
ifdef PERF
CXXFLAGS += -DPERF_ENABLE
endif
PROJECT = libvortex.so
all: $(PROJECT)
scope-defs.h: $(SCRIPT_DIR)/scope.json
$(SCRIPT_DIR)/scope.py $(CONFIGS) -cc scope-defs.h -vl $(RTL_DIR)/scope-defs.vh $(SCRIPT_DIR)/scope.json
# generate scope data
scope: scope-defs.h
$(PROJECT): $(SRCS) $(SCOPE_H)
DESTDIR=../../driver/vlsim $(MAKE) -C $(VLSIM_DIR) ../../driver/vlsim/libopae-c-vlsim.so
$(CXX) $(CXXFLAGS) -DUSE_VLSIM $(SRCS) $(LDFLAGS) -o $(PROJECT)
clean:
DESTDIR=../../driver/vlsim $(MAKE) -C $(VLSIM_DIR) clean
rm -rf libopae-c-vlsim.so $(PROJECT) *.o scope-defs.h

3
hw/.gitignore vendored

@@ -1 +1,2 @@
obj_dir/*
VX_config.h
VX_types.h

@@ -1,12 +1,17 @@
RTL_DIR=./rtl
SCRIPT_DIR=./scripts
all: VX_config.h
all: config
config: VX_config.h VX_types.h
VX_config.h: $(RTL_DIR)/VX_config.vh
$(SCRIPT_DIR)/gen_config.py -i $(RTL_DIR)/VX_config.vh -o VX_config.h
clean:
rm -f VX_config.h
VX_types.h: $(RTL_DIR)/VX_types.vh
$(SCRIPT_DIR)/gen_config.py -i $(RTL_DIR)/VX_types.vh -o VX_types.h
.PHONY: VX_config.h
clean:
rm -f VX_config.h VX_types.h
.PHONY: VX_config.h VX_types.h

684
hw/VX_config.h Normal file

@@ -0,0 +1,684 @@
// auto-generated by gen_config.py. DO NOT EDIT
// Generated at 2024-05-07 13:55:58.398687
// Translated from ./rtl/VX_config.vh:
// Copyright © 2019-2023
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#ifndef VX_CONFIG_VH
#define VX_CONFIG_VH
#ifndef MIN
#define MIN(x, y) (((x) < (y)) ? (x) : (y))
#endif
#ifndef MAX
#define MAX(x, y) (((x) > (y)) ? (x) : (y))
#endif
#ifndef CLAMP
#define CLAMP(x, lo, hi) (((x) > (hi)) ? (hi) : (((x) < (lo)) ? (lo) : (x)))
#endif
#ifndef UP
#define UP(x) (((x) != 0) ? (x) : 1)
#endif
///////////////////////////////////////////////////////////////////////////////
#ifndef EXT_M_DISABLE
#define EXT_M_ENABLE
#endif
#ifndef EXT_F_DISABLE
#define EXT_F_ENABLE
#endif
#ifndef XLEN_32
#ifndef XLEN_64
#define XLEN_32
#endif
#endif
#ifdef XLEN_64
#define XLEN 64
#endif
#ifdef XLEN_32
#define XLEN 32
#endif
#ifdef EXT_D_ENABLE
#define FLEN_64
#else
#define FLEN_32
#endif
#ifdef FLEN_64
#define FLEN 64
#endif
#ifdef FLEN_32
#define FLEN 32
#endif
#ifdef XLEN_64
#ifdef FLEN_32
#define FPU_RV64F
#endif
#endif
#ifndef NUM_CLUSTERS
#define NUM_CLUSTERS 1
#endif
#ifndef NUM_CORES
#define NUM_CORES 1
#endif
#ifndef NUM_WARPS
#define NUM_WARPS 4
#endif
#ifndef NUM_THREADS
#define NUM_THREADS 4
#endif
#ifndef NUM_BARRIERS
#define NUM_BARRIERS 8
#endif
#ifndef SOCKET_SIZE
#define SOCKET_SIZE MIN(4, NUM_CORES)
#endif
#define NUM_SOCKETS UP(NUM_CORES / SOCKET_SIZE)
#ifdef L2_ENABLE
#define L2_ENABLED 1
#else
#define L2_ENABLED 0
#endif
#ifdef L3_ENABLE
#define L3_ENABLED 1
#else
#define L3_ENABLED 0
#endif
#ifdef L1_DISABLE
#define ICACHE_DISABLE
#define DCACHE_DISABLE
#endif
#ifndef MEM_BLOCK_SIZE
#define MEM_BLOCK_SIZE 64
#endif
#ifndef MEM_ADDR_WIDTH
#ifdef XLEN_64
#define MEM_ADDR_WIDTH 48
#else
#define MEM_ADDR_WIDTH 32
#endif
#endif
#ifndef L1_LINE_SIZE
#ifdef L1_DISABLE
#define L1_LINE_SIZE ((L2_ENABLED || L3_ENABLED) ? 4 : MEM_BLOCK_SIZE)
#else
#define L1_LINE_SIZE ((L2_ENABLED || L3_ENABLED) ? 16 : MEM_BLOCK_SIZE)
#endif
#endif
#ifdef L2_ENABLE
#define L2_LINE_SIZE MEM_BLOCK_SIZE
#else
#define L2_LINE_SIZE L1_LINE_SIZE
#endif
#ifdef L3_ENABLE
#define L3_LINE_SIZE MEM_BLOCK_SIZE
#else
#define L3_LINE_SIZE L2_LINE_SIZE
#endif
#ifdef XLEN_64
#ifndef STARTUP_ADDR
#define STARTUP_ADDR 0x180000000
#endif
#ifndef STACK_BASE_ADDR
#define STACK_BASE_ADDR 0x1FF000000
#endif
#else
#ifndef STARTUP_ADDR
#define STARTUP_ADDR 0x80000000
#endif
#ifndef STACK_BASE_ADDR
#define STACK_BASE_ADDR 0xFF000000
#endif
#endif
#ifndef SMEM_BASE_ADDR
#define SMEM_BASE_ADDR STACK_BASE_ADDR
#endif
#ifndef SMEM_LOG_SIZE
#define SMEM_LOG_SIZE 17
#endif
#ifndef IO_BASE_ADDR
#define IO_BASE_ADDR (SMEM_BASE_ADDR + (1 << SMEM_LOG_SIZE))
#endif
#ifndef IO_COUT_ADDR
#define IO_COUT_ADDR IO_BASE_ADDR
#endif
#define IO_COUT_SIZE MEM_BLOCK_SIZE
#ifndef IO_CSR_ADDR
#define IO_CSR_ADDR (IO_COUT_ADDR + IO_COUT_SIZE)
#endif
#define IO_CSR_SIZE (4 * 64 * NUM_CORES * NUM_CLUSTERS)
#ifndef STACK_LOG2_SIZE
#define STACK_LOG2_SIZE 13
#endif
#define STACK_SIZE (1 << STACK_LOG2_SIZE)
#define RESET_DELAY 8
#ifndef STALL_TIMEOUT
#define STALL_TIMEOUT (100000 * 1) // Verilog (1 ** n) is always 1; '**' is not a C operator
#endif
#ifndef SV_DPI
#define DPI_DISABLE
#endif
#ifndef FPU_FPNEW
#ifndef FPU_DSP
#ifndef FPU_DPI
#ifndef SYNTHESIS
#ifndef DPI_DISABLE
#define FPU_DPI
#else
#define FPU_DSP
#endif
#else
#define FPU_DSP
#endif
#endif
#endif
#endif
#ifndef SYNTHESIS
#ifndef DPI_DISABLE
#define IMUL_DPI
#define IDIV_DPI
#endif
#endif
#ifndef DEBUG_LEVEL
#define DEBUG_LEVEL 3
#endif
// Pipeline Configuration /////////////////////////////////////////////////////
// Issue width
#ifndef ISSUE_WIDTH
#define ISSUE_WIDTH NUM_WARPS
#endif
// Number of ALU units
#ifndef NUM_ALU_LANES
#define NUM_ALU_LANES NUM_THREADS
#endif
#ifndef NUM_ALU_BLOCKS
#define NUM_ALU_BLOCKS 4
#endif
// Number of FPU units
#ifndef NUM_FPU_LANES
#define NUM_FPU_LANES NUM_THREADS
#endif
#ifndef NUM_FPU_BLOCKS
#define NUM_FPU_BLOCKS 2
#endif
// Number of LSU units
#ifndef NUM_LSU_LANES
#define NUM_LSU_LANES NUM_THREADS
#endif
// Number of SFU units
#ifndef NUM_SFU_LANES
#define NUM_SFU_LANES MIN(NUM_THREADS, 4)
#endif
// Size of Instruction Buffer
#ifndef IBUF_SIZE
#define IBUF_SIZE (4 * ISSUE_WIDTH)
#endif
// Size of LSU Request Queue
#ifndef LSUQ_SIZE
#define LSUQ_SIZE (4 * NUM_WARPS * (NUM_THREADS / NUM_LSU_LANES))
#endif
// LSU Duplicate Address Check
#ifndef LSU_DUP_DISABLE
#define LSU_DUP_ENABLE
#endif
#ifdef LSU_DUP_ENABLE
#define LSU_DUP_ENABLED 1
#else
#define LSU_DUP_ENABLED 0
#endif
#ifdef GBAR_ENABLE
#define GBAR_ENABLED 1
#else
#define GBAR_ENABLED 0
#endif
#ifndef LATENCY_IMUL
#ifdef VIVADO
#define LATENCY_IMUL 4
#endif
#ifdef QUARTUS
#define LATENCY_IMUL 3
#endif
#ifndef LATENCY_IMUL
#define LATENCY_IMUL 4
#endif
#endif
// Floating-Point Units ///////////////////////////////////////////////////////
// Size of FPU Request Queue
#ifndef FPUQ_SIZE
#define FPUQ_SIZE (2 * (NUM_THREADS / NUM_FPU_LANES))
#endif
// FNCP Latency
#ifndef LATENCY_FNCP
#define LATENCY_FNCP 2
#endif
// FMA Latency
#ifndef LATENCY_FMA
#ifdef FPU_DPI
#define LATENCY_FMA 4
#endif
#ifdef FPU_FPNEW
#define LATENCY_FMA 4
#endif
#ifdef FPU_DSP
#ifdef QUARTUS
#define LATENCY_FMA 4
#endif
#ifdef VIVADO
#define LATENCY_FMA 16
#endif
#ifndef LATENCY_FMA
#define LATENCY_FMA 4
#endif
#endif
#endif
// FDIV Latency
#ifndef LATENCY_FDIV
#ifdef FPU_DPI
#define LATENCY_FDIV 15
#endif
#ifdef FPU_FPNEW
#define LATENCY_FDIV 16
#endif
#ifdef FPU_DSP
#ifdef QUARTUS
#define LATENCY_FDIV 15
#endif
#ifdef VIVADO
#define LATENCY_FDIV 28
#endif
#ifndef LATENCY_FDIV
#define LATENCY_FDIV 16
#endif
#endif
#endif
// FSQRT Latency
#ifndef LATENCY_FSQRT
#ifdef FPU_DPI
#define LATENCY_FSQRT 10
#endif
#ifdef FPU_FPNEW
#define LATENCY_FSQRT 16
#endif
#ifdef FPU_DSP
#ifdef QUARTUS
#define LATENCY_FSQRT 10
#endif
#ifdef VIVADO
#define LATENCY_FSQRT 28
#endif
#ifndef LATENCY_FSQRT
#define LATENCY_FSQRT 16
#endif
#endif
#endif
// FCVT Latency
#ifndef LATENCY_FCVT
#define LATENCY_FCVT 5
#endif
// Icache Configurable Knobs //////////////////////////////////////////////////
// Cache Enable
#ifndef ICACHE_DISABLE
#define ICACHE_ENABLE
#endif
#ifdef ICACHE_ENABLE
#define ICACHE_ENABLED 1
#else
#define ICACHE_ENABLED 0
#define NUM_ICACHES 0
#endif
// Number of Cache Units
#ifndef NUM_ICACHES
#define NUM_ICACHES UP(SOCKET_SIZE / 4)
#endif
// Cache Size
#ifndef ICACHE_SIZE
#define ICACHE_SIZE 16384
#endif
// Core Response Queue Size
#ifndef ICACHE_CRSQ_SIZE
#define ICACHE_CRSQ_SIZE 2
#endif
// Miss Handling Register Size
#ifndef ICACHE_MSHR_SIZE
#define ICACHE_MSHR_SIZE 16
#endif
// Memory Request Queue Size
#ifndef ICACHE_MREQ_SIZE
#define ICACHE_MREQ_SIZE 4
#endif
// Memory Response Queue Size
#ifndef ICACHE_MRSQ_SIZE
#define ICACHE_MRSQ_SIZE 0
#endif
// Number of Associative Ways
#ifndef ICACHE_NUM_WAYS
#define ICACHE_NUM_WAYS 1
#endif
// Dcache Configurable Knobs //////////////////////////////////////////////////
// Cache Enable
#ifndef DCACHE_DISABLE
#define DCACHE_ENABLE
#endif
#ifdef DCACHE_ENABLE
#define DCACHE_ENABLED 1
#else
#define DCACHE_ENABLED 0
#define NUM_DCACHES 0
#define DCACHE_NUM_BANKS 1
#endif
// Number of Cache Units
#ifndef NUM_DCACHES
#define NUM_DCACHES UP(SOCKET_SIZE / 4)
#endif
// Cache Size
#ifndef DCACHE_SIZE
#define DCACHE_SIZE 16384
#endif
// Number of Banks
#ifndef DCACHE_NUM_BANKS
#define DCACHE_NUM_BANKS NUM_LSU_LANES
#endif
// Core Response Queue Size
#ifndef DCACHE_CRSQ_SIZE
#define DCACHE_CRSQ_SIZE 2
#endif
// Miss Handling Register Size
#ifndef DCACHE_MSHR_SIZE
#define DCACHE_MSHR_SIZE 8
#endif
// Memory Request Queue Size
#ifndef DCACHE_MREQ_SIZE
#define DCACHE_MREQ_SIZE 4
#endif
// Memory Response Queue Size
#ifndef DCACHE_MRSQ_SIZE
#define DCACHE_MRSQ_SIZE 0
#endif
// Number of Associative Ways
#ifndef DCACHE_NUM_WAYS
#define DCACHE_NUM_WAYS 1
#endif
// SM Configurable Knobs //////////////////////////////////////////////////////
#ifndef SM_DISABLE
#define SM_ENABLE
#endif
#ifdef SM_ENABLE
#define SM_ENABLED 1
#else
#define SM_ENABLED 0
#define SMEM_NUM_BANKS 1
#endif
// Number of Banks
#ifndef SMEM_NUM_BANKS
#define SMEM_NUM_BANKS (NUM_LSU_LANES)
#endif
// L2cache Configurable Knobs /////////////////////////////////////////////////
// Cache Size
#ifndef L2_CACHE_SIZE
#ifdef ALTERA_S10
#define L2_CACHE_SIZE 2097152
#else
#define L2_CACHE_SIZE 1048576
#endif
#endif
// Number of Banks
#ifndef L2_NUM_BANKS
#define L2_NUM_BANKS MIN(4, NUM_SOCKETS)
#endif
// Core Response Queue Size
#ifndef L2_CRSQ_SIZE
#define L2_CRSQ_SIZE 2
#endif
// Miss Handling Register Size
#ifndef L2_MSHR_SIZE
#define L2_MSHR_SIZE 16
#endif
// Memory Request Queue Size
#ifndef L2_MREQ_SIZE
#define L2_MREQ_SIZE 4
#endif
// Memory Response Queue Size
#ifndef L2_MRSQ_SIZE
#define L2_MRSQ_SIZE 0
#endif
// Number of Associative Ways
#ifndef L2_NUM_WAYS
#define L2_NUM_WAYS 2
#endif
// L3cache Configurable Knobs /////////////////////////////////////////////////
// Cache Size
#ifndef L3_CACHE_SIZE
#ifdef ALTERA_S10
#define L3_CACHE_SIZE 2097152
#else
#define L3_CACHE_SIZE 1048576
#endif
#endif
// Number of Banks
#ifndef L3_NUM_BANKS
#define L3_NUM_BANKS MIN(4, NUM_CLUSTERS)
#endif
// Core Response Queue Size
#ifndef L3_CRSQ_SIZE
#define L3_CRSQ_SIZE 2
#endif
// Miss Handling Register Size
#ifndef L3_MSHR_SIZE
#define L3_MSHR_SIZE 16
#endif
// Memory Request Queue Size
#ifndef L3_MREQ_SIZE
#define L3_MREQ_SIZE 4
#endif
// Memory Response Queue Size
#ifndef L3_MRSQ_SIZE
#define L3_MRSQ_SIZE 0
#endif
// Number of Associative Ways
#ifndef L3_NUM_WAYS
#define L3_NUM_WAYS 4
#endif
// ISA Extensions /////////////////////////////////////////////////////////////
#ifdef EXT_A_ENABLE
#define EXT_A_ENABLED 1
#else
#define EXT_A_ENABLED 0
#endif
#ifdef EXT_C_ENABLE
#define EXT_C_ENABLED 1
#else
#define EXT_C_ENABLED 0
#endif
#ifdef EXT_D_ENABLE
#define EXT_D_ENABLED 1
#else
#define EXT_D_ENABLED 0
#endif
#ifdef EXT_F_ENABLE
#define EXT_F_ENABLED 1
#else
#define EXT_F_ENABLED 0
#endif
#ifdef EXT_M_ENABLE
#define EXT_M_ENABLED 1
#else
#define EXT_M_ENABLED 0
#endif
#define ISA_STD_A 0
#define ISA_STD_C 2
#define ISA_STD_D 3
#define ISA_STD_E 4
#define ISA_STD_F 5
#define ISA_STD_H 7
#define ISA_STD_I 8
#define ISA_STD_N 13
#define ISA_STD_Q 16
#define ISA_STD_S 18
#define ISA_STD_U 20
#define ISA_EXT_ICACHE 0
#define ISA_EXT_DCACHE 1
#define ISA_EXT_L2CACHE 2
#define ISA_EXT_L3CACHE 3
#define ISA_EXT_SMEM 4
#define MISA_EXT (ICACHE_ENABLED << ISA_EXT_ICACHE) \
| (DCACHE_ENABLED << ISA_EXT_DCACHE) \
| (L2_ENABLED << ISA_EXT_L2CACHE) \
| (L3_ENABLED << ISA_EXT_L3CACHE) \
| (SM_ENABLED << ISA_EXT_SMEM)
#define MISA_STD (EXT_A_ENABLED << 0) /* A - Atomic Instructions extension */ \
| (0 << 1) /* B - Tentatively reserved for Bit operations extension */ \
| (EXT_C_ENABLED << 2) /* C - Compressed extension */ \
| (EXT_D_ENABLED << 3) /* D - Double precision floating-point extension */ \
| (0 << 4) /* E - RV32E base ISA */ \
| (EXT_F_ENABLED << 5) /* F - Single precision floating-point extension */ \
| (0 << 6) /* G - Additional standard extensions present */ \
| (0 << 7) /* H - Hypervisor mode implemented */ \
| (1 << 8) /* I - RV32I/64I/128I base ISA */ \
| (0 << 9) /* J - Reserved */ \
| (0 << 10) /* K - Reserved */ \
| (0 << 11) /* L - Tentatively reserved for Decimal Floating-Point extension */ \
| (EXT_M_ENABLED << 12) /* M - Integer Multiply/Divide extension */ \
| (0 << 13) /* N - User level interrupts supported */ \
| (0 << 14) /* O - Reserved */ \
| (0 << 15) /* P - Tentatively reserved for Packed-SIMD extension */ \
| (0 << 16) /* Q - Quad-precision floating-point extension */ \
| (0 << 17) /* R - Reserved */ \
| (0 << 18) /* S - Supervisor mode implemented */ \
| (0 << 19) /* T - Tentatively reserved for Transactional Memory extension */ \
| (1 << 20) /* U - User mode implemented */ \
| (0 << 21) /* V - Tentatively reserved for Vector extension */ \
| (0 << 22) /* W - Reserved */ \
| (1 << 23) /* X - Non-standard extensions present */ \
| (0 << 24) /* Y - Reserved */ \
| (0 << 25) /* Z - Reserved */
// Device identification //////////////////////////////////////////////////////
#define VENDOR_ID 0
#define ARCHITECTURE_ID 0
#define IMPLEMENTATION_ID 0
#endif // VX_CONFIG_VH
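
Several values in the header above are derived from others through the `MIN` and `UP` macros, and `MISA_STD` is assembled bit by bit from the `EXT_*_ENABLED` flags. As a rough check of how those derivations resolve, here is a small sketch that mirrors them with `constexpr` functions. The function names are illustrative and not part of the generated header, and `num_cores` is assumed to be at least 1.

```cpp
#include <cstdint>

// Mirrors of the MIN and UP helper macros.
constexpr uint32_t min_u(uint32_t x, uint32_t y) { return x < y ? x : y; }
constexpr uint32_t up(uint32_t x) { return x != 0 ? x : 1; }

// SOCKET_SIZE = MIN(4, NUM_CORES)
constexpr uint32_t socket_size(uint32_t num_cores) { return min_u(4, num_cores); }
// NUM_SOCKETS = UP(NUM_CORES / SOCKET_SIZE)
constexpr uint32_t num_sockets(uint32_t num_cores) {
    return up(num_cores / socket_size(num_cores));
}
// NUM_ICACHES = UP(SOCKET_SIZE / 4); the integer division truncates to 0
// for small sockets, which UP then lifts back to 1.
constexpr uint32_t num_icaches(uint32_t num_cores) {
    return up(socket_size(num_cores) / 4);
}

// MISA_STD: A, C, D, F and M follow the extension flags; I (bit 8),
// U (bit 20) and X (bit 23) are always set in the header.
constexpr uint32_t misa_std(bool a, bool c, bool d, bool f, bool m) {
    return (uint32_t(a) << 0) | (uint32_t(c) << 2) | (uint32_t(d) << 3)
         | (uint32_t(f) << 5) | (1u << 8) | (uint32_t(m) << 12)
         | (1u << 20) | (1u << 23);
}

// Header defaults: one core, F and M enabled, A/C/D disabled.
static_assert(socket_size(1) == 1, "single core forms a one-core socket");
static_assert(num_sockets(1) == 1, "one socket by default");
static_assert(num_icaches(1) == 1, "UP() keeps at least one icache");
static_assert(num_sockets(8) == 2, "eight cores split into two sockets of four");
static_assert(misa_std(false, false, false, true, true) == 0x901120u,
              "bits 5, 8, 12, 20 and 23 set");
```

With an 8-core configuration, `L2_NUM_BANKS` (defined as `MIN(4, NUM_SOCKETS)`) would likewise resolve to 2 under these defaults.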

@@ -1,3 +1,16 @@
// Copyright © 2019-2023
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include <stdio.h>
#include <math.h>
#include <unordered_map>
@@ -5,167 +18,563 @@
#include <mutex>
#include <iostream>
#include <rvfloats.h>
#include <util.h>
#include "svdpi.h"
#include "verilated_vpi.h"
// #include "verilated_vpi.h"
#include "VX_config.h"
#include <bit>
#include "half.h"
extern "C" {
void dpi_fadd(bool enable, int a, int b, const svBitVecVal* frm, int* result, svBitVecVal* fflags);
void dpi_fsub(bool enable, int a, int b, const svBitVecVal* frm, int* result, svBitVecVal* fflags);
void dpi_fmul(bool enable, int a, int b, const svBitVecVal* frm, int* result, svBitVecVal* fflags);
void dpi_fmadd(bool enable, int a, int b, int c, const svBitVecVal* frm, int* result, svBitVecVal* fflags);
void dpi_fmsub(bool enable, int a, int b, int c, const svBitVecVal* frm, int* result, svBitVecVal* fflags);
void dpi_fnmadd(bool enable, int a, int b, int c, const svBitVecVal* frm, int* result, svBitVecVal* fflags);
void dpi_fnmsub(bool enable, int a, int b, int c, const svBitVecVal* frm, int* result, svBitVecVal* fflags);
void dpi_fadd(bool enable, int dst_fmt, int64_t a, int64_t b, const svBitVecVal* frm, int64_t* result, svBitVecVal* fflags);
void dpi_fsub(bool enable, int dst_fmt, int64_t a, int64_t b, const svBitVecVal* frm, int64_t* result, svBitVecVal* fflags);
void dpi_fmul(bool enable, int dst_fmt, int64_t a, int64_t b, const svBitVecVal* frm, int64_t* result, svBitVecVal* fflags);
void dpi_fmadd(bool enable, int dst_fmt, int64_t a, int64_t b, int64_t c, const svBitVecVal* frm, int64_t* result, svBitVecVal* fflags);
void dpi_fmsub(bool enable, int dst_fmt, int64_t a, int64_t b, int64_t c, const svBitVecVal* frm, int64_t* result, svBitVecVal* fflags);
void dpi_fnmadd(bool enable, int dst_fmt, int64_t a, int64_t b, int64_t c, const svBitVecVal* frm, int64_t* result, svBitVecVal* fflags);
void dpi_fnmsub(bool enable, int dst_fmt, int64_t a, int64_t b, int64_t c, const svBitVecVal* frm, int64_t* result, svBitVecVal* fflags);
void dpi_fdiv(bool enable, int a, int b, const svBitVecVal* frm, int* result, svBitVecVal* fflags);
void dpi_fsqrt(bool enable, int a, const svBitVecVal* frm, int* result, svBitVecVal* fflags);
void dpi_fdiv(bool enable, int dst_fmt, int64_t a, int64_t b, const svBitVecVal* frm, int64_t* result, svBitVecVal* fflags);
void dpi_fsqrt(bool enable, int dst_fmt, int64_t a, const svBitVecVal* frm, int64_t* result, svBitVecVal* fflags);
void dpi_ftoi(bool enable, int dst_fmt, int src_fmt, int64_t a, const svBitVecVal* frm, int64_t* result, svBitVecVal* fflags);
void dpi_ftou(bool enable, int dst_fmt, int src_fmt, int64_t a, const svBitVecVal* frm, int64_t* result, svBitVecVal* fflags);
void dpi_itof(bool enable, int dst_fmt, int src_fmt, int64_t a, const svBitVecVal* frm, int64_t* result, svBitVecVal* fflags);
void dpi_utof(bool enable, int dst_fmt, int src_fmt, int64_t a, const svBitVecVal* frm, int64_t* result, svBitVecVal* fflags);
void dpi_f2f(bool enable, int dst_fmt, int64_t a, int64_t* result);
void dpi_ftoi(bool enable, int a, const svBitVecVal* frm, int* result, svBitVecVal* fflags);
void dpi_ftou(bool enable, int a, const svBitVecVal* frm, int* result, svBitVecVal* fflags);
void dpi_itof(bool enable, int a, const svBitVecVal* frm, int* result, svBitVecVal* fflags);
void dpi_utof(bool enable, int a, const svBitVecVal* frm, int* result, svBitVecVal* fflags);
void dpi_fclss(bool enable, int dst_fmt, int64_t a, int64_t* result);
void dpi_fsgnj(bool enable, int dst_fmt, int64_t a, int64_t b, int64_t* result);
void dpi_fsgnjn(bool enable, int dst_fmt, int64_t a, int64_t b, int64_t* result);
void dpi_fsgnjx(bool enable, int dst_fmt, int64_t a, int64_t b, int64_t* result);
void dpi_fclss(bool enable, int a, int* result);
void dpi_fsgnj(bool enable, int a, int b, int* result);
void dpi_fsgnjn(bool enable, int a, int b, int* result);
void dpi_fsgnjx(bool enable, int a, int b, int* result);
void dpi_flt(bool enable, int dst_fmt, int64_t a, int64_t b, int64_t* result, svBitVecVal* fflags);
void dpi_fle(bool enable, int dst_fmt, int64_t a, int64_t b, int64_t* result, svBitVecVal* fflags);
void dpi_feq(bool enable, int dst_fmt, int64_t a, int64_t b, int64_t* result, svBitVecVal* fflags);
void dpi_fmin(bool enable, int dst_fmt, int64_t a, int64_t b, int64_t* result, svBitVecVal* fflags);
void dpi_fmax(bool enable, int dst_fmt, int64_t a, int64_t b, int64_t* result, svBitVecVal* fflags);
void dpi_flt(bool enable, int a, int b, int* result, svBitVecVal* fflags);
void dpi_fle(bool enable, int a, int b, int* result, svBitVecVal* fflags);
void dpi_feq(bool enable, int a, int b, int* result, svBitVecVal* fflags);
void dpi_fmin(bool enable, int a, int b, int* result, svBitVecVal* fflags);
void dpi_fmax(bool enable, int a, int b, int* result, svBitVecVal* fflags);
void dpi_hmma(bool enable, const svBitVecVal* A_tile, const svBitVecVal* B_tile, const svBitVecVal* C_tile, svBitVecVal* D_tile);
void dpi_print_results(int wid, int octet, const svBitVecVal* A_tile, const svBitVecVal* B_tile, const svBitVecVal* C_tile, const svBitVecVal* D_tile);
}
void dpi_fadd(bool enable, int a, int b, const svBitVecVal* frm, int* result, svBitVecVal* fflags) {
if (!enable)
return;
*result = rv_fadd_s(a, b, (*frm & 0x7), fflags);
inline uint64_t nan_box(uint32_t value) {
#ifdef FPU_RV64F
return value | 0xffffffff00000000;
#else
return value;
#endif
}
void dpi_fsub(bool enable, int a, int b, const svBitVecVal* frm, int* result, svBitVecVal* fflags) {
if (!enable)
return;
*result = rv_fsub_s(a, b, (*frm & 0x7), fflags);
inline bool is_nan_boxed(uint64_t value) {
#ifdef FPU_RV64F
return (uint32_t(value >> 32) == 0xffffffff);
#else
__unused (value);
return true;
#endif
}
void dpi_fmul(bool enable, int a, int b, const svBitVecVal* frm, int* result, svBitVecVal* fflags) {
if (!enable)
return;
*result = rv_fmul_s(a, b, (*frm & 0x7), fflags);
inline int64_t check_boxing(int64_t a) {
if (!is_nan_boxed(a)) {
return nan_box(0x7fc00000); // NaN
}
return a;
}
void dpi_fmadd(bool enable, int a, int b, int c, const svBitVecVal* frm, int* result, svBitVecVal* fflags) {
void dpi_fadd(bool enable, int dst_fmt, int64_t a, int64_t b, const svBitVecVal* frm, int64_t* result, svBitVecVal* fflags) {
if (!enable)
return;
*result = rv_fmadd_s(a, b, c, (*frm & 0x7), fflags);
if (dst_fmt) {
*result = rv_fadd_d(a, b, (*frm & 0x7), fflags);
} else {
*result = nan_box(rv_fadd_s(check_boxing(a), check_boxing(b), (*frm & 0x7), fflags));
}
}
void dpi_fmsub(bool enable, int a, int b, int c, const svBitVecVal* frm, int* result, svBitVecVal* fflags) {
void dpi_fsub(bool enable, int dst_fmt, int64_t a, int64_t b, const svBitVecVal* frm, int64_t* result, svBitVecVal* fflags) {
if (!enable)
return;
*result = rv_fmsub_s(a, b, c, (*frm & 0x7), fflags);
if (dst_fmt) {
*result = rv_fsub_d(a, b, (*frm & 0x7), fflags);
} else {
*result = nan_box(rv_fsub_s(check_boxing(a), check_boxing(b), (*frm & 0x7), fflags));
}
}
void dpi_fnmadd(bool enable, int a, int b, int c, const svBitVecVal* frm, int* result, svBitVecVal* fflags) {
void dpi_fmul(bool enable, int dst_fmt, int64_t a, int64_t b, const svBitVecVal* frm, int64_t* result, svBitVecVal* fflags) {
if (!enable)
return;
*result = rv_fnmadd_s(a, b, c, (*frm & 0x7), fflags);
if (dst_fmt) {
*result = rv_fmul_d(a, b, (*frm & 0x7), fflags);
} else {
*result = nan_box(rv_fmul_s(check_boxing(a), check_boxing(b), (*frm & 0x7), fflags));
}
}
void dpi_fnmsub(bool enable, int a, int b, int c, const svBitVecVal* frm, int* result, svBitVecVal* fflags) {
void dpi_fmadd(bool enable, int dst_fmt, int64_t a, int64_t b, int64_t c, const svBitVecVal* frm, int64_t* result, svBitVecVal* fflags) {
if (!enable)
return;
*result = rv_fnmsub_s(a, b, c, (*frm & 0x7), fflags);
if (dst_fmt) {
*result = rv_fmadd_d(a, b, c, (*frm & 0x7), fflags);
} else {
*result = nan_box(rv_fmadd_s(check_boxing(a), check_boxing(b), check_boxing(c), (*frm & 0x7), fflags));
}
}
void dpi_fdiv(bool enable, int a, int b, const svBitVecVal* frm, int* result, svBitVecVal* fflags) {
void dpi_fmsub(bool enable, int dst_fmt, int64_t a, int64_t b, int64_t c, const svBitVecVal* frm, int64_t* result, svBitVecVal* fflags) {
if (!enable)
return;
*result = rv_fdiv_s(a, b, (*frm & 0x7), fflags);
if (dst_fmt) {
*result = rv_fmsub_d(a, b, c, (*frm & 0x7), fflags);
} else {
*result = nan_box(rv_fmsub_s(check_boxing(a), check_boxing(b), check_boxing(c), (*frm & 0x7), fflags));
}
}
void dpi_fsqrt(bool enable, int a, const svBitVecVal* frm, int* result, svBitVecVal* fflags) {
void dpi_fnmadd(bool enable, int dst_fmt, int64_t a, int64_t b, int64_t c, const svBitVecVal* frm, int64_t* result, svBitVecVal* fflags) {
if (!enable)
return;
*result = rv_fsqrt_s(a, (*frm & 0x7), fflags);
if (dst_fmt) {
*result = rv_fnmadd_d(a, b, c, (*frm & 0x7), fflags);
} else {
*result = nan_box(rv_fnmadd_s(check_boxing(a), check_boxing(b), check_boxing(c), (*frm & 0x7), fflags));
}
}
void dpi_ftoi(bool enable, int a, const svBitVecVal* frm, int* result, svBitVecVal* fflags) {
void dpi_fnmsub(bool enable, int dst_fmt, int64_t a, int64_t b, int64_t c, const svBitVecVal* frm, int64_t* result, svBitVecVal* fflags) {
if (!enable)
return;
*result = rv_ftoi_s(a, (*frm & 0x7), fflags);
if (dst_fmt) {
*result = rv_fnmsub_d(a, b, c, (*frm & 0x7), fflags);
} else {
*result = nan_box(rv_fnmsub_s(check_boxing(a), check_boxing(b), check_boxing(c), (*frm & 0x7), fflags));
}
}
void dpi_ftou(bool enable, int a, const svBitVecVal* frm, int* result, svBitVecVal* fflags) {
void dpi_fdiv(bool enable, int dst_fmt, int64_t a, int64_t b, const svBitVecVal* frm, int64_t* result, svBitVecVal* fflags) {
if (!enable)
return;
*result = rv_ftou_s(a, (*frm & 0x7), fflags);
if (dst_fmt) {
*result = rv_fdiv_d(a, b, (*frm & 0x7), fflags);
} else {
*result = nan_box(rv_fdiv_s(check_boxing(a), check_boxing(b), (*frm & 0x7), fflags));
}
}
void dpi_itof(bool enable, int a, const svBitVecVal* frm, int* result, svBitVecVal* fflags) {
void dpi_fsqrt(bool enable, int dst_fmt, int64_t a, const svBitVecVal* frm, int64_t* result, svBitVecVal* fflags) {
if (!enable)
return;
*result = rv_itof_s(a, (*frm & 0x7), fflags);
if (dst_fmt) {
*result = rv_fsqrt_d(a, (*frm & 0x7), fflags);
} else {
*result = nan_box(rv_fsqrt_s(check_boxing(a), (*frm & 0x7), fflags));
}
}
void dpi_utof(bool enable, int a, const svBitVecVal* frm, int* result, svBitVecVal* fflags) {
void dpi_ftoi(bool enable, int dst_fmt, int src_fmt, int64_t a, const svBitVecVal* frm, int64_t* result, svBitVecVal* fflags) {
if (!enable)
return;
*result = rv_utof_s(a, (*frm & 0x7), fflags);
if (dst_fmt) {
if (src_fmt) {
*result = rv_ftol_d(a, (*frm & 0x7), fflags);
} else {
*result = rv_ftol_s(check_boxing(a), (*frm & 0x7), fflags);
}
} else {
if (src_fmt) {
*result = sext<uint64_t>(rv_ftoi_d(a, (*frm & 0x7), fflags), 32);
} else {
*result = sext<uint64_t>(rv_ftoi_s(check_boxing(a), (*frm & 0x7), fflags), 32);
}
}
}
void dpi_flt(bool enable, int a, int b, int* result, svBitVecVal* fflags) {
void dpi_ftou(bool enable, int dst_fmt, int src_fmt, int64_t a, const svBitVecVal* frm, int64_t* result, svBitVecVal* fflags) {
if (!enable)
return;
*result = rv_flt_s(a, b, fflags);
if (dst_fmt) {
if (src_fmt) {
*result = rv_ftolu_d(a, (*frm & 0x7), fflags);
} else {
*result = rv_ftolu_s(check_boxing(a), (*frm & 0x7), fflags);
}
} else {
if (src_fmt) {
*result = sext<uint64_t>(rv_ftou_d(a, (*frm & 0x7), fflags), 32);
} else {
*result = sext<uint64_t>(rv_ftou_s(check_boxing(a), (*frm & 0x7), fflags), 32);
}
}
}
void dpi_fle(bool enable, int a, int b, int* result, svBitVecVal* fflags) {
void dpi_itof(bool enable, int dst_fmt, int src_fmt, int64_t a, const svBitVecVal* frm, int64_t* result, svBitVecVal* fflags) {
if (!enable)
return;
*result = rv_fle_s(a, b, fflags);
if (dst_fmt) {
if (src_fmt) {
*result = rv_ltof_d(a, (*frm & 0x7), fflags);
} else {
*result = rv_itof_d(a, (*frm & 0x7), fflags);
}
} else {
if (src_fmt) {
*result = nan_box(rv_ltof_s(a, (*frm & 0x7), fflags));
} else {
*result = nan_box(rv_itof_s(a, (*frm & 0x7), fflags));
}
}
}
void dpi_feq(bool enable, int a, int b, int* result, svBitVecVal* fflags) {
void dpi_utof(bool enable, int dst_fmt, int src_fmt, int64_t a, const svBitVecVal* frm, int64_t* result, svBitVecVal* fflags) {
if (!enable)
return;
*result = rv_feq_s(a, b, fflags);
if (dst_fmt) {
if (src_fmt) {
*result = rv_lutof_d(a, (*frm & 0x7), fflags);
} else {
*result = rv_utof_d(a, (*frm & 0x7), fflags);
}
} else {
if (src_fmt) {
*result = nan_box(rv_lutof_s(a, (*frm & 0x7), fflags));
} else {
*result = nan_box(rv_utof_s(a, (*frm & 0x7), fflags));
}
}
}
void dpi_fmin(bool enable, int a, int b, int* result, svBitVecVal* fflags) {
void dpi_f2f(bool enable, int dst_fmt, int64_t a, int64_t* result) {
if (!enable)
return;
*result = rv_fmin_s(a, b, fflags);
if (dst_fmt) {
*result = rv_ftod((int32_t)check_boxing(a));
} else {
*result = nan_box(rv_dtof(a));
}
}
void dpi_fmax(bool enable, int a, int b, int* result, svBitVecVal* fflags) {
void dpi_fclss(bool enable, int dst_fmt, int64_t a, int64_t* result) {
if (!enable)
return;
*result = rv_fmax_s(a, b, fflags);
if (dst_fmt) {
*result = rv_fclss_d(a);
} else {
*result = rv_fclss_s(check_boxing(a));
}
}
void dpi_fclss(bool enable, int a, int* result) {
void dpi_fsgnj(bool enable, int dst_fmt, int64_t a, int64_t b, int64_t* result) {
if (!enable)
return;
*result = rv_fclss_s(a);
if (dst_fmt) {
*result = rv_fsgnj_d(a, b);
} else {
*result = nan_box(rv_fsgnj_s(check_boxing(a), check_boxing(b)));
}
}
void dpi_fsgnj(bool enable, int a, int b, int* result) {
void dpi_fsgnjn(bool enable, int dst_fmt, int64_t a, int64_t b, int64_t* result) {
if (!enable)
return;
*result = rv_fsgnj_s(a, b);
if (dst_fmt) {
*result = rv_fsgnjn_d(a, b);
} else {
*result = nan_box(rv_fsgnjn_s(check_boxing(a), check_boxing(b)));
}
}
void dpi_fsgnjn(bool enable, int a, int b, int* result) {
void dpi_fsgnjx(bool enable, int dst_fmt, int64_t a, int64_t b, int64_t* result) {
if (!enable)
return;
*result = rv_fsgnjn_s(a, b);
if (dst_fmt) {
*result = rv_fsgnjx_d(a, b);
} else {
*result = nan_box(rv_fsgnjx_s(check_boxing(a), check_boxing(b)));
}
}
void dpi_fsgnjx(bool enable, int a, int b, int* result) {
void dpi_flt(bool enable, int dst_fmt, int64_t a, int64_t b, int64_t* result, svBitVecVal* fflags) {
if (!enable)
return;
*result = rv_fsgnjx_s(a, b);
}
if (dst_fmt) {
*result = rv_flt_d(a, b, fflags);
} else {
*result = rv_flt_s(check_boxing(a), check_boxing(b), fflags);
}
}
void dpi_fle(bool enable, int dst_fmt, int64_t a, int64_t b, int64_t* result, svBitVecVal* fflags) {
if (!enable)
return;
if (dst_fmt) {
*result = rv_fle_d(a, b, fflags);
} else {
*result = rv_fle_s(check_boxing(a), check_boxing(b), fflags);
}
}
void dpi_feq(bool enable, int dst_fmt, int64_t a, int64_t b, int64_t* result, svBitVecVal* fflags) {
if (!enable)
return;
if (dst_fmt) {
*result = rv_feq_d(a, b, fflags);
} else {
*result = rv_feq_s(check_boxing(a), check_boxing(b), fflags);
}
}
void dpi_fmin(bool enable, int dst_fmt, int64_t a, int64_t b, int64_t* result, svBitVecVal* fflags) {
if (!enable)
return;
if (dst_fmt) {
*result = rv_fmin_d(a, b, fflags);
} else {
*result = nan_box(rv_fmin_s(check_boxing(a), check_boxing(b), fflags));
}
}
void dpi_fmax(bool enable, int dst_fmt, int64_t a, int64_t b, int64_t* result, svBitVecVal* fflags) {
if (!enable)
return;
if (dst_fmt) {
*result = rv_fmax_d(a, b, fflags);
} else {
*result = nan_box(rv_fmax_s(check_boxing(a), check_boxing(b), fflags));
}
}
// A is M * K, B is K * M, C is M * M, D is M * M
#define M 4
#define K 2 // FIXME: 4x4x1 / cycle / octet!
// all row major
float c_A_tile[M][K];
float c_B_tile[K][M];
float c_C_tile[M][M];
float c_D_tile[M][M];
// code assumes that svBitVecVal is basically a uint32_t
static_assert(sizeof(svBitVecVal) == 4);
void clear_float_array(float* c_tile, int rows, int cols) {
for (int i = 0; i < rows; i += 1) {
for (int j = 0; j < cols; j += 1) {
int index = i * cols + j;
c_tile[index] = 0.0f;
}
}
}
void fill_float_array(const svBitVecVal* sv_tile, float* c_tile, int rows, int cols) {
for (int i = 0; i < rows; i += 1) {
for (int j = 0; j < cols; j += 1) {
int index = i * cols + j;
svBitVecVal sv_val = sv_tile[index];
uint32_t c_val = sv_val;
float c_float;
memcpy(&c_float, &c_val, sizeof(c_float));
c_tile[index] = c_float;
// std::cout << c_float << " ";
}
// std::cout << std::endl;
}
}
void write_float_array(svBitVecVal* sv_tile, float* c_tile, int rows, int cols) {
for (int i = 0; i < rows; i += 1) {
for (int j = 0; j < cols; j += 1) {
int index = i * cols + j;
svBitVecVal* sv_val = &sv_tile[index];
float c_float = c_tile[index];
memcpy(sv_val, &c_float, sizeof(c_float));
// std::cout << c_float << " ";
}
// std::cout << std::endl;
}
}
void dpi_hmma(bool enable, const svBitVecVal* A_tile, const svBitVecVal* B_tile, const svBitVecVal* C_tile, svBitVecVal* D_tile) {
if (!enable) {
return;
}
clear_float_array(&c_A_tile[0][0], M, K);
clear_float_array(&c_B_tile[0][0], K, M);
clear_float_array(&c_C_tile[0][0], M, M);
clear_float_array(&c_D_tile[0][0], M, M);
// std::cout << "A: " << std::endl;
fill_float_array(A_tile, &c_A_tile[0][0], M, K);
// std::cout << "B: " << std::endl;
fill_float_array(B_tile, &c_B_tile[0][0], K, M);
// std::cout << "C: " << std::endl;
fill_float_array(C_tile, &c_C_tile[0][0], M, M);
for (int i = 0; i < M; i += 1) {
for (int j = 0; j < M; j += 1) {
float accum = c_C_tile[i][j];
for (int k = 0; k < K; k += 1) {
accum += c_A_tile[i][k] * c_B_tile[k][j];
}
c_D_tile[i][j] = accum;
}
}
write_float_array(D_tile, &c_D_tile[0][0], M, M);
}
// 1 copy per warp
float A_tile_full[4][16][8];
float B_tile_full[4][8][16];
float C_tile_full[4][16][16];
float D_tile_full[4][16][16];
int steps[4];
void print_array(float* array, int rows, int cols) {
for (int i = 0; i < rows; i += 1) {
for (int j = 0; j < cols; j += 1) {
std::cout << array[i*cols+j] << " ";
}
std::cout << "\n";
}
std::cout << std::endl;
}
void dpi_print_results(int wid, int octet, const svBitVecVal* A_tile, const svBitVecVal* B_tile, const svBitVecVal* C_tile, const svBitVecVal* D_tile) {
// std::cout << "A: " << std::endl;
fill_float_array(A_tile, &c_A_tile[0][0], M, K);
// std::cout << "B: " << std::endl;
fill_float_array(B_tile, &c_B_tile[0][0], K, M);
// std::cout << "C: " << std::endl;
fill_float_array(C_tile, &c_C_tile[0][0], M, M);
// for some reason this still holds onto old value? very strange
// std::cout << "D: " << std::endl;
fill_float_array(D_tile, &c_D_tile[0][0], M, M);
int octet_row_offset;
int octet_col_offset;
switch(octet) {
case 0:
octet_row_offset = 0;
octet_col_offset = 0;
break;
case 1:
octet_row_offset = 8;
octet_col_offset = 0;
break;
case 2:
octet_row_offset = 0;
octet_col_offset = 8;
break;
case 3:
octet_row_offset = 8;
octet_col_offset = 8;
break;
}
int step_row_offset;
int step_col_offset;
int step = (steps[wid] % 16) / 4;
int set = (steps[wid] / 16);
switch(step) {
case 0:
step_row_offset = 0;
step_col_offset = 0;
break;
case 1:
step_row_offset = 2;
step_col_offset = 0;
break;
case 2:
step_row_offset = 0;
step_col_offset = 4;
break;
case 3:
step_row_offset = 2;
step_col_offset = 4;
break;
}
if (steps[0] >= 48) {
// std::cout << "octet " << octet << " step " << steps[0] << "\n";
// print_array(&c_D_tile[0][0], 4, 4);
}
D_tile_full[wid][octet_row_offset+step_row_offset+0][octet_col_offset+step_col_offset+0] = c_D_tile[0][0];
D_tile_full[wid][octet_row_offset+step_row_offset+0][octet_col_offset+step_col_offset+1] = c_D_tile[0][1];
D_tile_full[wid][octet_row_offset+step_row_offset+0][octet_col_offset+step_col_offset+2] = c_D_tile[0][2];
D_tile_full[wid][octet_row_offset+step_row_offset+0][octet_col_offset+step_col_offset+3] = c_D_tile[0][3];
D_tile_full[wid][octet_row_offset+step_row_offset+1][octet_col_offset+step_col_offset+0] = c_D_tile[1][0];
D_tile_full[wid][octet_row_offset+step_row_offset+1][octet_col_offset+step_col_offset+1] = c_D_tile[1][1];
D_tile_full[wid][octet_row_offset+step_row_offset+1][octet_col_offset+step_col_offset+2] = c_D_tile[1][2];
D_tile_full[wid][octet_row_offset+step_row_offset+1][octet_col_offset+step_col_offset+3] = c_D_tile[1][3];
D_tile_full[wid][octet_row_offset+step_row_offset+4][octet_col_offset+step_col_offset+0] = c_D_tile[2][0];
D_tile_full[wid][octet_row_offset+step_row_offset+4][octet_col_offset+step_col_offset+1] = c_D_tile[2][1];
D_tile_full[wid][octet_row_offset+step_row_offset+4][octet_col_offset+step_col_offset+2] = c_D_tile[2][2];
D_tile_full[wid][octet_row_offset+step_row_offset+4][octet_col_offset+step_col_offset+3] = c_D_tile[2][3];
D_tile_full[wid][octet_row_offset+step_row_offset+5][octet_col_offset+step_col_offset+0] = c_D_tile[3][0];
D_tile_full[wid][octet_row_offset+step_row_offset+5][octet_col_offset+step_col_offset+1] = c_D_tile[3][1];
D_tile_full[wid][octet_row_offset+step_row_offset+5][octet_col_offset+step_col_offset+2] = c_D_tile[3][2];
D_tile_full[wid][octet_row_offset+step_row_offset+5][octet_col_offset+step_col_offset+3] = c_D_tile[3][3];
if (octet == 0 || octet == 1) {
octet_row_offset = octet * 8;
if (step == 0) {
step_row_offset = 0;
}
if (step == 1) {
step_row_offset = 2;
}
if (step == 0 || step == 1) {
A_tile_full[wid][octet_row_offset+step_row_offset+0][set*2+0] = c_A_tile[0][0];
A_tile_full[wid][octet_row_offset+step_row_offset+0][set*2+1] = c_A_tile[0][1];
A_tile_full[wid][octet_row_offset+step_row_offset+1][set*2+0] = c_A_tile[1][0];
A_tile_full[wid][octet_row_offset+step_row_offset+1][set*2+1] = c_A_tile[1][1];
A_tile_full[wid][octet_row_offset+step_row_offset+4][set*2+0] = c_A_tile[2][0];
A_tile_full[wid][octet_row_offset+step_row_offset+4][set*2+1] = c_A_tile[2][1];
A_tile_full[wid][octet_row_offset+step_row_offset+5][set*2+0] = c_A_tile[3][0];
A_tile_full[wid][octet_row_offset+step_row_offset+5][set*2+1] = c_A_tile[3][1];
}
}
if (octet == 0 || octet == 2) {
octet_col_offset = octet * 4;
if (step == 0) {
step_col_offset = 0;
}
else if (step == 2) {
step_col_offset = 4;
}
if (step == 0 || step == 2) {
B_tile_full[wid][set*2+0][octet_col_offset+step_col_offset+0] = c_B_tile[0][0];
B_tile_full[wid][set*2+0][octet_col_offset+step_col_offset+1] = c_B_tile[0][1];
B_tile_full[wid][set*2+0][octet_col_offset+step_col_offset+2] = c_B_tile[0][2];
B_tile_full[wid][set*2+0][octet_col_offset+step_col_offset+3] = c_B_tile[0][3];
B_tile_full[wid][set*2+1][octet_col_offset+step_col_offset+0] = c_B_tile[1][0];
B_tile_full[wid][set*2+1][octet_col_offset+step_col_offset+1] = c_B_tile[1][1];
B_tile_full[wid][set*2+1][octet_col_offset+step_col_offset+2] = c_B_tile[1][2];
B_tile_full[wid][set*2+1][octet_col_offset+step_col_offset+3] = c_B_tile[1][3];
}
}
steps[wid] += 1;
if (steps[wid] % 32 == 0) {
steps[wid] = 0;
std::cout << "warp " << wid << " finished wmma\n";
std::cout << "A tile" << "\n";
print_array(&A_tile_full[wid][0][0], 16, 8);
std::cout << "B tile" << "\n";
print_array(&B_tile_full[wid][0][0], 8, 16);
// std::cout << "C tile" << "\n";
// print_array(&C_tile_full[wid][0][0], 16, 16);
std::cout << "D tile" << "\n";
print_array(&D_tile_full[wid][0][0], 16, 16);
}
}
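For reference, the per-octet computation that `dpi_hmma` performs (D = A·B + C on a 4×2 by 2×4 tile, all row-major, with each element carried as a 32-bit pattern) can be sketched in isolation as below. This is a minimal standalone sketch, not the code in this diff: `hmma_tile`, `bits_to_float`, `float_to_bits`, `TILE_M` and `TILE_K` are hypothetical names, and plain `uint32_t` stands in for `svBitVecVal` under the same 4-byte assumption that the `static_assert` in the diff makes.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical names; the tile shape mirrors the M=4, K=2 defines in the diff.
constexpr int TILE_M = 4;
constexpr int TILE_K = 2;

// Reinterpret a 32-bit pattern as an IEEE-754 float without aliasing problems.
inline float bits_to_float(uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Reinterpret a float as its 32-bit pattern.
inline uint32_t float_to_bits(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// D = A * B + C, where A is TILE_M x TILE_K, B is TILE_K x TILE_M,
// and C, D are TILE_M x TILE_M. All tiles are row-major bit patterns.
void hmma_tile(const uint32_t* a, const uint32_t* b,
               const uint32_t* c, uint32_t* d) {
    for (int i = 0; i < TILE_M; ++i) {
        for (int j = 0; j < TILE_M; ++j) {
            float accum = bits_to_float(c[i * TILE_M + j]);
            for (int k = 0; k < TILE_K; ++k) {
                accum += bits_to_float(a[i * TILE_K + k]) *
                         bits_to_float(b[k * TILE_M + j]);
            }
            d[i * TILE_M + j] = float_to_bits(accum);
        }
    }
}
```

Working directly on the bit patterns avoids the global scratch tiles, which may make the routine easier to unit-test outside the simulator.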


@@ -1,31 +1,50 @@
`ifndef FLOAT_DPI
`define FLOAT_DPI
// Copyright © 2019-2023
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
import "DPI-C" function void dpi_fadd(input logic enable, input int a, input int b, input bit[2:0] frm, output int result, output bit[4:0] fflags);
import "DPI-C" function void dpi_fsub(input logic enable, input int a, input int b, input bit[2:0] frm, output int result, output bit[4:0] fflags);
import "DPI-C" function void dpi_fmul(input logic enable, input int a, input int b, input bit[2:0] frm, output int result, output bit[4:0] fflags);
import "DPI-C" function void dpi_fmadd(input logic enable, input int a, input int b, input int c, input bit[2:0] frm, output int result, output bit[4:0] fflags);
import "DPI-C" function void dpi_fmsub(input logic enable, input int a, input int b, input int c, input bit[2:0] frm, output int result, output bit[4:0] fflags);
import "DPI-C" function void dpi_fnmadd(input logic enable, input int a, input int b, input int c, input bit[2:0] frm, output int result, output bit[4:0] fflags);
import "DPI-C" function void dpi_fnmsub(input logic enable, input int a, input int b, input int c, input bit[2:0] frm, output int result, output bit[4:0] fflags);
`ifndef FLOAT_DPI_VH
`define FLOAT_DPI_VH
import "DPI-C" function void dpi_fdiv(input logic enable, input int a, input int b, input bit[2:0] frm, output int result, output bit[4:0] fflags);
import "DPI-C" function void dpi_fsqrt(input logic enable, input int a, input bit[2:0] frm, output int result, output bit[4:0] fflags);
`include "VX_config.vh"
import "DPI-C" function void dpi_ftoi(input logic enable, input int a, input bit[2:0] frm, output int result, output bit[4:0] fflags);
import "DPI-C" function void dpi_ftou(input logic enable, input int a, input bit[2:0] frm, output int result, output bit[4:0] fflags);
import "DPI-C" function void dpi_itof(input logic enable, input int a, input bit[2:0] frm, output int result, output bit[4:0] fflags);
import "DPI-C" function void dpi_utof(input logic enable, input int a, input bit[2:0] frm, output int result, output bit[4:0] fflags);
import "DPI-C" function void dpi_fadd(input logic enable, input int dst_fmt, input longint a, input longint b, input bit[2:0] frm, output longint result, output bit[4:0] fflags);
import "DPI-C" function void dpi_fsub(input logic enable, input int dst_fmt, input longint a, input longint b, input bit[2:0] frm, output longint result, output bit[4:0] fflags);
import "DPI-C" function void dpi_fmul(input logic enable, input int dst_fmt, input longint a, input longint b, input bit[2:0] frm, output longint result, output bit[4:0] fflags);
import "DPI-C" function void dpi_fmadd(input logic enable, input int dst_fmt, input longint a, input longint b, input longint c, input bit[2:0] frm, output longint result, output bit[4:0] fflags);
import "DPI-C" function void dpi_fmsub(input logic enable, input int dst_fmt, input longint a, input longint b, input longint c, input bit[2:0] frm, output longint result, output bit[4:0] fflags);
import "DPI-C" function void dpi_fnmadd(input logic enable, input int dst_fmt, input longint a, input longint b, input longint c, input bit[2:0] frm, output longint result, output bit[4:0] fflags);
import "DPI-C" function void dpi_fnmsub(input logic enable, input int dst_fmt, input longint a, input longint b, input longint c, input bit[2:0] frm, output longint result, output bit[4:0] fflags);
import "DPI-C" function void dpi_fclss(input logic enable, input int a, output int result);
import "DPI-C" function void dpi_fsgnj(input logic enable, input int a, input int b, output int result);
import "DPI-C" function void dpi_fsgnjn(input logic enable, input int a, input int b, output int result);
import "DPI-C" function void dpi_fsgnjx(input logic enable, input int a, input int b, output int result);
import "DPI-C" function void dpi_fdiv(input logic enable, input int dst_fmt, input longint a, input longint b, input bit[2:0] frm, output longint result, output bit[4:0] fflags);
import "DPI-C" function void dpi_fsqrt(input logic enable, input int dst_fmt, input longint a, input bit[2:0] frm, output longint result, output bit[4:0] fflags);
import "DPI-C" function void dpi_flt(input logic enable, input int a, input int b, output int result, output bit[4:0] fflags);
import "DPI-C" function void dpi_fle(input logic enable, input int a, input int b, output int result, output bit[4:0] fflags);
import "DPI-C" function void dpi_feq(input logic enable, input int a, input int b, output int result, output bit[4:0] fflags);
import "DPI-C" function void dpi_fmin(input logic enable, input int a, input int b, output int result, output bit[4:0] fflags);
import "DPI-C" function void dpi_fmax(input logic enable, input int a, input int b, output int result, output bit[4:0] fflags);
import "DPI-C" function void dpi_ftoi(input logic enable, input int dst_fmt, input int src_fmt, input longint a, input bit[2:0] frm, output longint result, output bit[4:0] fflags);
import "DPI-C" function void dpi_ftou(input logic enable, input int dst_fmt, input int src_fmt, input longint a, input bit[2:0] frm, output longint result, output bit[4:0] fflags);
import "DPI-C" function void dpi_itof(input logic enable, input int dst_fmt, input int src_fmt, input longint a, input bit[2:0] frm, output longint result, output bit[4:0] fflags);
import "DPI-C" function void dpi_utof(input logic enable, input int dst_fmt, input int src_fmt, input longint a, input bit[2:0] frm, output longint result, output bit[4:0] fflags);
import "DPI-C" function void dpi_f2f(input logic enable, input int dst_fmt, input longint a, output longint result);
`endif
import "DPI-C" function void dpi_fclss(input logic enable, input int dst_fmt, input longint a, output longint result);
import "DPI-C" function void dpi_fsgnj(input logic enable, input int dst_fmt, input longint a, input longint b, output longint result);
import "DPI-C" function void dpi_fsgnjn(input logic enable, input int dst_fmt, input longint a, input longint b, output longint result);
import "DPI-C" function void dpi_fsgnjx(input logic enable, input int dst_fmt, input longint a, input longint b, output longint result);
import "DPI-C" function void dpi_flt(input logic enable, input int dst_fmt, input longint a, input longint b, output longint result, output bit[4:0] fflags);
import "DPI-C" function void dpi_fle(input logic enable, input int dst_fmt, input longint a, input longint b, output longint result, output bit[4:0] fflags);
import "DPI-C" function void dpi_feq(input logic enable, input int dst_fmt, input longint a, input longint b, output longint result, output bit[4:0] fflags);
import "DPI-C" function void dpi_fmin(input logic enable, input int dst_fmt, input longint a, input longint b, output longint result, output bit[4:0] fflags);
import "DPI-C" function void dpi_fmax(input logic enable, input int dst_fmt, input longint a, input longint b, output longint result, output bit[4:0] fflags);
import "DPI-C" function void dpi_hmma(input logic enable, input bit[3:0][1:0][31:0] A_tile, input bit[1:0][3:0][31:0] B_tile, input bit[3:0][3:0][31:0] C_tile, output bit[3:0][3:0][31:0] D_tile);
import "DPI-C" function void dpi_print_results(input int wid, input int octet, input bit[3:0][1:0][31:0] A_tile, input bit[1:0][3:0][31:0] B_tile, input bit[3:0][3:0][31:0] C_tile, input bit[3:0][3:0][31:0] D_tile);
`endif
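The widened `longint` operands together with `dst_fmt` imply that single-precision values travel in 64-bit containers. The bodies of `nan_box` and `check_boxing` are not part of this diff, so the following is only a sketch of what such helpers typically do under the RISC-V NaN-boxing rule (upper 32 bits all ones for a valid boxed single; anything else is read as the canonical quiet NaN). The `_sketch` suffixes and the constant names are hypothetical.

```cpp
#include <cstdint>

// Assumed constants: upper-half mask and the RISC-V canonical single-precision NaN.
constexpr uint64_t kBoxMask = 0xFFFFFFFF00000000ull;
constexpr uint32_t kCanonicalNaN32 = 0x7FC00000u;

// Box a 32-bit float pattern into a 64-bit register value (upper bits all ones).
inline uint64_t nan_box_sketch(uint32_t value) {
    return kBoxMask | value;
}

// Unbox a 64-bit register value; an improperly boxed input is treated as
// the canonical NaN rather than being truncated.
inline uint32_t check_boxing_sketch(uint64_t value) {
    if ((value & kBoxMask) == kBoxMask) {
        return static_cast<uint32_t>(value);
    }
    return kCanonicalNaN32;
}
```

For example, boxing the pattern for 1.0f (`0x3F800000`) gives `0xFFFFFFFF3F800000`, and unboxing a value whose upper half is zero yields `0x7FC00000`.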

4018
hw/dpi/half.h Normal file

File diff suppressed because it is too large


@@ -1,23 +1,58 @@
// Copyright © 2019-2023
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include <stdio.h>
#include <stdarg.h>
#include <math.h>
#include <unordered_map>
#include <vector>
#include <mutex>
#include <memory>
#include <iostream>
#include "svdpi.h"
#include "verilated_vpi.h"
#include "VX_config.h"
// #include "verilated_vpi.h"
#include "uuid_gen.h"
#ifdef XLEN_64
#define iword_t int64_t
#define uword_t uint64_t
#define idword_t __int128_t
#define udword_t __uint128_t
#else
#define iword_t int32_t
#define uword_t uint32_t
#define idword_t int64_t
#define udword_t uint64_t
#endif
#ifndef DEBUG_LEVEL
#define DEBUG_LEVEL 3
#endif
extern "C" {
void dpi_imul(bool enable, int a, int b, bool is_signed_a, bool is_signed_b, int* resultl, int* resulth);
void dpi_idiv(bool enable, int a, int b, bool is_signed, int* quotient, int* remainder);
void dpi_imul(bool enable, bool is_signed_a, bool is_signed_b, iword_t a, iword_t b, iword_t* resultl, iword_t* resulth);
void dpi_idiv(bool enable, bool is_signed, iword_t a, iword_t b, iword_t* quotient, iword_t* remainder);
int dpi_register();
void dpi_assert(int inst, bool cond, int delay);
void dpi_trace(const char* format, ...);
void dpi_trace(int level, const char* format, ...);
void dpi_trace_start();
void dpi_trace_stop();
uint64_t dpi_uuid_gen(bool reset, int wid, uint64_t PC);
}
bool sim_trace_enabled();
@@ -93,49 +128,54 @@ void dpi_assert(int inst, bool cond, int delay) {
}
}
void dpi_imul(bool enable, int a, int b, bool is_signed_a, bool is_signed_b, int* resultl, int* resulth) {
///////////////////////////////////////////////////////////////////////////////
void dpi_imul(bool enable, bool is_signed_a, bool is_signed_b, iword_t a, iword_t b, iword_t* resultl, iword_t* resulth) {
if (!enable)
return;
udword_t first = *(uword_t*)&a;
udword_t second = *(uword_t*)&b;
udword_t mask = udword_t(-1) << (8 * sizeof(iword_t));
uint64_t first = *(uint32_t*)&a;
uint64_t second = *(uint32_t*)&b;
if (is_signed_a && (first & 0x80000000)) {
first |= 0xFFFFFFFF00000000;
if (is_signed_a && a < 0) {
first |= mask;
}
if (is_signed_b && (second & 0x80000000)) {
second |= 0xFFFFFFFF00000000;
if (is_signed_b && b < 0) {
second |= mask;
}
uint64_t result;
udword_t result;
if (is_signed_a || is_signed_b) {
result = (int64_t)first * (int64_t)second;
result = idword_t(first) * idword_t(second);
} else {
result = first * second;
}
*resultl = result & 0xFFFFFFFF;
*resulth = (result >> 32) & 0xFFFFFFFF;
}
*resultl = iword_t(result);
*resulth = iword_t(result >> (8 * sizeof(iword_t)));
}
void dpi_idiv(bool enable, int a, int b, bool is_signed, int* quotient, int* remainder) {
void dpi_idiv(bool enable, bool is_signed, iword_t a, iword_t b, iword_t* quotient, iword_t* remainder) {
if (!enable)
return;
uint32_t dividen = *(uint32_t*)&a;
uint32_t divisor = *(uint32_t*)&b;
uword_t dividen = a;
uword_t divisor = b;
auto inf_neg = uword_t(1) << (8 * sizeof(iword_t) - 1);
if (is_signed) {
if (b == 0) {
*quotient = -1;
*remainder = dividen;
} else if (dividen == 0x80000000 && divisor == 0xffffffff) {
} else if (dividen == inf_neg && divisor == -1) {
*remainder = 0;
*quotient = dividen;
} else {
*quotient = (int32_t)dividen / (int32_t)divisor;
*remainder = (int32_t)dividen % (int32_t)divisor;
*quotient = (iword_t)dividen / (iword_t)divisor;
*remainder = (iword_t)dividen % (iword_t)divisor;
}
} else {
if (b == 0) {
@@ -148,9 +188,13 @@ void dpi_idiv(bool enable, int a, int b, bool is_signed, int* quotient, int* rem
}
}
void dpi_trace(const char* format, ...) {
if (!sim_trace_enabled())
///////////////////////////////////////////////////////////////////////////////
void dpi_trace(int level, const char* format, ...) {
if (level > DEBUG_LEVEL)
return;
// if (!sim_trace_enabled())
// return;
va_list va;
va_start(va, format);
vprintf(format, va);
@@ -158,9 +202,33 @@ void dpi_trace(const char* format, ...) {
}
void dpi_trace_start() {
sim_trace_enable(true);
// sim_trace_enable(true);
}
void dpi_trace_stop() {
sim_trace_enable(false);
}
// sim_trace_enable(false);
}
///////////////////////////////////////////////////////////////////////////////
std::unordered_map<uint32_t, std::shared_ptr<vortex::UUIDGenerator>> g_uuid_gens;
uint64_t dpi_uuid_gen(bool reset, int wid, uint64_t PC) {
if (reset) {
g_uuid_gens.clear();
return 0;
}
std::shared_ptr<vortex::UUIDGenerator> uuid_gen;
auto it = g_uuid_gens.find(wid);
if (it == g_uuid_gens.end()) {
uuid_gen = std::make_shared<vortex::UUIDGenerator>();
g_uuid_gens.emplace(wid, uuid_gen);
} else {
uuid_gen = it->second;
}
uint32_t instr_uuid = uuid_gen->get_uuid(PC);
uint32_t instr_id = instr_uuid & 0xffff;
uint32_t instr_ref = instr_uuid >> 16;
uint64_t uuid = (uint64_t(instr_ref) << 32) | (wid << 16) | instr_id;
return uuid;
}
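The idea behind the reworked `dpi_imul` (extend each operand to double width, multiply, then split the product into low and high words) can be illustrated for the 32-bit case as follows. This is a sketch under stated assumptions rather than the code in this diff: `mul_wide32` and `MulResult` are hypothetical names, and each operand is extended according to its own signedness flag, which differs slightly from the `is_signed_a || is_signed_b` branch above.

```cpp
#include <cstdint>

// Hypothetical result type: low and high words of a 32x32 -> 64 bit product.
struct MulResult {
    uint32_t lo;
    uint32_t hi;
};

// Widening multiply covering the MUL/MULH/MULHSU/MULHU style combinations.
MulResult mul_wide32(uint32_t a, uint32_t b, bool signed_a, bool signed_b) {
    // Sign- or zero-extend each operand to 64 bits according to its own flag.
    uint64_t wide_a = signed_a
        ? static_cast<uint64_t>(static_cast<int64_t>(static_cast<int32_t>(a)))
        : static_cast<uint64_t>(a);
    uint64_t wide_b = signed_b
        ? static_cast<uint64_t>(static_cast<int64_t>(static_cast<int32_t>(b)))
        : static_cast<uint64_t>(b);

    // Unsigned multiplication wraps modulo 2^64, which gives the correct
    // 64-bit two's-complement product for every signedness combination.
    uint64_t product = wide_a * wide_b;

    return MulResult{static_cast<uint32_t>(product),
                     static_cast<uint32_t>(product >> 32)};
}
```

Doing the multiply in unsigned arithmetic sidesteps signed-overflow concerns; the same pattern should extend to XLEN=64 with `__uint128_t` where the compiler provides it.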


@@ -1,14 +1,37 @@
`ifndef UTIL_DPI
`define UTIL_DPI
// Copyright © 2019-2023
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
import "DPI-C" function void dpi_imul(input logic enable, input int a, input int b, input logic is_signed_a, input logic is_signed_b, output int resultl, output int resulth);
import "DPI-C" function void dpi_idiv(input logic enable, input int a, input int b, input logic is_signed, output int quotient, output int remainder);
`ifndef UTIL_DPI_VH
`define UTIL_DPI_VH
`include "VX_config.vh"
`ifdef XLEN_64
`define INT_TYPE longint
`else
`define INT_TYPE int
`endif
import "DPI-C" function void dpi_imul(input logic enable, input logic is_signed_a, input logic is_signed_b, input `INT_TYPE a, input `INT_TYPE b, output `INT_TYPE resultl, output `INT_TYPE resulth);
import "DPI-C" function void dpi_idiv(input logic enable, input logic is_signed, input `INT_TYPE a, input `INT_TYPE b, output `INT_TYPE quotient, output `INT_TYPE remainder);
import "DPI-C" function int dpi_register();
import "DPI-C" function void dpi_assert(int inst, input logic cond, input int delay);
import "DPI-C" function void dpi_trace(input string format /*verilator sformat*/);
import "DPI-C" function void dpi_trace(input int level, input string format /*verilator sformat*/);
import "DPI-C" function void dpi_trace_start();
import "DPI-C" function void dpi_trace_stop();
`endif
import "DPI-C" function longint dpi_uuid_gen(input logic reset, input int wid, input longint PC);
`endif
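The 64-bit value returned through the `dpi_uuid_gen` import is assembled from three fields: a reference count in the upper 32 bits, the warp id starting at bit 16, and a 16-bit instruction id in the low bits. A minimal sketch of just that packing step is shown below; `pack_uuid` is a hypothetical helper, the raw 32-bit value is assumed to come from a generator like `UUIDGenerator::get_uuid` in the diff, and `wid` is assumed to fit in 16 bits.

```cpp
#include <cstdint>

// Pack {instr_ref[15:0], wid, instr_id[15:0]} into one 64-bit identifier.
// instr_uuid is the raw 32-bit value: upper half = reference, lower half = id.
uint64_t pack_uuid(uint32_t instr_uuid, uint32_t wid) {
    uint64_t instr_id  = instr_uuid & 0xffffu;
    uint64_t instr_ref = instr_uuid >> 16;
    // Widen wid before shifting so the shift happens in 64-bit arithmetic.
    return (instr_ref << 32) | (static_cast<uint64_t>(wid) << 16) | instr_id;
}
```

Widening `wid` before the shift is a small defensive choice in this sketch; it keeps the shift well-defined even if a larger warp id were ever passed in.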

1
hw/rtl/.gitignore vendored

@@ -1 +0,0 @@
/VX_user_config.vh


@@ -1,235 +0,0 @@
`include "VX_define.vh"
module VX_alu_unit #(
parameter CORE_ID = 0
) (
input wire clk,
input wire reset,
// Inputs
VX_alu_req_if.slave alu_req_if,
// Outputs
VX_branch_ctl_if.master branch_ctl_if,
VX_commit_if.master alu_commit_if
);
`UNUSED_PARAM (CORE_ID)
reg [`NUM_THREADS-1:0][31:0] alu_result;
wire [`NUM_THREADS-1:0][31:0] add_result;
wire [`NUM_THREADS-1:0][32:0] sub_result;
wire [`NUM_THREADS-1:0][31:0] shr_result;
reg [`NUM_THREADS-1:0][31:0] msc_result;
wire ready_in;
`UNUSED_VAR (alu_req_if.op_mod)
wire is_br_op = `INST_ALU_IS_BR(alu_req_if.op_mod);
wire [`INST_ALU_BITS-1:0] alu_op = `INST_ALU_BITS'(alu_req_if.op_type);
wire [`INST_BR_BITS-1:0] br_op = `INST_BR_BITS'(alu_req_if.op_type);
wire alu_signed = `INST_ALU_SIGNED(alu_op);
wire [1:0] alu_op_class = `INST_ALU_OP_CLASS(alu_op);
wire is_sub = (alu_op == `INST_ALU_SUB);
wire [`NUM_THREADS-1:0][31:0] alu_in1 = alu_req_if.rs1_data;
wire [`NUM_THREADS-1:0][31:0] alu_in2 = alu_req_if.rs2_data;
wire [`NUM_THREADS-1:0][31:0] alu_in1_PC = alu_req_if.use_PC ? {`NUM_THREADS{alu_req_if.PC}} : alu_in1;
wire [`NUM_THREADS-1:0][31:0] alu_in2_imm = alu_req_if.use_imm ? {`NUM_THREADS{alu_req_if.imm}} : alu_in2;
wire [`NUM_THREADS-1:0][31:0] alu_in2_less = (alu_req_if.use_imm && ~is_br_op) ? {`NUM_THREADS{alu_req_if.imm}} : alu_in2;
for (genvar i = 0; i < `NUM_THREADS; i++) begin
assign add_result[i] = alu_in1_PC[i] + alu_in2_imm[i];
end
for (genvar i = 0; i < `NUM_THREADS; i++) begin
wire [32:0] sub_in1 = {alu_signed & alu_in1[i][31], alu_in1[i]};
wire [32:0] sub_in2 = {alu_signed & alu_in2_less[i][31], alu_in2_less[i]};
assign sub_result[i] = sub_in1 - sub_in2;
end
for (genvar i = 0; i < `NUM_THREADS; i++) begin
wire [32:0] shr_in1 = {alu_signed & alu_in1[i][31], alu_in1[i]};
assign shr_result[i] = 32'($signed(shr_in1) >>> alu_in2_imm[i][4:0]);
end
for (genvar i = 0; i < `NUM_THREADS; i++) begin
always @(*) begin
case (alu_op)
`INST_ALU_AND: msc_result[i] = alu_in1[i] & alu_in2_imm[i];
`INST_ALU_OR: msc_result[i] = alu_in1[i] | alu_in2_imm[i];
`INST_ALU_XOR: msc_result[i] = alu_in1[i] ^ alu_in2_imm[i];
//`INST_ALU_SLL,
default: msc_result[i] = alu_in1[i] << alu_in2_imm[i][4:0];
endcase
end
end
for (genvar i = 0; i < `NUM_THREADS; i++) begin
always @(*) begin
case (alu_op_class)
2'b00: alu_result[i] = add_result[i]; // ADD, LUI, AUIPC
2'b01: alu_result[i] = {31'b0, sub_result[i][32]}; // SLTU, SLT
2'b10: alu_result[i] = is_sub ? sub_result[i][31:0] // SUB
: shr_result[i]; // SRL, SRA
// 2'b11,
default: alu_result[i] = msc_result[i]; // AND, OR, XOR, SLL
endcase
end
end
// branch
wire is_jal = is_br_op && (br_op == `INST_BR_JAL || br_op == `INST_BR_JALR);
wire [`NUM_THREADS-1:0][31:0] alu_jal_result = is_jal ? {`NUM_THREADS{alu_req_if.next_PC}} : alu_result;
wire [31:0] br_dest = add_result[alu_req_if.tid];
wire [32:0] cmp_result = sub_result[alu_req_if.tid];
wire is_less = cmp_result[32];
wire is_equal = ~(| cmp_result[31:0]);
// output
wire alu_valid_in;
wire alu_ready_in;
wire alu_valid_out;
wire alu_ready_out;
wire [`UUID_BITS-1:0] alu_uuid;
wire [`NW_BITS-1:0] alu_wid;
wire [`NUM_THREADS-1:0] alu_tmask;
wire [31:0] alu_PC;
wire [`NR_BITS-1:0] alu_rd;
wire alu_wb;
wire [`NUM_THREADS-1:0][31:0] alu_data;
wire [`INST_BR_BITS-1:0] br_op_r;
wire [31:0] br_dest_r;
wire is_less_r;
wire is_equal_r;
wire is_br_op_r;
assign alu_ready_in = alu_ready_out || ~alu_valid_out;
VX_pipe_register #(
.DATAW (1 + `UUID_BITS + `NW_BITS + `NUM_THREADS + 32 + `NR_BITS + 1 + (`NUM_THREADS * 32) + 1 + `INST_BR_BITS + 1 + 1 + 32),
.RESETW (1)
) pipe_reg (
.clk (clk),
.reset (reset),
.enable (alu_ready_in),
.data_in ({alu_valid_in, alu_req_if.uuid, alu_req_if.wid, alu_req_if.tmask, alu_req_if.PC, alu_req_if.rd, alu_req_if.wb, alu_jal_result, is_br_op, br_op, is_less, is_equal, br_dest}),
.data_out ({alu_valid_out, alu_uuid, alu_wid, alu_tmask, alu_PC, alu_rd, alu_wb, alu_data, is_br_op_r, br_op_r, is_less_r, is_equal_r, br_dest_r})
);
`UNUSED_VAR (br_op_r)
wire br_neg = `INST_BR_NEG(br_op_r);
wire br_less = `INST_BR_LESS(br_op_r);
wire br_static = `INST_BR_STATIC(br_op_r);
assign branch_ctl_if.valid = alu_valid_out && alu_ready_out && is_br_op_r;
assign branch_ctl_if.taken = ((br_less ? is_less_r : is_equal_r) ^ br_neg) | br_static;
assign branch_ctl_if.wid = alu_wid;
assign branch_ctl_if.dest = br_dest_r;
`ifdef EXT_M_ENABLE
wire mul_valid_in;
wire mul_ready_in;
wire mul_valid_out;
wire mul_ready_out;
wire [`UUID_BITS-1:0] mul_uuid;
wire [`NW_BITS-1:0] mul_wid;
wire [`NUM_THREADS-1:0] mul_tmask;
wire [31:0] mul_PC;
wire [`NR_BITS-1:0] mul_rd;
wire mul_wb;
wire [`NUM_THREADS-1:0][31:0] mul_data;
wire [`INST_MUL_BITS-1:0] mul_op = `INST_MUL_BITS'(alu_req_if.op_type);
VX_muldiv muldiv (
.clk (clk),
.reset (reset),
// Inputs
.alu_op (mul_op),
.uuid_in (alu_req_if.uuid),
.wid_in (alu_req_if.wid),
.tmask_in (alu_req_if.tmask),
.PC_in (alu_req_if.PC),
.rd_in (alu_req_if.rd),
.wb_in (alu_req_if.wb),
.alu_in1 (alu_req_if.rs1_data),
.alu_in2 (alu_req_if.rs2_data),
// Outputs
.wid_out (mul_wid),
.uuid_out (mul_uuid),
.tmask_out (mul_tmask),
.PC_out (mul_PC),
.rd_out (mul_rd),
.wb_out (mul_wb),
.data_out (mul_data),
// handshake
.valid_in (mul_valid_in),
.ready_in (mul_ready_in),
.valid_out (mul_valid_out),
.ready_out (mul_ready_out)
);
wire is_mul_op = `INST_ALU_IS_MUL(alu_req_if.op_mod);
assign ready_in = is_mul_op ? mul_ready_in : alu_ready_in;
assign alu_valid_in = alu_req_if.valid && ~is_mul_op;
assign mul_valid_in = alu_req_if.valid && is_mul_op;
assign alu_commit_if.valid = alu_valid_out || mul_valid_out;
assign alu_commit_if.uuid = alu_valid_out ? alu_uuid : mul_uuid;
assign alu_commit_if.wid = alu_valid_out ? alu_wid : mul_wid;
assign alu_commit_if.tmask = alu_valid_out ? alu_tmask : mul_tmask;
assign alu_commit_if.PC = alu_valid_out ? alu_PC : mul_PC;
assign alu_commit_if.rd = alu_valid_out ? alu_rd : mul_rd;
assign alu_commit_if.wb = alu_valid_out ? alu_wb : mul_wb;
assign alu_commit_if.data = alu_valid_out ? alu_data : mul_data;
assign alu_ready_out = alu_commit_if.ready;
assign mul_ready_out = alu_commit_if.ready & ~alu_valid_out; // ALU takes priority
`else
assign ready_in = alu_ready_in;
assign alu_valid_in = alu_req_if.valid;
assign alu_commit_if.valid = alu_valid_out;
assign alu_commit_if.uuid = alu_uuid;
assign alu_commit_if.wid = alu_wid;
assign alu_commit_if.tmask = alu_tmask;
assign alu_commit_if.PC = alu_PC;
assign alu_commit_if.rd = alu_rd;
assign alu_commit_if.wb = alu_wb;
assign alu_commit_if.data = alu_data;
assign alu_ready_out = alu_commit_if.ready;
`endif
assign alu_commit_if.eop = 1'b1;
// can accept new request?
assign alu_req_if.ready = ready_in;
`ifdef DBG_TRACE_CORE_PIPELINE
always @(posedge clk) begin
if (branch_ctl_if.valid) begin
dpi_trace("%d: core%0d-branch: wid=%0d, PC=%0h, taken=%b, dest=%0h (#%0d)\n",
$time, CORE_ID, branch_ctl_if.wid, alu_commit_if.PC, branch_ctl_if.taken, branch_ctl_if.dest, alu_uuid);
end
end
`endif
endmodule


@@ -1,159 +0,0 @@
`include "VX_define.vh"
module VX_cache_arb #(
parameter NUM_REQS = 1,
parameter LANES = 1,
parameter DATA_SIZE = 1,
parameter TAG_IN_WIDTH = 1,
parameter TAG_SEL_IDX = 0,
parameter BUFFERED_REQ = 0,
parameter BUFFERED_RSP = 0,
parameter TYPE = "R",
localparam ADDR_WIDTH = (32-`CLOG2(DATA_SIZE)),
localparam DATA_WIDTH = (8 * DATA_SIZE),
localparam LOG_NUM_REQS = `CLOG2(NUM_REQS),
localparam TAG_OUT_WIDTH = TAG_IN_WIDTH + LOG_NUM_REQS
) (
input wire clk,
input wire reset,
// input requests
input wire [NUM_REQS-1:0][LANES-1:0] req_valid_in,
input wire [NUM_REQS-1:0][LANES-1:0] req_rw_in,
input wire [NUM_REQS-1:0][LANES-1:0][DATA_SIZE-1:0] req_byteen_in,
input wire [NUM_REQS-1:0][LANES-1:0][ADDR_WIDTH-1:0] req_addr_in,
input wire [NUM_REQS-1:0][LANES-1:0][DATA_WIDTH-1:0] req_data_in,
input wire [NUM_REQS-1:0][LANES-1:0][TAG_IN_WIDTH-1:0] req_tag_in,
output wire [NUM_REQS-1:0][LANES-1:0] req_ready_in,
// output request
output wire [LANES-1:0] req_valid_out,
output wire [LANES-1:0] req_rw_out,
output wire [LANES-1:0][DATA_SIZE-1:0] req_byteen_out,
output wire [LANES-1:0][ADDR_WIDTH-1:0] req_addr_out,
output wire [LANES-1:0][DATA_WIDTH-1:0] req_data_out,
output wire [LANES-1:0][TAG_OUT_WIDTH-1:0] req_tag_out,
input wire [LANES-1:0] req_ready_out,
// input response
input wire rsp_valid_in,
input wire [LANES-1:0] rsp_tmask_in,
input wire [LANES-1:0][DATA_WIDTH-1:0] rsp_data_in,
input wire [TAG_OUT_WIDTH-1:0] rsp_tag_in,
output wire rsp_ready_in,
// output responses
output wire [NUM_REQS-1:0] rsp_valid_out,
output wire [NUM_REQS-1:0][LANES-1:0] rsp_tmask_out,
output wire [NUM_REQS-1:0][LANES-1:0][DATA_WIDTH-1:0] rsp_data_out,
output wire [NUM_REQS-1:0][TAG_IN_WIDTH-1:0] rsp_tag_out,
input wire [NUM_REQS-1:0] rsp_ready_out
);
localparam REQ_DATAW = TAG_OUT_WIDTH + ADDR_WIDTH + 1 + DATA_SIZE + DATA_WIDTH;
localparam RSP_DATAW = LANES * (1 + DATA_WIDTH) + TAG_IN_WIDTH;
if (NUM_REQS > 1) begin
wire [NUM_REQS-1:0][LANES-1:0][REQ_DATAW-1:0] req_data_in_merged;
wire [LANES-1:0][REQ_DATAW-1:0] req_data_out_merged;
for (genvar i = 0; i < NUM_REQS; i++) begin
for (genvar j = 0; j < LANES; ++j) begin
wire [TAG_OUT_WIDTH-1:0] req_tag_in_w;
VX_bits_insert #(
.N (TAG_IN_WIDTH),
.S (LOG_NUM_REQS),
.POS (TAG_SEL_IDX)
) bits_insert (
.data_in (req_tag_in[i][j]),
.sel_in (LOG_NUM_REQS'(i)),
.data_out (req_tag_in_w)
);
assign req_data_in_merged[i][j] = {req_tag_in_w, req_addr_in[i][j], req_rw_in[i][j], req_byteen_in[i][j], req_data_in[i][j]};
end
end
VX_stream_arbiter #(
.NUM_REQS (NUM_REQS),
.LANES (LANES),
.DATAW (REQ_DATAW),
.BUFFERED (BUFFERED_REQ),
.TYPE (TYPE)
) req_arb (
.clk (clk),
.reset (reset),
.valid_in (req_valid_in),
.data_in (req_data_in_merged),
.ready_in (req_ready_in),
.valid_out (req_valid_out),
.data_out (req_data_out_merged),
.ready_out (req_ready_out)
);
for (genvar i = 0; i < LANES; ++i) begin
assign {req_tag_out[i], req_addr_out[i], req_rw_out[i], req_byteen_out[i], req_data_out[i]} = req_data_out_merged[i];
end
///////////////////////////////////////////////////////////////////////
wire [NUM_REQS-1:0][RSP_DATAW-1:0] rsp_data_out_merged;
wire [LOG_NUM_REQS-1:0] rsp_sel = rsp_tag_in[TAG_SEL_IDX +: LOG_NUM_REQS];
wire [TAG_IN_WIDTH-1:0] rsp_tag_in_w;
VX_bits_remove #(
.N (TAG_OUT_WIDTH),
.S (LOG_NUM_REQS),
.POS (TAG_SEL_IDX)
) bits_remove (
.data_in (rsp_tag_in),
.data_out (rsp_tag_in_w)
);
VX_stream_demux #(
.NUM_REQS (NUM_REQS),
.LANES (1),
.DATAW (RSP_DATAW),
.BUFFERED (BUFFERED_RSP)
) rsp_demux (
.clk (clk),
.reset (reset),
.sel_in (rsp_sel),
.valid_in (rsp_valid_in),
.data_in ({rsp_tmask_in, rsp_tag_in_w, rsp_data_in}),
.ready_in (rsp_ready_in),
.valid_out (rsp_valid_out),
.data_out (rsp_data_out_merged),
.ready_out (rsp_ready_out)
);
for (genvar i = 0; i < NUM_REQS; i++) begin
assign {rsp_tmask_out[i], rsp_tag_out[i], rsp_data_out[i]} = rsp_data_out_merged[i];
end
end else begin
`UNUSED_VAR (clk)
`UNUSED_VAR (reset)
assign req_valid_out = req_valid_in;
assign req_tag_out = req_tag_in;
assign req_addr_out = req_addr_in;
assign req_rw_out = req_rw_in;
assign req_byteen_out = req_byteen_in;
assign req_data_out = req_data_in;
assign req_ready_in = req_ready_out;
assign rsp_valid_out = rsp_valid_in;
assign rsp_tmask_out = rsp_tmask_in;
assign rsp_tag_out = rsp_tag_in;
assign rsp_data_out = rsp_data_in;
assign rsp_ready_in = rsp_ready_out;
end
endmodule
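// Illustrative note (not part of the original change): a worked example of the
// tag routing above, assuming VX_bits_insert keeps the remaining tag bits in
// their original order around the inserted field.
//   NUM_REQS = 4  ->  LOG_NUM_REQS = 2, TAG_OUT_WIDTH = TAG_IN_WIDTH + 2
//   TAG_SEL_IDX = 0, requester i = 3, req_tag_in = 4'b1010
//   request path : req_tag_out = 6'b1010_11  (index 3 inserted at bit 0)
//   response path: rsp_sel     = rsp_tag_in[0 +: 2] = 2'b11  -> requester 3
//                  rsp_tag_out = 4'b1010      (index bits removed again)
// With NUM_REQS = 1 the module is a pure pass-through and tags are unchanged.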


@@ -1,195 +1,171 @@
// Copyright © 2019-2023
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
`include "VX_define.vh"
module VX_cluster #(
module VX_cluster import VX_gpu_pkg::*; #(
parameter CLUSTER_ID = 0
) (
`SCOPE_IO_VX_cluster
`SCOPE_IO_DECL
// Clock
input wire clk,
input wire reset,
input wire clk,
input wire reset,
// Memory request
output wire mem_req_valid,
output wire mem_req_rw,
output wire [`L2_MEM_BYTEEN_WIDTH-1:0] mem_req_byteen,
output wire [`L2_MEM_ADDR_WIDTH-1:0] mem_req_addr,
output wire [`L2_MEM_DATA_WIDTH-1:0] mem_req_data,
output wire [`L2_MEM_TAG_WIDTH-1:0] mem_req_tag,
input wire mem_req_ready,
`ifdef PERF_ENABLE
VX_mem_perf_if.slave mem_perf_if,
`endif
// Memory response
input wire mem_rsp_valid,
input wire [`L2_MEM_DATA_WIDTH-1:0] mem_rsp_data,
input wire [`L2_MEM_TAG_WIDTH-1:0] mem_rsp_tag,
output wire mem_rsp_ready,
// DCRs
VX_dcr_bus_if.slave dcr_bus_if,
// Memory
VX_mem_bus_if.master mem_bus_if,
// simulation helper signals
output wire sim_ebreak,
output wire [`NUM_REGS-1:0][`XLEN-1:0] sim_wb_value,
// Status
output wire busy
);
`STATIC_ASSERT((`L2_ENABLE == 0 || `NUM_CORES > 1), ("invalid parameter"))
output wire busy
);
wire [`NUM_CORES-1:0] per_core_mem_req_valid;
wire [`NUM_CORES-1:0] per_core_mem_req_rw;
wire [`NUM_CORES-1:0][`DCACHE_MEM_BYTEEN_WIDTH-1:0] per_core_mem_req_byteen;
wire [`NUM_CORES-1:0][`DCACHE_MEM_ADDR_WIDTH-1:0] per_core_mem_req_addr;
wire [`NUM_CORES-1:0][`DCACHE_MEM_DATA_WIDTH-1:0] per_core_mem_req_data;
wire [`NUM_CORES-1:0][`L1_MEM_TAG_WIDTH-1:0] per_core_mem_req_tag;
wire [`NUM_CORES-1:0] per_core_mem_req_ready;
`ifdef SCOPE
localparam scope_socket = 0;
`SCOPE_IO_SWITCH (scope_socket + `NUM_SOCKETS);
`endif
wire [`NUM_CORES-1:0] per_core_mem_rsp_valid;
wire [`NUM_CORES-1:0][`DCACHE_MEM_DATA_WIDTH-1:0] per_core_mem_rsp_data;
wire [`NUM_CORES-1:0][`L1_MEM_TAG_WIDTH-1:0] per_core_mem_rsp_tag;
wire [`NUM_CORES-1:0] per_core_mem_rsp_ready;
`ifdef PERF_ENABLE
VX_mem_perf_if mem_perf_tmp_if();
assign mem_perf_tmp_if.icache = 'x;
assign mem_perf_tmp_if.dcache = 'x;
assign mem_perf_tmp_if.l3cache = mem_perf_if.l3cache;
assign mem_perf_tmp_if.smem = 'x;
assign mem_perf_tmp_if.mem = mem_perf_if.mem;
`endif
wire [`NUM_CORES-1:0] per_core_busy;
`ifdef GBAR_ENABLE
for (genvar i = 0; i < `NUM_CORES; i++) begin
VX_gbar_bus_if per_socket_gbar_bus_if[`NUM_SOCKETS]();
VX_gbar_bus_if gbar_bus_if();
`RESET_RELAY (core_reset);
`RESET_RELAY (gbar_reset, reset);
VX_core #(
.CORE_ID(i + (CLUSTER_ID * `NUM_CORES))
) core (
`SCOPE_BIND_VX_cluster_core(i)
VX_gbar_arb #(
.NUM_REQS (`NUM_SOCKETS),
.OUT_REG ((`NUM_SOCKETS > 2) ? 1 : 0) // gbar_unit has no backpressure
) gbar_arb (
.clk (clk),
.reset (gbar_reset),
.bus_in_if (per_socket_gbar_bus_if),
.bus_out_if (gbar_bus_if)
);
VX_gbar_unit #(
.INSTANCE_ID ($sformatf("gbar%0d", CLUSTER_ID))
) gbar_unit (
.clk (clk),
.reset (gbar_reset),
.gbar_bus_if (gbar_bus_if)
);
`endif
VX_mem_bus_if #(
.DATA_SIZE (`L1_LINE_SIZE),
.TAG_WIDTH (L1_MEM_ARB_TAG_WIDTH)
) per_socket_mem_bus_if[`NUM_SOCKETS]();
`RESET_RELAY (l2_reset, reset);
VX_cache_wrap #(
.INSTANCE_ID ("l2cache"),
.CACHE_SIZE (`L2_CACHE_SIZE),
.LINE_SIZE (`L2_LINE_SIZE),
.NUM_BANKS (`L2_NUM_BANKS),
.NUM_WAYS (`L2_NUM_WAYS),
.WORD_SIZE (L2_WORD_SIZE),
.NUM_REQS (L2_NUM_REQS),
.CRSQ_SIZE (`L2_CRSQ_SIZE),
.MSHR_SIZE (`L2_MSHR_SIZE),
.MRSQ_SIZE (`L2_MRSQ_SIZE),
.MREQ_SIZE (`L2_MREQ_SIZE),
.TAG_WIDTH (L2_TAG_WIDTH),
.WRITE_ENABLE (1),
.UUID_WIDTH (`UUID_WIDTH),
.CORE_OUT_REG (2),
.MEM_OUT_REG (2),
.NC_ENABLE (1),
.PASSTHRU (!`L2_ENABLED)
) l2cache (
.clk (clk),
.reset (l2_reset),
`ifdef PERF_ENABLE
.cache_perf (mem_perf_tmp_if.l2cache),
`endif
.core_bus_if (per_socket_mem_bus_if),
.mem_bus_if (mem_bus_if)
);
///////////////////////////////////////////////////////////////////////////
wire [`NUM_SOCKETS-1:0] per_socket_sim_ebreak;
wire [`NUM_SOCKETS-1:0][`NUM_REGS-1:0][`XLEN-1:0] per_socket_sim_wb_value;
assign sim_ebreak = per_socket_sim_ebreak[0];
assign sim_wb_value = per_socket_sim_wb_value[0];
`UNUSED_VAR (per_socket_sim_ebreak)
`UNUSED_VAR (per_socket_sim_wb_value)
VX_dcr_bus_if socket_dcr_bus_tmp_if();
assign socket_dcr_bus_tmp_if.write_valid = dcr_bus_if.write_valid && (dcr_bus_if.write_addr >= `VX_DCR_BASE_STATE_BEGIN && dcr_bus_if.write_addr < `VX_DCR_BASE_STATE_END);
assign socket_dcr_bus_tmp_if.write_addr = dcr_bus_if.write_addr;
assign socket_dcr_bus_tmp_if.write_data = dcr_bus_if.write_data;
wire [`NUM_SOCKETS-1:0] per_socket_busy;
`BUFFER_DCR_BUS_IF (socket_dcr_bus_if, socket_dcr_bus_tmp_if, (`NUM_SOCKETS > 1));
// Generate all sockets
for (genvar i = 0; i < `NUM_SOCKETS; ++i) begin
`RESET_RELAY (socket_reset, reset);
VX_socket #(
.SOCKET_ID ((CLUSTER_ID * `NUM_SOCKETS) + i)
) socket (
`SCOPE_IO_BIND (scope_socket+i)
.clk (clk),
.reset (core_reset),
.mem_req_valid (per_core_mem_req_valid[i]),
.mem_req_rw (per_core_mem_req_rw [i]),
.mem_req_byteen (per_core_mem_req_byteen[i]),
.mem_req_addr (per_core_mem_req_addr [i]),
.mem_req_data (per_core_mem_req_data [i]),
.mem_req_tag (per_core_mem_req_tag [i]),
.mem_req_ready (per_core_mem_req_ready[i]),
.mem_rsp_valid (per_core_mem_rsp_valid[i]),
.mem_rsp_data (per_core_mem_rsp_data [i]),
.mem_rsp_tag (per_core_mem_rsp_tag [i]),
.mem_rsp_ready (per_core_mem_rsp_ready[i]),
.busy (per_core_busy [i])
);
end
assign busy = (| per_core_busy);
if (`L2_ENABLE) begin
`ifdef PERF_ENABLE
VX_perf_cache_if perf_l2cache_if();
`endif
`RESET_RELAY (l2_reset);
VX_cache #(
.CACHE_ID (`L2_CACHE_ID),
.CACHE_SIZE (`L2_CACHE_SIZE),
.CACHE_LINE_SIZE (`L2_CACHE_LINE_SIZE),
.NUM_BANKS (`L2_NUM_BANKS),
.NUM_PORTS (`L2_NUM_PORTS),
.WORD_SIZE (`L2_WORD_SIZE),
.NUM_REQS (`L2_NUM_REQS),
.CREQ_SIZE (`L2_CREQ_SIZE),
.CRSQ_SIZE (`L2_CRSQ_SIZE),
.MSHR_SIZE (`L2_MSHR_SIZE),
.MRSQ_SIZE (`L2_MRSQ_SIZE),
.MREQ_SIZE (`L2_MREQ_SIZE),
.WRITE_ENABLE (1),
.CORE_TAG_WIDTH (`L1_MEM_TAG_WIDTH),
.CORE_TAG_ID_BITS (0),
.MEM_TAG_WIDTH (`L2_MEM_TAG_WIDTH),
.NC_ENABLE (1)
) l2cache (
`SCOPE_BIND_VX_cluster_l2cache
.clk (clk),
.reset (l2_reset),
.reset (socket_reset),
`ifdef PERF_ENABLE
.perf_cache_if (perf_l2cache_if),
.mem_perf_if (mem_perf_tmp_if),
`endif
.dcr_bus_if (socket_dcr_bus_if),
.mem_bus_if (per_socket_mem_bus_if[i]),
`ifdef GBAR_ENABLE
.gbar_bus_if (per_socket_gbar_bus_if[i]),
`endif
// Core request
.core_req_valid (per_core_mem_req_valid),
.core_req_rw (per_core_mem_req_rw),
.core_req_byteen (per_core_mem_req_byteen),
.core_req_addr (per_core_mem_req_addr),
.core_req_data (per_core_mem_req_data),
.core_req_tag (per_core_mem_req_tag),
.core_req_ready (per_core_mem_req_ready),
// Core response
.core_rsp_valid (per_core_mem_rsp_valid),
.core_rsp_data (per_core_mem_rsp_data),
.core_rsp_tag (per_core_mem_rsp_tag),
.core_rsp_ready (per_core_mem_rsp_ready),
`UNUSED_PIN (core_rsp_tmask),
// Memory request
.mem_req_valid (mem_req_valid),
.mem_req_rw (mem_req_rw),
.mem_req_byteen (mem_req_byteen),
.mem_req_addr (mem_req_addr),
.mem_req_data (mem_req_data),
.mem_req_tag (mem_req_tag),
.mem_req_ready (mem_req_ready),
// Memory response
.mem_rsp_valid (mem_rsp_valid),
.mem_rsp_tag (mem_rsp_tag),
.mem_rsp_data (mem_rsp_data),
.mem_rsp_ready (mem_rsp_ready)
.sim_ebreak (per_socket_sim_ebreak[i]),
.sim_wb_value (per_socket_sim_wb_value[i]),
.busy (per_socket_busy[i])
);
end else begin
`RESET_RELAY (mem_arb_reset);
VX_mem_arb #(
.NUM_REQS (`NUM_CORES),
.DATA_WIDTH (`DCACHE_MEM_DATA_WIDTH),
.ADDR_WIDTH (`DCACHE_MEM_ADDR_WIDTH),
.TAG_IN_WIDTH (`L1_MEM_TAG_WIDTH),
.TYPE ("R"),
.TAG_SEL_IDX (1), // Skip 0 for NC flag
.BUFFERED_REQ (1),
.BUFFERED_RSP (1)
) mem_arb (
.clk (clk),
.reset (mem_arb_reset),
// Core request
.req_valid_in (per_core_mem_req_valid),
.req_rw_in (per_core_mem_req_rw),
.req_byteen_in (per_core_mem_req_byteen),
.req_addr_in (per_core_mem_req_addr),
.req_data_in (per_core_mem_req_data),
.req_tag_in (per_core_mem_req_tag),
.req_ready_in (per_core_mem_req_ready),
// Memory request
.req_valid_out (mem_req_valid),
.req_rw_out (mem_req_rw),
.req_byteen_out (mem_req_byteen),
.req_addr_out (mem_req_addr),
.req_data_out (mem_req_data),
.req_tag_out (mem_req_tag),
.req_ready_out (mem_req_ready),
// Core response
.rsp_valid_out (per_core_mem_rsp_valid),
.rsp_data_out (per_core_mem_rsp_data),
.rsp_tag_out (per_core_mem_rsp_tag),
.rsp_ready_out (per_core_mem_rsp_ready),
// Memory response
.rsp_valid_in (mem_rsp_valid),
.rsp_tag_in (mem_rsp_tag),
.rsp_data_in (mem_rsp_data),
.rsp_ready_in (mem_rsp_ready)
);
end
`BUFFER_EX(busy, (| per_socket_busy), 1'b1, (`NUM_SOCKETS > 1));
endmodule


@@ -1,138 +0,0 @@
`include "VX_define.vh"
module VX_commit #(
parameter CORE_ID = 0
) (
input wire clk,
input wire reset,
// inputs
VX_commit_if.slave alu_commit_if,
VX_commit_if.slave ld_commit_if,
VX_commit_if.slave st_commit_if,
VX_commit_if.slave csr_commit_if,
`ifdef EXT_F_ENABLE
VX_commit_if.slave fpu_commit_if,
`endif
VX_commit_if.slave gpu_commit_if,
// outputs
VX_writeback_if.master writeback_if,
VX_cmt_to_csr_if.master cmt_to_csr_if
);
// CSRs update
wire alu_commit_fire = alu_commit_if.valid && alu_commit_if.ready;
wire ld_commit_fire = ld_commit_if.valid && ld_commit_if.ready;
wire st_commit_fire = st_commit_if.valid && st_commit_if.ready;
wire csr_commit_fire = csr_commit_if.valid && csr_commit_if.ready;
`ifdef EXT_F_ENABLE
wire fpu_commit_fire = fpu_commit_if.valid && fpu_commit_if.ready;
`endif
wire gpu_commit_fire = gpu_commit_if.valid && gpu_commit_if.ready;
wire commit_fire = alu_commit_fire
|| ld_commit_fire
|| st_commit_fire
|| csr_commit_fire
`ifdef EXT_F_ENABLE
|| fpu_commit_fire
`endif
|| gpu_commit_fire;
`ifdef EXT_F_ENABLE
wire [(6*`NUM_THREADS)-1:0] commit_tmask;
`else
wire [(5*`NUM_THREADS)-1:0] commit_tmask;
`endif
wire [$clog2($bits(commit_tmask)+1)-1:0] commit_size;
assign commit_tmask = {
{`NUM_THREADS{alu_commit_fire}} & alu_commit_if.tmask,
{`NUM_THREADS{ld_commit_fire}} & ld_commit_if.tmask,
{`NUM_THREADS{st_commit_fire}} & st_commit_if.tmask,
{`NUM_THREADS{csr_commit_fire}} & csr_commit_if.tmask,
`ifdef EXT_F_ENABLE
{`NUM_THREADS{fpu_commit_fire}} & fpu_commit_if.tmask,
`endif
{`NUM_THREADS{gpu_commit_fire}} & gpu_commit_if.tmask
};
`POP_COUNT(commit_size, commit_tmask);
VX_pipe_register #(
.DATAW (1 + $bits(commit_size)),
.RESETW (1)
) pipe_reg (
.clk (clk),
.reset (reset),
.enable (1'b1),
.data_in ({commit_fire, commit_size}),
.data_out ({cmt_to_csr_if.valid, cmt_to_csr_if.commit_size})
);
// Writeback
VX_writeback #(
.CORE_ID(CORE_ID)
) writeback (
.clk (clk),
.reset (reset),
.alu_commit_if (alu_commit_if),
.ld_commit_if (ld_commit_if),
.csr_commit_if (csr_commit_if),
`ifdef EXT_F_ENABLE
.fpu_commit_if (fpu_commit_if),
`endif
.gpu_commit_if (gpu_commit_if),
.writeback_if (writeback_if)
);
// store commits don't write back, so they are always accepted
assign st_commit_if.ready = 1'b1;
`ifdef DBG_TRACE_CORE_PIPELINE
always @(posedge clk) begin
if (alu_commit_if.valid && alu_commit_if.ready) begin
dpi_trace("%d: core%0d-commit: wid=%0d, PC=%0h, ex=ALU, tmask=%b, wb=%0d, rd=%0d, data=", $time, CORE_ID, alu_commit_if.wid, alu_commit_if.PC, alu_commit_if.tmask, alu_commit_if.wb, alu_commit_if.rd);
`TRACE_ARRAY1D(alu_commit_if.data, `NUM_THREADS);
dpi_trace(" (#%0d)\n", alu_commit_if.uuid);
end
if (ld_commit_if.valid && ld_commit_if.ready) begin
dpi_trace("%d: core%0d-commit: wid=%0d, PC=%0h, ex=LSU, tmask=%b, wb=%0d, rd=%0d, data=", $time, CORE_ID, ld_commit_if.wid, ld_commit_if.PC, ld_commit_if.tmask, ld_commit_if.wb, ld_commit_if.rd);
`TRACE_ARRAY1D(ld_commit_if.data, `NUM_THREADS);
dpi_trace(" (#%0d)\n", ld_commit_if.uuid);
end
if (st_commit_if.valid && st_commit_if.ready) begin
dpi_trace("%d: core%0d-commit: wid=%0d, PC=%0h, ex=LSU, tmask=%b, wb=%0d, rd=%0d (#%0d)\n", $time, CORE_ID, st_commit_if.wid, st_commit_if.PC, st_commit_if.tmask, st_commit_if.wb, st_commit_if.rd, st_commit_if.uuid);
end
if (csr_commit_if.valid && csr_commit_if.ready) begin
dpi_trace("%d: core%0d-commit: wid=%0d, PC=%0h, ex=CSR, tmask=%b, wb=%0d, rd=%0d, data=", $time, CORE_ID, csr_commit_if.wid, csr_commit_if.PC, csr_commit_if.tmask, csr_commit_if.wb, csr_commit_if.rd);
`TRACE_ARRAY1D(csr_commit_if.data, `NUM_THREADS);
dpi_trace(" (#%0d)\n", csr_commit_if.uuid);
end
`ifdef EXT_F_ENABLE
if (fpu_commit_if.valid && fpu_commit_if.ready) begin
dpi_trace("%d: core%0d-commit: wid=%0d, PC=%0h, ex=FPU, tmask=%b, wb=%0d, rd=%0d, data=", $time, CORE_ID, fpu_commit_if.wid, fpu_commit_if.PC, fpu_commit_if.tmask, fpu_commit_if.wb, fpu_commit_if.rd);
`TRACE_ARRAY1D(fpu_commit_if.data, `NUM_THREADS);
dpi_trace(" (#%0d)\n", fpu_commit_if.uuid);
end
`endif
if (gpu_commit_if.valid && gpu_commit_if.ready) begin
dpi_trace("%d: core%0d-commit: wid=%0d, PC=%0h, ex=GPU, tmask=%b, wb=%0d, rd=%0d, data=", $time, CORE_ID, gpu_commit_if.wid, gpu_commit_if.PC, gpu_commit_if.tmask, gpu_commit_if.wb, gpu_commit_if.rd);
`TRACE_ARRAY1D(gpu_commit_if.data, `NUM_THREADS);
dpi_trace(" (#%0d)\n", gpu_commit_if.uuid);
end
end
`endif
endmodule


@@ -1,10 +1,88 @@
`ifndef VX_CONFIG
`define VX_CONFIG
// Copyright © 2019-2023
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
`ifndef XLEN
`ifndef VX_CONFIG_VH
`define VX_CONFIG_VH
`ifndef MIN
`define MIN(x, y) (((x) < (y)) ? (x) : (y))
`endif
`ifndef MAX
`define MAX(x, y) (((x) > (y)) ? (x) : (y))
`endif
`ifndef CLAMP
`define CLAMP(x, lo, hi) (((x) > (hi)) ? (hi) : (((x) < (lo)) ? (lo) : (x)))
`endif
`ifndef UP
`define UP(x) (((x) != 0) ? (x) : 1)
`endif
///////////////////////////////////////////////////////////////////////////////
`ifndef EXT_M_DISABLE
`define EXT_M_ENABLE
`endif
`ifndef EXT_F_DISABLE
`define EXT_F_ENABLE
`endif
// core-coupled tensor core
`ifndef EXT_T_DISABLE
`define EXT_T_ENABLE
// decoupled Hopper-style tensor core
// `ifndef EXT_T_HOPPER
// `define EXT_T_HOPPER
// `endif
`endif
`ifndef XLEN_32
`ifndef XLEN_64
`define XLEN_32
`endif
`endif
`ifdef XLEN_64
`define XLEN 64
`endif
`ifdef XLEN_32
`define XLEN 32
`endif
`ifdef EXT_D_ENABLE
`define FLEN_64
`else
`define FLEN_32
`endif
`ifdef FLEN_64
`define FLEN 64
`endif
`ifdef FLEN_32
`define FLEN 32
`endif
`ifdef XLEN_64
`ifdef FLEN_32
`define FPU_RV64F
`endif
`endif
`ifndef NUM_CLUSTERS
`define NUM_CLUSTERS 1
`endif
@@ -17,294 +95,364 @@
`define NUM_WARPS 4
`endif
`ifndef NUM_TENSOR_WARPS
`define NUM_TENSOR_WARPS 2
`endif
`define NUM_SCALAR_WARPS (`NUM_WARPS - `NUM_TENSOR_WARPS)
`define WU_CONFIG_STATIC_ASSERTS \
generate \
if (!(`NUM_WARPS > 0)) begin : g_wu_num_warps_gt_zero \
invalid_NUM_WARPS_must_be_greater_than_zero __wu_config_error(); \
end \
if (!(`NUM_TENSOR_WARPS > 0)) begin : g_wu_num_tensor_warps_gt_zero \
invalid_NUM_TENSOR_WARPS_must_be_greater_than_zero __wu_config_error(); \
end \
if (!(`NUM_TENSOR_WARPS < `NUM_WARPS)) begin : g_wu_num_tensor_warps_lt_num_warps \
invalid_NUM_TENSOR_WARPS_must_be_smaller_than_NUM_WARPS __wu_config_error(); \
end \
if (!(`NUM_SCALAR_WARPS > 0)) begin : g_wu_num_scalar_warps_gt_zero \
invalid_NUM_SCALAR_WARPS_must_be_greater_than_zero __wu_config_error(); \
end \
endgenerate
`define IS_SCALAR_WARP(wid) ((wid) < `NUM_SCALAR_WARPS)
`define IS_TENSOR_WARP(wid) ((wid) >= `NUM_SCALAR_WARPS)
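// Illustrative note (not part of the original change): how the warp partition
// works out with the defaults above.
//   NUM_WARPS = 4, NUM_TENSOR_WARPS = 2  ->  NUM_SCALAR_WARPS = 4 - 2 = 2
//   wid 0, 1 : IS_SCALAR_WARP is true  (wid <  2)
//   wid 2, 3 : IS_TENSOR_WARP is true  (wid >= 2)
// The two predicates are complementary for every wid, so each warp belongs to
// exactly one class. Because NUM_SCALAR_WARPS is derived by subtraction, the
// checks NUM_TENSOR_WARPS < NUM_WARPS and NUM_SCALAR_WARPS > 0 in
// WU_CONFIG_STATIC_ASSERTS express the same constraint.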
`ifndef TENSOR_NUM_GPRS
`define TENSOR_NUM_GPRS 8
`endif
`ifndef TENSOR_NUM_FPRS
`define TENSOR_NUM_FPRS 8
`endif
`ifndef NUM_THREADS
`define NUM_THREADS 4
`endif
`ifndef NUM_BARRIERS
`define NUM_BARRIERS 4
`define NUM_BARRIERS 8
`endif
`ifndef L2_ENABLE
`define L2_ENABLE 0
`ifndef SOCKET_SIZE
`define SOCKET_SIZE `MIN(4, `NUM_CORES)
`endif
`define NUM_SOCKETS `UP(`NUM_CORES / `SOCKET_SIZE)
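// Illustrative note (not part of the original change): worked values for the
// socket macros. NUM_CORES = 8 is an assumed value; its default is not shown
// in this hunk.
//   NUM_CORES = 8 : SOCKET_SIZE = MIN(4, 8) = 4, NUM_SOCKETS = UP(8 / 4) = 2
//   NUM_CORES = 2 : SOCKET_SIZE = MIN(4, 2) = 2, NUM_SOCKETS = UP(2 / 2) = 1
// The division is an integer division and truncates (e.g. 6 / 4 = 1), so a
// core count that is not a multiple of SOCKET_SIZE probably needs an explicit
// SOCKET_SIZE override.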
`ifdef L2_ENABLE
`define L2_ENABLED 1
`else
`define L2_ENABLED 0
`endif
`ifndef L3_ENABLE
`define L3_ENABLE 0
`ifdef L3_ENABLE
`define L3_ENABLED 1
`else
`define L3_ENABLED 0
`endif
`ifndef SM_ENABLE
`define SM_ENABLE 1
`ifdef L1_DISABLE
`define ICACHE_DISABLE
`define DCACHE_DISABLE
`endif
`ifndef MEM_BLOCK_SIZE
`define MEM_BLOCK_SIZE 64
`endif
`ifndef L1_BLOCK_SIZE
`define L1_BLOCK_SIZE ((`L2_ENABLE || `L3_ENABLE) ? 16 : `MEM_BLOCK_SIZE)
`ifndef MEM_ADDR_WIDTH
`ifdef XLEN_64
`define MEM_ADDR_WIDTH 48
`else
`define MEM_ADDR_WIDTH 32
`endif
`endif
`ifndef L1_LINE_SIZE
`ifdef L1_DISABLE
`define L1_LINE_SIZE ((`L2_ENABLED || `L3_ENABLED) ? 4 : `MEM_BLOCK_SIZE)
`else
`define L1_LINE_SIZE ((`L2_ENABLED || `L3_ENABLED) ? 16 : `MEM_BLOCK_SIZE)
`endif
`endif
`ifdef L2_ENABLE
`define L2_LINE_SIZE `MEM_BLOCK_SIZE
`else
`define L2_LINE_SIZE `L1_LINE_SIZE
`endif
`ifdef L3_ENABLE
`define L3_LINE_SIZE `MEM_BLOCK_SIZE
`else
`define L3_LINE_SIZE `L2_LINE_SIZE
`endif
`ifdef XLEN_64
`ifndef STARTUP_ADDR
`define STARTUP_ADDR 64'h180000000
`endif
`ifndef STACK_BASE_ADDR
`define STACK_BASE_ADDR 64'h1FF000000
`endif
`else
`ifndef STARTUP_ADDR
`define STARTUP_ADDR 32'h80000000
`endif
`ifndef IO_BASE_ADDR
`define IO_BASE_ADDR 32'hFF000000
`ifndef STACK_BASE_ADDR
`define STACK_BASE_ADDR 32'hFF000000
`endif
`ifndef IO_ADDR_SIZE
`define IO_ADDR_SIZE (32'hFFFFFFFF - `IO_BASE_ADDR + 1)
`endif
`ifndef IO_COUT_ADDR
`define IO_COUT_ADDR (32'hFFFFFFFF - `MEM_BLOCK_SIZE + 1)
`endif
`ifndef IO_COUT_SIZE
`define IO_COUT_SIZE `MEM_BLOCK_SIZE
`endif
`ifndef IO_CSR_ADDR
`define IO_CSR_ADDR `IO_BASE_ADDR
`endif
`ifndef SMEM_BASE_ADDR
`define SMEM_BASE_ADDR `IO_BASE_ADDR
`define SMEM_BASE_ADDR `STACK_BASE_ADDR
`endif
`ifndef EXT_M_DISABLE
`define EXT_M_ENABLE
`ifndef SMEM_LOG_SIZE
`define SMEM_LOG_SIZE 19
`endif
`ifndef EXT_F_DISABLE
`define EXT_F_ENABLE
`ifndef IO_BASE_ADDR
`define IO_BASE_ADDR (`SMEM_BASE_ADDR + (1 << `SMEM_LOG_SIZE))
`endif
// Device identification
`define VENDOR_ID 0
`define ARCHITECTURE_ID 0
`define IMPLEMENTATION_ID 0
`ifndef IO_COUT_ADDR
`define IO_COUT_ADDR `IO_BASE_ADDR
`endif
`define IO_COUT_SIZE `MEM_BLOCK_SIZE
///////////////////////////////////////////////////////////////////////////////
`ifndef IO_CSR_ADDR
`define IO_CSR_ADDR (`IO_COUT_ADDR + `IO_COUT_SIZE)
`endif
`define IO_CSR_SIZE (4 * 64 * `NUM_CORES * `NUM_CLUSTERS)
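// Illustrative note (not part of the original change): the address map that
// results from the new-side defaults in a 32-bit (XLEN_32) build, assuming
// none of these macros is overridden and MEM_BLOCK_SIZE = 64.
//   STACK_BASE_ADDR = 32'hFF000000
//   SMEM_BASE_ADDR  = STACK_BASE_ADDR          = 32'hFF000000
//   SMEM size       = 1 << SMEM_LOG_SIZE (19)  = 32'h00080000 (512 KiB)
//   IO_BASE_ADDR    = SMEM_BASE_ADDR + SMEM size = 32'hFF080000
//   IO_COUT_ADDR    = IO_BASE_ADDR             = 32'hFF080000
//   IO_COUT_SIZE    = MEM_BLOCK_SIZE           = 64 bytes (32'h40)
//   IO_CSR_ADDR     = IO_COUT_ADDR + IO_COUT_SIZE = 32'hFF080040
//   IO_CSR_SIZE     = 4 * 64 * NUM_CORES * NUM_CLUSTERS, i.e. 256 bytes per core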
`ifndef LATENCY_IMUL
`define LATENCY_IMUL 3
`ifndef STACK_LOG2_SIZE
`define STACK_LOG2_SIZE 13
`endif
`define STACK_SIZE (1 << `STACK_LOG2_SIZE)
`define RESET_DELAY 8
`ifndef STALL_TIMEOUT
`define STALL_TIMEOUT (100000 * (1 ** (`L2_ENABLED + `L3_ENABLED)))
`endif
`ifndef LATENCY_FNCP
`define LATENCY_FNCP 2
`ifndef SV_DPI
`define DPI_DISABLE
`endif
`ifndef LATENCY_FMA
`define LATENCY_FMA 4
`endif
`ifndef LATENCY_FDIV
`ifdef ALTERA_S10
`define LATENCY_FDIV 34
`ifndef FPU_FPNEW
`ifndef FPU_DSP
`ifndef FPU_DPI
`ifndef SYNTHESIS
`ifndef DPI_DISABLE
`define FPU_DPI
`else
`define LATENCY_FDIV 15
`define FPU_DSP
`endif
`endif
`ifndef LATENCY_FSQRT
`ifdef ALTERA_S10
`define LATENCY_FSQRT 25
`else
`define LATENCY_FSQRT 10
`define FPU_DSP
`endif
`endif
`endif
`endif
`ifndef LATENCY_FDIVSQRT
`define LATENCY_FDIVSQRT 32
`ifndef SYNTHESIS
`ifndef DPI_DISABLE
`define IMUL_DPI
`define IDIV_DPI
`endif
`endif
`ifndef LATENCY_FCVT
`define LATENCY_FCVT 5
`ifndef DEBUG_LEVEL
`define DEBUG_LEVEL 3
`endif
`define RESET_DELAY 6
// Pipeline Configuration /////////////////////////////////////////////////////
// CSR Addresses //////////////////////////////////////////////////////////////
// Issue width
`ifndef ISSUE_WIDTH
`define ISSUE_WIDTH `NUM_WARPS
`endif
// User Floating-Point CSRs
`define CSR_FFLAGS 12'h001
`define CSR_FRM 12'h002
`define CSR_FCSR 12'h003
// Number of ALU units
`ifndef NUM_ALU_LANES
`define NUM_ALU_LANES `NUM_THREADS
`endif
`ifndef NUM_ALU_BLOCKS
`define NUM_ALU_BLOCKS 2
`endif
`define CSR_SATP 12'h180
// Number of FPU units
`ifndef NUM_FPU_LANES
`define NUM_FPU_LANES `NUM_THREADS
`endif
`ifndef NUM_FPU_BLOCKS
`define NUM_FPU_BLOCKS 1
`endif
`define CSR_PMPCFG0 12'h3A0
`define CSR_PMPADDR0 12'h3B0
// Number of LSU units
`ifndef NUM_LSU_LANES
`define NUM_LSU_LANES `NUM_THREADS
`endif
`define CSR_MSTATUS 12'h300
`define CSR_MISA 12'h301
`define CSR_MEDELEG 12'h302
`define CSR_MIDELEG 12'h303
`define CSR_MIE 12'h304
`define CSR_MTVEC 12'h305
`define CSR_MEPC 12'h341
// Machine Performance-monitoring counters
`define CSR_MPM_BASE 12'hB00
`define CSR_MPM_BASE_H 12'hB80
// PERF: pipeline
`define CSR_MCYCLE 12'hB00
`define CSR_MCYCLE_H 12'hB80
`define CSR_MPM_RESERVED 12'hB01
`define CSR_MPM_RESERVED_H 12'hB81
`define CSR_MINSTRET 12'hB02
`define CSR_MINSTRET_H 12'hB82
`define CSR_MPM_IBUF_ST 12'hB03
`define CSR_MPM_IBUF_ST_H 12'hB83
`define CSR_MPM_SCRB_ST 12'hB04
`define CSR_MPM_SCRB_ST_H 12'hB84
`define CSR_MPM_ALU_ST 12'hB05
`define CSR_MPM_ALU_ST_H 12'hB85
`define CSR_MPM_LSU_ST 12'hB06
`define CSR_MPM_LSU_ST_H 12'hB86
`define CSR_MPM_CSR_ST 12'hB07
`define CSR_MPM_CSR_ST_H 12'hB87
`define CSR_MPM_FPU_ST 12'hB08
`define CSR_MPM_FPU_ST_H 12'hB88
`define CSR_MPM_GPU_ST 12'hB09
`define CSR_MPM_GPU_ST_H 12'hB89
// PERF: decode
`define CSR_MPM_LOADS 12'hB0A
`define CSR_MPM_LOADS_H 12'hB8A
`define CSR_MPM_STORES 12'hB0B
`define CSR_MPM_STORES_H 12'hB8B
`define CSR_MPM_BRANCHES 12'hB0C
`define CSR_MPM_BRANCHES_H 12'hB8C
// PERF: icache
`define CSR_MPM_ICACHE_READS 12'hB0D // total reads
`define CSR_MPM_ICACHE_READS_H 12'hB8D
`define CSR_MPM_ICACHE_MISS_R 12'hB0E // read misses
`define CSR_MPM_ICACHE_MISS_R_H 12'hB8E
// PERF: dcache
`define CSR_MPM_DCACHE_READS 12'hB0F // total reads
`define CSR_MPM_DCACHE_READS_H 12'hB8F
`define CSR_MPM_DCACHE_WRITES 12'hB10 // total writes
`define CSR_MPM_DCACHE_WRITES_H 12'hB90
`define CSR_MPM_DCACHE_MISS_R 12'hB11 // read misses
`define CSR_MPM_DCACHE_MISS_R_H 12'hB91
`define CSR_MPM_DCACHE_MISS_W 12'hB12 // write misses
`define CSR_MPM_DCACHE_MISS_W_H 12'hB92
`define CSR_MPM_DCACHE_BANK_ST 12'hB13 // bank conflicts
`define CSR_MPM_DCACHE_BANK_ST_H 12'hB93
`define CSR_MPM_DCACHE_MSHR_ST 12'hB14 // MSHR stalls
`define CSR_MPM_DCACHE_MSHR_ST_H 12'hB94
// PERF: smem
`define CSR_MPM_SMEM_READS 12'hB15 // total reads
`define CSR_MPM_SMEM_READS_H 12'hB95
`define CSR_MPM_SMEM_WRITES 12'hB16 // total writes
`define CSR_MPM_SMEM_WRITES_H 12'hB96
`define CSR_MPM_SMEM_BANK_ST 12'hB17 // bank conflicts
`define CSR_MPM_SMEM_BANK_ST_H 12'hB97
// PERF: memory
`define CSR_MPM_MEM_READS 12'hB18 // memory reads
`define CSR_MPM_MEM_READS_H 12'hB98
`define CSR_MPM_MEM_WRITES 12'hB19 // memory writes
`define CSR_MPM_MEM_WRITES_H 12'hB99
`define CSR_MPM_MEM_LAT 12'hB1A // memory latency
`define CSR_MPM_MEM_LAT_H 12'hB9A
// PERF: texunit
`define CSR_MPM_TEX_READS 12'hB1B // texture accesses
`define CSR_MPM_TEX_READS_H 12'hB9B
`define CSR_MPM_TEX_LAT 12'hB1C // texture latency
`define CSR_MPM_TEX_LAT_H 12'hB9C
// Machine Information Registers
`define CSR_MVENDORID 12'hF11
`define CSR_MARCHID 12'hF12
`define CSR_MIMPID 12'hF13
`define CSR_MHARTID 12'hF14
// User SIMT CSRs
`define CSR_WTID 12'hCC0
`define CSR_LTID 12'hCC1
`define CSR_GTID 12'hCC2
`define CSR_LWID 12'hCC3
`define CSR_GWID `CSR_MHARTID
`define CSR_GCID 12'hCC5
`define CSR_TMASK 12'hCC4
// Machine SIMT CSRs
`define CSR_NT 12'hFC0
`define CSR_NW 12'hFC1
`define CSR_NC 12'hFC2
////////// Texture Units //////////////////////////////////////////////////////
`define NUM_TEX_UNITS 2
`define TEX_SUBPIXEL_BITS 8
`define TEX_DIM_BITS 15
`define TEX_LOD_MAX `TEX_DIM_BITS
`define TEX_LOD_BITS 4
`define TEX_FXD_BITS 32
`define TEX_FXD_FRAC (`TEX_DIM_BITS+`TEX_SUBPIXEL_BITS)
`define TEX_STATE_ADDR 0
`define TEX_STATE_WIDTH 1
`define TEX_STATE_HEIGHT 2
`define TEX_STATE_FORMAT 3
`define TEX_STATE_FILTER 4
`define TEX_STATE_WRAPU 5
`define TEX_STATE_WRAPV 6
`define TEX_STATE_MIPOFF(lod) (7+(lod))
`define NUM_TEX_STATES (`TEX_STATE_MIPOFF(`TEX_LOD_MAX)+1)
`define CSR_TEX_UNIT 12'hFD0
`define CSR_TEX_STATE_BEGIN 12'hFD1
`define CSR_TEX_ADDR (`CSR_TEX_STATE_BEGIN+`TEX_STATE_ADDR)
`define CSR_TEX_WIDTH (`CSR_TEX_STATE_BEGIN+`TEX_STATE_WIDTH)
`define CSR_TEX_HEIGHT (`CSR_TEX_STATE_BEGIN+`TEX_STATE_HEIGHT)
`define CSR_TEX_FORMAT (`CSR_TEX_STATE_BEGIN+`TEX_STATE_FORMAT)
`define CSR_TEX_FILTER (`CSR_TEX_STATE_BEGIN+`TEX_STATE_FILTER)
`define CSR_TEX_WRAPU (`CSR_TEX_STATE_BEGIN+`TEX_STATE_WRAPU)
`define CSR_TEX_WRAPV (`CSR_TEX_STATE_BEGIN+`TEX_STATE_WRAPV)
`define CSR_TEX_MIPOFF(lod) (`CSR_TEX_STATE_BEGIN+`TEX_STATE_MIPOFF(lod))
`define CSR_TEX_STATE_END (`CSR_TEX_STATE_BEGIN+`NUM_TEX_STATES)
`define CSR_TEX_STATE(addr) ((addr) - `CSR_TEX_STATE_BEGIN)
// Pipeline Queues ////////////////////////////////////////////////////////////
// Number of SFU units
`ifndef NUM_SFU_LANES
`define NUM_SFU_LANES `MIN(`NUM_THREADS, 4)
`endif
// Size of Instruction Buffer
`ifndef IBUF_SIZE
`define IBUF_SIZE 2
`define IBUF_SIZE (4 * `ISSUE_WIDTH)
`endif
// Size of LSU Request Queue
`ifndef LSUQ_SIZE
`define LSUQ_SIZE (`NUM_WARPS * 2)
`define LSUQ_SIZE (4 * `NUM_WARPS * (`NUM_THREADS / `NUM_LSU_LANES))
`endif
// LSU Duplicate Address Check
`ifndef LSU_DUP_DISABLE
`define LSU_DUP_ENABLE
`endif
`ifdef LSU_DUP_ENABLE
`define LSU_DUP_ENABLED 1
`else
`define LSU_DUP_ENABLED 0
`endif
`ifdef GBAR_ENABLE
`define GBAR_ENABLED 1
`else
`define GBAR_ENABLED 0
`endif
`ifndef LATENCY_IMUL
`ifdef VIVADO
`define LATENCY_IMUL 4
`endif
`ifdef QUARTUS
`define LATENCY_IMUL 3
`endif
`ifndef LATENCY_IMUL
`define LATENCY_IMUL 4
`endif
`endif
// Floating-Point Units ///////////////////////////////////////////////////////
// Size of FPU Request Queue
`ifndef FPUQ_SIZE
`define FPUQ_SIZE 8
`define FPUQ_SIZE (8 * (`NUM_THREADS / `NUM_FPU_LANES))
`endif
// Texture Unit Request Queue
`ifndef TEXQ_SIZE
`define TEXQ_SIZE (`NUM_WARPS * 2)
// FNCP Latency
`ifndef LATENCY_FNCP
`define LATENCY_FNCP 2
`endif
// FMA Latency
`ifndef LATENCY_FMA
`ifdef FPU_DPI
`define LATENCY_FMA 4
`endif
`ifdef FPU_FPNEW
`define LATENCY_FMA 4
`endif
`ifdef FPU_DSP
`ifdef QUARTUS
`define LATENCY_FMA 4
`endif
`ifdef VIVADO
`define LATENCY_FMA 16
`endif
`ifndef LATENCY_FMA
`define LATENCY_FMA 4
`endif
`endif
`endif
// FDIV Latency
`ifndef LATENCY_FDIV
`ifdef FPU_DPI
`define LATENCY_FDIV 15
`endif
`ifdef FPU_FPNEW
`define LATENCY_FDIV 16
`endif
`ifdef FPU_DSP
`ifdef QUARTUS
`define LATENCY_FDIV 15
`endif
`ifdef VIVADO
`define LATENCY_FDIV 28
`endif
`ifndef LATENCY_FDIV
`define LATENCY_FDIV 16
`endif
`endif
`endif
// FSQRT Latency
`ifndef LATENCY_FSQRT
`ifdef FPU_DPI
`define LATENCY_FSQRT 10
`endif
`ifdef FPU_FPNEW
`define LATENCY_FSQRT 16
`endif
`ifdef FPU_DSP
`ifdef QUARTUS
`define LATENCY_FSQRT 10
`endif
`ifdef VIVADO
`define LATENCY_FSQRT 28
`endif
`ifndef LATENCY_FSQRT
`define LATENCY_FSQRT 16
`endif
`endif
`endif
// FCVT Latency
`ifndef LATENCY_FCVT
`define LATENCY_FCVT 5
`endif
// Tensor Core Latency
`ifndef LATENCY_HMMA
`define LATENCY_HMMA 4
`endif
// Icache Configurable Knobs //////////////////////////////////////////////////
// Size of cache in bytes
`ifndef ICACHE_SIZE
`define ICACHE_SIZE 16384
// Cache Enable
`ifndef ICACHE_DISABLE
`define ICACHE_ENABLE
`endif
`ifdef ICACHE_ENABLE
`define ICACHE_ENABLED 1
`else
`define ICACHE_ENABLED 0
`define NUM_ICACHES 0
`endif
// Core Request Queue Size
`ifndef ICACHE_CREQ_SIZE
`define ICACHE_CREQ_SIZE 0
// Number of Cache Units
`ifndef NUM_ICACHES
`define NUM_ICACHES `UP(`SOCKET_SIZE / 4)
`endif
// Cache Size
`ifndef ICACHE_SIZE
`define ICACHE_SIZE 16384
`endif
// Core Response Queue Size
@@ -314,7 +462,7 @@
// Miss Handling Register Size
`ifndef ICACHE_MSHR_SIZE
`define ICACHE_MSHR_SIZE `NUM_WARPS
`define ICACHE_MSHR_SIZE 16
`endif
// Memory Request Queue Size
@@ -327,26 +475,38 @@
`define ICACHE_MRSQ_SIZE 0
`endif
// Number of Associative Ways
`ifndef ICACHE_NUM_WAYS
`define ICACHE_NUM_WAYS 1
`endif
// Dcache Configurable Knobs //////////////////////////////////////////////////
// Size of cache in bytes
// Cache Enable
`ifndef DCACHE_DISABLE
`define DCACHE_ENABLE
`endif
`ifdef DCACHE_ENABLE
`define DCACHE_ENABLED 1
`else
`define DCACHE_ENABLED 0
`define NUM_DCACHES 0
`define DCACHE_NUM_BANKS 1
`endif
// Number of Cache Units
`ifndef NUM_DCACHES
`define NUM_DCACHES `UP(`SOCKET_SIZE / 4)
`endif
// Cache Size
`ifndef DCACHE_SIZE
`define DCACHE_SIZE 16384
`endif
// Number of banks
// Number of Banks
`ifndef DCACHE_NUM_BANKS
`define DCACHE_NUM_BANKS `NUM_THREADS
`endif
// Number of ports per bank
`ifndef DCACHE_NUM_PORTS
`define DCACHE_NUM_PORTS 1
`endif
// Core Request Queue Size
`ifndef DCACHE_CREQ_SIZE
`define DCACHE_CREQ_SIZE 0
`define DCACHE_NUM_BANKS `NUM_LSU_LANES
`endif
// Core Response Queue Size
@@ -356,7 +516,7 @@
// Miss Handling Register Size
`ifndef DCACHE_MSHR_SIZE
`define DCACHE_MSHR_SIZE `LSUQ_SIZE
`define DCACHE_MSHR_SIZE 8
`endif
// Memory Request Queue Size
@@ -369,54 +529,43 @@
`define DCACHE_MRSQ_SIZE 0
`endif
// Number of Associative Ways
`ifndef DCACHE_NUM_WAYS
`define DCACHE_NUM_WAYS 1
`endif
// SM Configurable Knobs //////////////////////////////////////////////////////
// per thread stack size
`ifndef STACK_LOG2_SIZE
`define STACK_LOG2_SIZE 10
`endif
`define STACK_SIZE (1 << `STACK_LOG2_SIZE)
// Size of cache in bytes
`ifndef SMEM_SIZE
`define SMEM_SIZE (`STACK_SIZE * `NUM_WARPS * `NUM_THREADS)
`ifndef SM_DISABLE
`define SM_ENABLE
`endif
// Number of banks
`ifdef SM_ENABLE
`define SM_ENABLED 1
`else
`define SM_ENABLED 0
`define SMEM_NUM_BANKS 1
`endif
// Number of Banks
`ifndef SMEM_NUM_BANKS
`define SMEM_NUM_BANKS `NUM_THREADS
`endif
// Core Request Queue Size
`ifndef SMEM_CREQ_SIZE
`define SMEM_CREQ_SIZE 2
`endif
// Core Response Queue Size
`ifndef SMEM_CRSQ_SIZE
`define SMEM_CRSQ_SIZE 2
`define SMEM_NUM_BANKS (`NUM_LSU_LANES)
`endif
// L2cache Configurable Knobs /////////////////////////////////////////////////
// Size of cache in bytes
// Cache Size
`ifndef L2_CACHE_SIZE
`define L2_CACHE_SIZE 131072
`ifdef ALTERA_S10
`define L2_CACHE_SIZE 2097152
`else
`define L2_CACHE_SIZE 1048576
`endif
`endif
// Number of banks
// Number of Banks
`ifndef L2_NUM_BANKS
`define L2_NUM_BANKS ((`NUM_CORES < 4) ? `NUM_CORES : 4)
`endif
// Number of ports per bank
`ifndef L2_NUM_PORTS
`define L2_NUM_PORTS 1
`endif
// Core Request Queue Size
`ifndef L2_CREQ_SIZE
`define L2_CREQ_SIZE 0
`define L2_NUM_BANKS `MIN(4, `NUM_SOCKETS)
`endif
// Core Response Queue Size
@@ -439,26 +588,25 @@
`define L2_MRSQ_SIZE 0
`endif
// Number of Associative Ways
`ifndef L2_NUM_WAYS
`define L2_NUM_WAYS 2
`endif
// L3cache Configurable Knobs /////////////////////////////////////////////////
// Size of cache in bytes
// Cache Size
`ifndef L3_CACHE_SIZE
`ifdef ALTERA_S10
`define L3_CACHE_SIZE 2097152
`else
`define L3_CACHE_SIZE 1048576
`endif
`endif
// Number of banks
// Number of Banks
`ifndef L3_NUM_BANKS
`define L3_NUM_BANKS ((`NUM_CLUSTERS < 4) ? `NUM_CORES : 4)
`endif
// Number of ports per bank
`ifndef L3_NUM_PORTS
`define L3_NUM_PORTS 1
`endif
// Core Request Queue Size
`ifndef L3_CREQ_SIZE
`define L3_CREQ_SIZE 0
`define L3_NUM_BANKS `MIN(4, `NUM_CLUSTERS)
`endif
// Core Response Queue Size
@@ -481,4 +629,104 @@
`define L3_MRSQ_SIZE 0
`endif
`endif
// Number of Associative Ways
`ifndef L3_NUM_WAYS
`define L3_NUM_WAYS 4
`endif
// ISA Extensions /////////////////////////////////////////////////////////////
`ifdef EXT_A_ENABLE
`define EXT_A_ENABLED 1
`else
`define EXT_A_ENABLED 0
`endif
`ifdef EXT_C_ENABLE
`define EXT_C_ENABLED 1
`else
`define EXT_C_ENABLED 0
`endif
`ifdef EXT_D_ENABLE
`define EXT_D_ENABLED 1
`else
`define EXT_D_ENABLED 0
`endif
`ifdef EXT_F_ENABLE
`define EXT_F_ENABLED 1
`else
`define EXT_F_ENABLED 0
`endif
`ifdef EXT_T_ENABLE
`define EXT_T_ENABLED 1
`else
`define EXT_T_ENABLED 0
`endif
`ifdef EXT_M_ENABLE
`define EXT_M_ENABLED 1
`else
`define EXT_M_ENABLED 0
`endif
`define ISA_STD_A 0
`define ISA_STD_C 2
`define ISA_STD_D 3
`define ISA_STD_E 4
`define ISA_STD_F 5
`define ISA_STD_H 7
`define ISA_STD_I 8
`define ISA_STD_N 13
`define ISA_STD_Q 16
`define ISA_STD_S 18
`define ISA_STD_U 20
`define ISA_EXT_ICACHE 0
`define ISA_EXT_DCACHE 1
`define ISA_EXT_L2CACHE 2
`define ISA_EXT_L3CACHE 3
`define ISA_EXT_SMEM 4
`define MISA_EXT (`ICACHE_ENABLED << `ISA_EXT_ICACHE) \
| (`DCACHE_ENABLED << `ISA_EXT_DCACHE) \
| (`L2_ENABLED << `ISA_EXT_L2CACHE) \
| (`L3_ENABLED << `ISA_EXT_L3CACHE) \
| (`SM_ENABLED << `ISA_EXT_SMEM)
`define MISA_STD (`EXT_A_ENABLED << 0) /* A - Atomic Instructions extension */ \
| (0 << 1) /* B - Tentatively reserved for Bit operations extension */ \
| (`EXT_C_ENABLED << 2) /* C - Compressed extension */ \
| (`EXT_D_ENABLED << 3) /* D - Double-precision floating-point extension */ \
| (0 << 4) /* E - RV32E base ISA */ \
| (`EXT_F_ENABLED << 5) /* F - Single-precision floating-point extension */ \
| (0 << 6) /* G - Additional standard extensions present */ \
| (0 << 7) /* H - Hypervisor mode implemented */ \
| (1 << 8) /* I - RV32I/64I/128I base ISA */ \
| (0 << 9) /* J - Reserved */ \
| (0 << 10) /* K - Reserved */ \
| (0 << 11) /* L - Tentatively reserved for Decimal Floating-Point extension */ \
| (`EXT_M_ENABLED << 12) /* M - Integer Multiply/Divide extension */ \
| (0 << 13) /* N - User level interrupts supported */ \
| (0 << 14) /* O - Reserved */ \
| (0 << 15) /* P - Tentatively reserved for Packed-SIMD extension */ \
| (0 << 16) /* Q - Quad-precision floating-point extension */ \
| (0 << 17) /* R - Reserved */ \
| (0 << 18) /* S - Supervisor mode implemented */ \
| (0 << 19) /* T - Tentatively reserved for Transactional Memory extension */ \
| (1 << 20) /* U - User mode implemented */ \
| (0 << 21) /* V - Tentatively reserved for Vector extension */ \
| (0 << 22) /* W - Reserved */ \
| (1 << 23) /* X - Non-standard extensions present */ \
| (0 << 24) /* Y - Reserved */ \
| (0 << 25) /* Z - Reserved */
// Device identification //////////////////////////////////////////////////////
`define VENDOR_ID 0
`define ARCHITECTURE_ID 0
`define IMPLEMENTATION_ID 0
`endif // VX_CONFIG_VH


@@ -1,156 +0,0 @@
`include "VX_define.vh"
module VX_core #(
parameter CORE_ID = 0
) (
`SCOPE_IO_VX_core
// Clock
input wire clk,
input wire reset,
// Memory request
output wire mem_req_valid,
output wire mem_req_rw,
output wire [`DCACHE_MEM_BYTEEN_WIDTH-1:0] mem_req_byteen,
output wire [`DCACHE_MEM_ADDR_WIDTH-1:0] mem_req_addr,
output wire [`DCACHE_MEM_DATA_WIDTH-1:0] mem_req_data,
output wire [`L1_MEM_TAG_WIDTH-1:0] mem_req_tag,
input wire mem_req_ready,
// Memory response
input wire mem_rsp_valid,
input wire [`DCACHE_MEM_DATA_WIDTH-1:0] mem_rsp_data,
input wire [`L1_MEM_TAG_WIDTH-1:0] mem_rsp_tag,
output wire mem_rsp_ready,
// Status
output wire busy
);
`ifdef PERF_ENABLE
VX_perf_memsys_if perf_memsys_if();
`endif
VX_mem_req_if #(
.DATA_WIDTH (`DCACHE_MEM_DATA_WIDTH),
.ADDR_WIDTH (`DCACHE_MEM_ADDR_WIDTH),
.TAG_WIDTH (`L1_MEM_TAG_WIDTH)
) mem_req_if();
VX_mem_rsp_if #(
.DATA_WIDTH (`DCACHE_MEM_DATA_WIDTH),
.TAG_WIDTH (`L1_MEM_TAG_WIDTH)
) mem_rsp_if();
assign mem_req_valid = mem_req_if.valid;
assign mem_req_rw = mem_req_if.rw;
assign mem_req_byteen= mem_req_if.byteen;
assign mem_req_addr = mem_req_if.addr;
assign mem_req_data = mem_req_if.data;
assign mem_req_tag = mem_req_if.tag;
assign mem_req_if.ready = mem_req_ready;
assign mem_rsp_if.valid = mem_rsp_valid;
assign mem_rsp_if.data = mem_rsp_data;
assign mem_rsp_if.tag = mem_rsp_tag;
assign mem_rsp_ready = mem_rsp_if.ready;
//--
VX_dcache_req_if #(
.NUM_REQS (`DCACHE_NUM_REQS),
.WORD_SIZE (`DCACHE_WORD_SIZE),
.TAG_WIDTH (`DCACHE_CORE_TAG_WIDTH)
) dcache_req_if();
VX_dcache_rsp_if #(
.NUM_REQS (`DCACHE_NUM_REQS),
.WORD_SIZE (`DCACHE_WORD_SIZE),
.TAG_WIDTH (`DCACHE_CORE_TAG_WIDTH)
) dcache_rsp_if();
VX_icache_req_if #(
.WORD_SIZE (`ICACHE_WORD_SIZE),
.TAG_WIDTH (`ICACHE_CORE_TAG_WIDTH)
) icache_req_if();
VX_icache_rsp_if #(
.WORD_SIZE (`ICACHE_WORD_SIZE),
.TAG_WIDTH (`ICACHE_CORE_TAG_WIDTH)
) icache_rsp_if();
VX_pipeline #(
.CORE_ID(CORE_ID)
) pipeline (
`SCOPE_BIND_VX_core_pipeline
`ifdef PERF_ENABLE
.perf_memsys_if (perf_memsys_if),
`endif
.clk(clk),
.reset(reset),
// Dcache core request
.dcache_req_valid (dcache_req_if.valid),
.dcache_req_rw (dcache_req_if.rw),
.dcache_req_byteen (dcache_req_if.byteen),
.dcache_req_addr (dcache_req_if.addr),
.dcache_req_data (dcache_req_if.data),
.dcache_req_tag (dcache_req_if.tag),
.dcache_req_ready (dcache_req_if.ready),
// Dcache core response
.dcache_rsp_valid (dcache_rsp_if.valid),
.dcache_rsp_tmask (dcache_rsp_if.tmask),
.dcache_rsp_data (dcache_rsp_if.data),
.dcache_rsp_tag (dcache_rsp_if.tag),
.dcache_rsp_ready (dcache_rsp_if.ready),
// Icache core request
.icache_req_valid (icache_req_if.valid),
.icache_req_addr (icache_req_if.addr),
.icache_req_tag (icache_req_if.tag),
.icache_req_ready (icache_req_if.ready),
// Icache core response
.icache_rsp_valid (icache_rsp_if.valid),
.icache_rsp_data (icache_rsp_if.data),
.icache_rsp_tag (icache_rsp_if.tag),
.icache_rsp_ready (icache_rsp_if.ready),
// Status
.busy(busy)
);
//--
VX_mem_unit #(
.CORE_ID(CORE_ID)
) mem_unit (
`SCOPE_BIND_VX_core_mem_unit
`ifdef PERF_ENABLE
.perf_memsys_if (perf_memsys_if),
`endif
.clk (clk),
.reset (reset),
// Core <-> Dcache
.dcache_req_if (dcache_req_if),
.dcache_rsp_if (dcache_rsp_if),
// Core <-> Icache
.icache_req_if (icache_req_if),
.icache_rsp_if (icache_rsp_if),
// Memory
.mem_req_if (mem_req_if),
.mem_rsp_if (mem_rsp_if)
);
endmodule

hw/rtl/VX_core_wrapper.sv Normal file

@@ -0,0 +1,654 @@
`include "VX_define.vh"
`include "VX_gpu_pkg.sv"
// TODO: move VX_define constants to parameters, and then parameterize in blackbox
module Vortex import VX_gpu_pkg::*; #(
parameter CORE_ID = 0,
parameter TENSOR_FP16 = 0,
parameter logic [63:0] STARTUP_ADDR = 64'h0000_0000_0001_0100,
parameter NUM_THREADS = 0,
parameter NUM_TENSOR_CORES = 1,
parameter TC_DATA_WIDTH = 256,
parameter TC_TAG_WIDTH = 4
) (
/* adapt to CoreIO bundle at src/main/scala/tile/Core.scala */
input clock,
input reset,
// input hartid,
input [31:0] reset_vector,
input interrupts_debug,
input interrupts_mtip,
input interrupts_msip,
input interrupts_meip,
input interrupts_seip,
// imem ------------------------------------------------
input imem_0_a_ready,
input imem_0_d_valid,
input [2:0] imem_0_d_bits_opcode,
input [3:0] imem_0_d_bits_size,
input [ICACHE_TAG_WIDTH-1:0] imem_0_d_bits_source,
input [31:0] imem_0_d_bits_data,
output imem_0_a_valid,
output [2:0] imem_0_a_bits_opcode,
output [3:0] imem_0_a_bits_size,
output [ICACHE_TAG_WIDTH-1:0] imem_0_a_bits_source,
output [31:0] imem_0_a_bits_address,
output [3:0] imem_0_a_bits_mask,
output [31:0] imem_0_a_bits_data,
output imem_0_d_ready,
// dmem ------------------------------------------------
input [DCACHE_NUM_REQS - 1:0] dmem_d_valid,
input [(DCACHE_NUM_REQS * 3) - 1:0] dmem_d_bits_opcode,
input [(DCACHE_NUM_REQS * 4) - 1:0] dmem_d_bits_size,
input [(DCACHE_NUM_REQS * DCACHE_NOSM_TAG_WIDTH) - 1:0] dmem_d_bits_source,
input [(DCACHE_NUM_REQS * 32) - 1:0] dmem_d_bits_data,
output [DCACHE_NUM_REQS - 1:0] dmem_d_ready,
input [DCACHE_NUM_REQS - 1:0] dmem_a_ready,
output [DCACHE_NUM_REQS - 1:0] dmem_a_valid,
output [(DCACHE_NUM_REQS * 3) - 1:0] dmem_a_bits_opcode,
output [(DCACHE_NUM_REQS * 4) - 1:0] dmem_a_bits_size,
output [(DCACHE_NUM_REQS * DCACHE_NOSM_TAG_WIDTH) - 1:0] dmem_a_bits_source,
output [(DCACHE_NUM_REQS * 32) - 1:0] dmem_a_bits_address,
output [(DCACHE_NUM_REQS * 4) - 1:0] dmem_a_bits_mask,
output [(DCACHE_NUM_REQS * 32) - 1:0] dmem_a_bits_data,
// smem ------------------------------------------------
input [DCACHE_NUM_REQS - 1:0] smem_d_valid,
input [(DCACHE_NUM_REQS * 3) - 1:0] smem_d_bits_opcode,
input [(DCACHE_NUM_REQS * 4) - 1:0] smem_d_bits_size,
input [(DCACHE_NUM_REQS * DCACHE_NOSM_TAG_WIDTH) - 1:0] smem_d_bits_source,
input [(DCACHE_NUM_REQS * 32) - 1:0] smem_d_bits_data,
output [DCACHE_NUM_REQS - 1:0] smem_d_ready,
input [DCACHE_NUM_REQS - 1:0] smem_a_ready,
output [DCACHE_NUM_REQS - 1:0] smem_a_valid,
output [(DCACHE_NUM_REQS * 3) - 1:0] smem_a_bits_opcode,
output [(DCACHE_NUM_REQS * 4) - 1:0] smem_a_bits_size,
output [(DCACHE_NUM_REQS * DCACHE_NOSM_TAG_WIDTH) - 1:0] smem_a_bits_source,
output [(DCACHE_NUM_REQS * 32) - 1:0] smem_a_bits_address,
output [(DCACHE_NUM_REQS * 4) - 1:0] smem_a_bits_mask,
output [(DCACHE_NUM_REQS * 32) - 1:0] smem_a_bits_data,
// tc --------------------------------------------------
input [NUM_TENSOR_CORES * 3 - 1:0] tc_a_ready,
output [NUM_TENSOR_CORES * 3 - 1:0] tc_a_valid,
output [NUM_TENSOR_CORES * 3 - 1:0] tc_a_bits_write,
output [NUM_TENSOR_CORES * 3 * 32 - 1:0] tc_a_bits_address,
output [NUM_TENSOR_CORES * 3 * TC_TAG_WIDTH - 1:0] tc_a_bits_tag,
output [NUM_TENSOR_CORES * 3 * 32 - 1:0] tc_a_bits_mask,
output [NUM_TENSOR_CORES * 3 * TC_DATA_WIDTH - 1:0] tc_a_bits_data,
output [NUM_TENSOR_CORES * 3 - 1:0] tc_d_ready,
input [NUM_TENSOR_CORES * 3 - 1:0] tc_d_valid,
input [NUM_TENSOR_CORES * 3 * TC_DATA_WIDTH - 1:0] tc_d_bits_data,
input [NUM_TENSOR_CORES * 3 * TC_TAG_WIDTH - 1:0] tc_d_bits_tag,
// shared tmem direct SRAM ports
output [NUM_TENSOR_CORES-1:0] tc_tmem_A_ren,
input [NUM_TENSOR_CORES-1:0] tc_tmem_A_rready,
output [NUM_TENSOR_CORES*9-1:0] tc_tmem_A_raddr,
input [NUM_TENSOR_CORES*`NUM_THREADS*`XLEN-1:0] tc_tmem_A_rdata,
output [NUM_TENSOR_CORES-1:0] tc_tmem_C_ren,
input [NUM_TENSOR_CORES-1:0] tc_tmem_C_rready,
output [NUM_TENSOR_CORES*9-1:0] tc_tmem_C_raddr,
input [NUM_TENSOR_CORES*`NUM_THREADS*`XLEN-1:0] tc_tmem_C_rdata,
output [NUM_TENSOR_CORES-1:0] tc_tmem_C_wen,
input [NUM_TENSOR_CORES-1:0] tc_tmem_C_wready,
output [NUM_TENSOR_CORES*9-1:0] tc_tmem_C_waddr,
output [NUM_TENSOR_CORES*`NUM_THREADS*`XLEN-1:0] tc_tmem_C_wdata,
output [NUM_TENSOR_CORES*`NUM_THREADS*`XLEN/8-1:0] tc_tmem_C_mask,
// gbar ------------------------------------------------
output gbar_req_valid,
output [`NB_WIDTH - 1:0] gbar_req_id,
output [`NC_WIDTH - 1:0] gbar_req_size_m1,
output [`NC_WIDTH - 1:0] gbar_req_core_id,
input gbar_req_ready,
input gbar_rsp_valid,
input [`NB_WIDTH - 1:0] gbar_rsp_id,
// fpu (unused) ----------------------------------------
//
// input fpu_fcsr_flags_valid,
// input [4:0] fpu_fcsr_flags_bits,
// // input [63:0] fpu_store_data,
// input [31:0] fpu_toint_data,
// input fpu_fcsr_rdy,
// input fpu_nack_mem,
// input fpu_illegal_rm,
// input fpu_dec_wen,
// input fpu_dec_ldst,
// input fpu_dec_ren1,
// input fpu_dec_ren2,
// input fpu_dec_ren3,
// input fpu_dec_swap12,
// input fpu_dec_swap23,
// input [1:0] fpu_dec_typeTagIn,
// input [1:0] fpu_dec_typeTagOut,
// input fpu_dec_fromint,
// input fpu_dec_toint,
// input fpu_dec_fastpipe,
// input fpu_dec_fma,
// input fpu_dec_div,
// input fpu_dec_sqrt,
// input fpu_dec_wflags,
// input fpu_sboard_set,
// input fpu_sboard_clr,
// input [4:0] fpu_sboard_clra,
// output fpu_hartid,
// output [31:0] fpu_time,
// output [31:0] fpu_inst,
// output [31:0] fpu_fromint_data,
// output [2:0] fpu_fcsr_rm,
// output fpu_dmem_resp_val,
// output [2:0] fpu_dmem_resp_type,
// output [4:0] fpu_dmem_resp_tag,
// output fpu_valid,
// output fpu_killx,
// output fpu_killm,
// output fpu_keep_clock_enabled,
// accelerator cisc csr --------------------------------
input wire [31:0] acc_read_in,
output wire [31:0] acc_write_out,
output wire acc_write_en,
input downstream_mem_busy,
output finished,
input traceStall,
output wfi
);
logic [3:0] intr_counter;
logic msip_1d, intr_reset;
logic busy;
reg busy_prev;
reg finished_reg;
assign intr_reset = |intr_counter;
/* busy and interrupts */
always @(posedge clock) begin
msip_1d <= interrupts_msip;
if (reset) begin
busy_prev <= 1'b0;
finished_reg <= 1'b0;
intr_counter <= 4'h8;
end else begin
// Vortex core's busy signal goes up some cycles after the reset,
// so we can't simply use ~busy as finished because of the initial
// ephemeral state. Instead detect the *negedge* of the busy
// signal and use that to indicate finish.
busy_prev <= busy;
if (busy_prev && !busy) begin
finished_reg <= 1'b1;
end
if (~msip_1d && interrupts_msip) begin
// rising edge
intr_counter <= 4'h7;
end else if (intr_counter <= 4'h7) begin
intr_counter <= intr_counter > 0 ? intr_counter - 4'h1 : 4'h0;
end
end
end
assign finished = finished_reg;
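// Sketch (not part of the original commit): a simulation-only check of the
// busy-negedge scheme above. It assumes `finished` is meant to stay high
// until the next reset; the label a_finished_sticky is hypothetical, and the
// simulator must support concurrent assertions.
`ifdef SIMULATION
a_finished_sticky: assert property (
@(posedge clock) disable iff (reset)
finished_reg |=> finished_reg
);
`endif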
assign wfi = 1'b0; // FIXME: unused
// ------------------------------------------------------------------------
// TL <-> Vortex core-cache interface adapter
// ------------------------------------------------------------------------
VX_mem_bus_if #(
.DATA_SIZE (ICACHE_WORD_SIZE),
.TAG_WIDTH (ICACHE_TAG_WIDTH)
) icache_bus_if();
// NOTE(hansung): need to use DCACHE_NOSM_TAG_WIDTH here instead of
// DCACHE_TAG_WIDTH; the latter is only used inside the core to
// differentiate between requests going to the cache vs. sharedmem.
// FIXME: DCACHE_NUM_REQS is assumed to be the same as NUM_LANES as of
// now.
VX_mem_bus_if #(
.DATA_SIZE (DCACHE_WORD_SIZE),
.TAG_WIDTH (DCACHE_NOSM_TAG_WIDTH)
) dcache_bus_if[DCACHE_NUM_REQS]();
VX_mem_bus_if #(
.DATA_SIZE (DCACHE_WORD_SIZE),
.TAG_WIDTH (DCACHE_NOSM_TAG_WIDTH)
) smem_bus_if[DCACHE_NUM_REQS]();
// always @(posedge clock) begin
// `ASSERT(DCACHE_NUM_REQS == NUM_THREADS, "DCACHE_NUM_REQS doesn't match NUM_THREADS");
// end
// imem -------------------------------------------------------------------
assign icache_bus_if.rsp_valid = imem_0_d_valid;
// TODO: hardcoded DCACHE_WORD_SIZE = 4
assign icache_bus_if.rsp_data.data = imem_0_d_bits_data;
assign icache_bus_if.rsp_data.tag = imem_0_d_bits_source[ICACHE_TAG_WIDTH-1:0];
assign imem_0_d_ready = icache_bus_if.rsp_ready;
// always @(posedge clock) begin
// if (icache_req_if.valid && icache_req_if.ready)
// icache_rsp_if.tag <= icache_req_if.tag;
// end
assign imem_0_a_bits_source = {32'b0, icache_bus_if.req_data.tag}[ICACHE_TAG_WIDTH-1:0];
assign imem_0_a_valid = icache_bus_if.req_valid;
assign imem_0_a_bits_address = {icache_bus_if.req_data.addr, 2'b0};
assign icache_bus_if.req_ready = imem_0_a_ready;
assign imem_0_a_bits_data = 32'd0;
assign imem_0_a_bits_mask = 4'hf;
// assign imem_0_a_bits_corrupt = 1'b0;
// assign imem_0_a_bits_param = 3'd0;
assign imem_0_a_bits_size = 4'd2; // 32b
assign imem_0_a_bits_opcode = 3'd4; // Get
// dmem -------------------------------------------------------------------
// Vortex core does not accept write acks; filter them out here
generate
for (genvar i = 0; i < DCACHE_NUM_REQS; i++) begin
assign dcache_bus_if[i].rsp_valid =
(dmem_d_valid[i] && (dmem_d_bits_opcode[i * 3 +: 3] !== 3'd0 /*AccessAck*/));
// Data and tag assignment for dcache
assign dcache_bus_if[i].rsp_data.data = dmem_d_bits_data[i * 32 +: 32];
assign dcache_bus_if[i].rsp_data.tag = dmem_d_bits_source[i * DCACHE_NOSM_TAG_WIDTH +: DCACHE_NOSM_TAG_WIDTH];
// Handling write ACKs, setting ready bit for dcache
assign dmem_d_ready[i] = dcache_bus_if[i].rsp_ready ||
(dmem_d_valid[i] && (dmem_d_bits_opcode[i * 3 +: 3] == 3'd0 /*AccessAck*/));
// Request validity and address/data/source assignment for dcache
assign dmem_a_valid[i] = dcache_bus_if[i].req_valid;
assign dmem_a_bits_address[i * 32 +: 32] = {dcache_bus_if[i].req_data.addr, 2'b0};
assign dmem_a_bits_data[i * 32 +: 32] = dcache_bus_if[i].req_data.data;
assign dmem_a_bits_source[i * DCACHE_NOSM_TAG_WIDTH +: DCACHE_NOSM_TAG_WIDTH] = dcache_bus_if[i].req_data.tag;
// Opcode, size, and mask assignment for dcache
assign dmem_a_bits_opcode[i * 3 +: 3] =
dcache_bus_if[i].req_data.rw ?
(&dcache_bus_if[i].req_data.byteen ? 3'd0 /*PutFull*/ : 3'd1 /*PutPartial*/)
: 3'd4 /*Get*/;
assign dmem_a_bits_size[i * 4 +: 4] = 4'd2; // Fixed size
assign dmem_a_bits_mask[i * 4 +: 4] = dcache_bus_if[i].req_data.byteen;
// Setting request ready signal for dcache
assign dcache_bus_if[i].req_ready = dmem_a_ready[i];
// Data and tag assignment for smem
assign smem_bus_if[i].rsp_valid =
(smem_d_valid[i] && (smem_d_bits_opcode[i * 3 +: 3] !== 3'd0 /*AccessAck*/));
assign smem_bus_if[i].rsp_data.data = smem_d_bits_data[i * 32 +: 32];
assign smem_bus_if[i].rsp_data.tag = smem_d_bits_source[i * DCACHE_NOSM_TAG_WIDTH +: DCACHE_NOSM_TAG_WIDTH];
// Handling write ACKs, setting ready bit for smem
assign smem_d_ready[i] = smem_bus_if[i].rsp_ready ||
(smem_d_valid[i] && (smem_d_bits_opcode[i * 3 +: 3] == 3'd0 /*AccessAck*/));
// Request validity and address/data/source assignment for smem
assign smem_a_valid[i] = smem_bus_if[i].req_valid;
assign smem_a_bits_address[i * 32 +: 32] = {smem_bus_if[i].req_data.addr, 2'b0};
assign smem_a_bits_data[i * 32 +: 32] = smem_bus_if[i].req_data.data;
assign smem_a_bits_source[i * DCACHE_NOSM_TAG_WIDTH +: DCACHE_NOSM_TAG_WIDTH] = smem_bus_if[i].req_data.tag;
// Opcode, size, and mask assignment for smem
assign smem_a_bits_opcode[i * 3 +: 3] =
smem_bus_if[i].req_data.rw ?
(&smem_bus_if[i].req_data.byteen ? 3'd0 /*PutFull*/ : 3'd1 /*PutPartial*/)
: 3'd4 /*Get*/;
assign smem_a_bits_size[i * 4 +: 4] = 4'd2; // Fixed size
assign smem_a_bits_mask[i * 4 +: 4] = smem_bus_if[i].req_data.byteen;
// Setting request ready signal for smem
assign smem_bus_if[i].req_ready = smem_a_ready[i];
end
endgenerate
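// Sketch (not part of the original commit): the TileLink opcodes written as
// literals in the generate block above could be given names. The values are
// the standard TileLink A/D channel encodings; the localparam names are
// hypothetical and nothing in this file references them, so lint may flag
// them as unused.
localparam logic [2:0] TL_A_PUT_FULL_DATA = 3'd0;
localparam logic [2:0] TL_A_PUT_PARTIAL_DATA = 3'd1;
localparam logic [2:0] TL_A_GET = 3'd4;
localparam logic [2:0] TL_D_ACCESS_ACK = 3'd0; // write ack, carries no data
localparam logic [2:0] TL_D_ACCESS_ACK_DATA = 3'd1; // read response with data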
// tc ---------------------------------------------------------------------
VX_tc_bus_if #(.TAG_WIDTH(TC_TAG_WIDTH)) tc_p0_bus_if[NUM_TENSOR_CORES]();
VX_tc_bus_if #(.TAG_WIDTH(TC_TAG_WIDTH)) tc_p2_bus_if[NUM_TENSOR_CORES]();
for (genvar tc = 0; tc < NUM_TENSOR_CORES; ++tc) begin : g_tc_ports
localparam P0 = tc * 3;
localparam P1 = tc * 3 + 1;
localparam P2 = tc * 3 + 2;
assign tc_a_valid[P0] = tc_p0_bus_if[tc].req_valid;
assign tc_a_valid[P1] = 1'b0;
assign tc_a_valid[P2] = tc_p2_bus_if[tc].req_valid;
assign tc_a_bits_write[P0] = tc_p0_bus_if[tc].req_data.rw;
assign tc_a_bits_write[P1] = 1'b0;
assign tc_a_bits_write[P2] = tc_p2_bus_if[tc].req_data.rw;
assign tc_a_bits_address[P0 * 32 +: 32] = tc_p0_bus_if[tc].req_data.addr;
assign tc_a_bits_address[P1 * 32 +: 32] = 32'b0;
assign tc_a_bits_address[P2 * 32 +: 32] = tc_p2_bus_if[tc].req_data.addr;
assign tc_a_bits_tag[P0 * TC_TAG_WIDTH +: TC_TAG_WIDTH] = tc_p0_bus_if[tc].req_data.tag;
assign tc_a_bits_tag[P1 * TC_TAG_WIDTH +: TC_TAG_WIDTH] = '0;
assign tc_a_bits_tag[P2 * TC_TAG_WIDTH +: TC_TAG_WIDTH] = tc_p2_bus_if[tc].req_data.tag;
assign tc_a_bits_mask[P0 * 32 +: 32] = tc_p0_bus_if[tc].req_data.byteen;
assign tc_a_bits_mask[P1 * 32 +: 32] = '0;
assign tc_a_bits_mask[P2 * 32 +: 32] = tc_p2_bus_if[tc].req_data.byteen;
assign tc_a_bits_data[P0 * TC_DATA_WIDTH +: TC_DATA_WIDTH] = tc_p0_bus_if[tc].req_data.data;
assign tc_a_bits_data[P1 * TC_DATA_WIDTH +: TC_DATA_WIDTH] = '0;
assign tc_a_bits_data[P2 * TC_DATA_WIDTH +: TC_DATA_WIDTH] = tc_p2_bus_if[tc].req_data.data;
assign tc_p0_bus_if[tc].req_ready = tc_a_ready[P0];
assign tc_p0_bus_if[tc].rsp_valid = tc_d_valid[P0];
assign tc_p0_bus_if[tc].rsp_data.data = tc_d_bits_data[P0 * TC_DATA_WIDTH +: TC_DATA_WIDTH];
assign tc_p0_bus_if[tc].rsp_data.tag = tc_d_bits_tag[P0 * TC_TAG_WIDTH +: TC_TAG_WIDTH];
assign tc_p2_bus_if[tc].req_ready = tc_a_ready[P2];
assign tc_p2_bus_if[tc].rsp_valid = tc_d_valid[P2];
assign tc_p2_bus_if[tc].rsp_data.data = tc_d_bits_data[P2 * TC_DATA_WIDTH +: TC_DATA_WIDTH];
assign tc_p2_bus_if[tc].rsp_data.tag = tc_d_bits_tag[P2 * TC_TAG_WIDTH +: TC_TAG_WIDTH];
assign tc_d_ready[P0] = tc_p0_bus_if[tc].rsp_ready;
assign tc_d_ready[P1] = 1'b0;
assign tc_d_ready[P2] = tc_p2_bus_if[tc].rsp_ready;
end
// gbar -------------------------------------------------------------------
`ifdef GBAR_ENABLE
VX_gbar_bus_if gbar_bus_if();
assign gbar_req_valid = gbar_bus_if.req_valid;
assign gbar_req_id = gbar_bus_if.req_id;
assign gbar_req_size_m1 = gbar_bus_if.req_size_m1;
assign gbar_req_core_id = gbar_bus_if.req_core_id;
assign gbar_bus_if.req_ready = gbar_req_ready;
assign gbar_bus_if.rsp_valid = gbar_rsp_valid;
assign gbar_bus_if.rsp_id = gbar_rsp_id;
`endif
// fpu --------------------------------------------------------------------
// assign {fpu_hartid, fpu_time, fpu_inst, fpu_fromint_data, fpu_fcsr_rm, fpu_dmem_resp_val, fpu_dmem_resp_type,
// fpu_dmem_resp_tag, fpu_valid, fpu_killx, fpu_killm, fpu_keep_clock_enabled} = '0;
logic sim_ebreak;
logic [`NUM_REGS-1:0][`XLEN-1:0] sim_wb_value;
logic [3:0] reset_start_counter;
logic core_reset;
always @(posedge clock) begin
if (reset) begin
reset_start_counter <= 4'ha;
end else begin
if (reset_start_counter > 4'h0) begin
reset_start_counter <= reset_start_counter - 4'h1;
end
end
end
// Delay reset signal by a few cycles to make time for resetting the DCR
// (device configuration registers).
assign core_reset = reset || (reset_start_counter != 4'h0); // || intr_reset;
// A small FSM that tries to set DCR "properly" in the same order as
// defined in VX_types.vh.
//
// DCR is a device configuration register that holds (among other things)
// the startup address for the kernel, nominally set to 0x80000000. This
// wrapper writes the STARTUP_ADDR parameter instead.
// TODO: Original Vortex code buffers dcr_bus by one cycle when
// SOCKET_SIZE > 1, as below. Might want to check if we need to do the
// same
// `BUFFER_DCR_BUS_IF (core_dcr_bus_if, dcr_bus_if, (`SOCKET_SIZE > 1));
logic [`VX_DCR_ADDR_BITS-1:0] dcr_state;
logic [`VX_DCR_ADDR_BITS-1:0] dcr_state_n;
logic dcr_write_valid;
logic [`VX_DCR_ADDR_WIDTH-1:0] dcr_write_addr;
logic [`VX_DCR_DATA_WIDTH-1:0] dcr_write_data;
always @(posedge clock) begin
if (reset) begin
dcr_state <= `VX_DCR_ADDR_BITS'h000;
end else begin
dcr_state <= dcr_state_n;
end
end
always @(*) begin
dcr_state_n = dcr_state;
dcr_write_valid = 1'b0;
dcr_write_addr = `VX_DCR_ADDR_WIDTH'b0;
dcr_write_data = `VX_DCR_DATA_WIDTH'b0;
case (dcr_state)
`VX_DCR_ADDR_BITS'h000: begin
dcr_state_n = `VX_DCR_BASE_STATE_BEGIN;
end
`VX_DCR_BASE_STATE_BEGIN: begin
dcr_state_n = `VX_DCR_BASE_STARTUP_ADDR1;
dcr_write_valid = 1'b1;
dcr_write_addr = `VX_DCR_BASE_STARTUP_ADDR0;
dcr_write_data = STARTUP_ADDR[31:0];
end
`VX_DCR_BASE_STARTUP_ADDR1: begin
dcr_state_n = `VX_DCR_BASE_MPM_CLASS;
dcr_write_valid = 1'b1;
dcr_write_addr = `VX_DCR_BASE_STARTUP_ADDR1;
dcr_write_data = STARTUP_ADDR[63:32];
end
`VX_DCR_BASE_MPM_CLASS: begin
dcr_state_n = `VX_DCR_BASE_STATE_END;
dcr_write_valid = 1'b1;
dcr_write_addr = `VX_DCR_BASE_MPM_CLASS;
dcr_write_data = `VX_DCR_DATA_WIDTH'h0;
end
`VX_DCR_BASE_STATE_END: begin
dcr_state_n = dcr_state;
dcr_write_valid = 1'b0;
end
endcase
end
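// Sketch (not part of the original commit): a simulation-only check that
// every DCR write issued by the FSM above happens while core_reset is still
// asserted. This assumes the intent of the delayed reset is for the DCR
// writes to complete before the core leaves reset; the label
// a_dcr_write_in_reset is hypothetical.
`ifdef SIMULATION
a_dcr_write_in_reset: assert property (
@(posedge clock) disable iff (reset)
dcr_write_valid |-> core_reset
);
`endif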
VX_dcr_bus_if dcr_bus_if();
assign dcr_bus_if.write_valid = dcr_write_valid;
assign dcr_bus_if.write_addr = dcr_write_addr;
assign dcr_bus_if.write_data = dcr_write_data;
VX_mem_perf_if mem_perf_if();
// TODO: SCOPE_IO_BIND should be socket id
VX_core #(
.CORE_ID (CORE_ID),
.TENSOR_FP16 (TENSOR_FP16),
.NUM_TENSOR_CORES (NUM_TENSOR_CORES)
) core (
`SCOPE_IO_BIND (0)
.clk (clock),
.reset (core_reset),
`ifdef PERF_ENABLE
// NOTE unused
.mem_perf_if (mem_perf_if),
`endif
.dcr_bus_if (dcr_bus_if),
.smem_bus_if (smem_bus_if),
.dcache_bus_if (dcache_bus_if),
.icache_bus_if (icache_bus_if),
`ifdef GBAR_ENABLE
.gbar_bus_if (gbar_bus_if),
`endif
.tensor_smem_A_if (tc_p0_bus_if),
`ifdef EXT_T_BLACKWELL
.tensor_tmem_A_ren(tc_tmem_A_ren),
.tensor_tmem_A_rready(tc_tmem_A_rready),
.tensor_tmem_A_raddr(tc_tmem_A_raddr),
.tensor_tmem_A_rdata(tc_tmem_A_rdata),
.tensor_tmem_C_ren(tc_tmem_C_ren),
.tensor_tmem_C_rready(tc_tmem_C_rready),
.tensor_tmem_C_raddr(tc_tmem_C_raddr),
.tensor_tmem_C_rdata(tc_tmem_C_rdata),
.tensor_tmem_C_wen(tc_tmem_C_wen),
.tensor_tmem_C_wready(tc_tmem_C_wready),
.tensor_tmem_C_waddr(tc_tmem_C_waddr),
.tensor_tmem_C_wdata(tc_tmem_C_wdata),
.tensor_tmem_C_mask(tc_tmem_C_mask),
.tensor_smem_B_if (tc_p2_bus_if),
`else
.tensor_tmem_A_ren(tc_tmem_A_ren),
.tensor_tmem_A_rready(tc_tmem_A_rready),
.tensor_tmem_A_raddr(tc_tmem_A_raddr),
.tensor_tmem_A_rdata(tc_tmem_A_rdata),
.tensor_tmem_C_ren(tc_tmem_C_ren),
.tensor_tmem_C_rready(tc_tmem_C_rready),
.tensor_tmem_C_raddr(tc_tmem_C_raddr),
.tensor_tmem_C_rdata(tc_tmem_C_rdata),
.tensor_tmem_C_wen(tc_tmem_C_wen),
.tensor_tmem_C_wready(tc_tmem_C_wready),
.tensor_tmem_C_waddr(tc_tmem_C_waddr),
.tensor_tmem_C_wdata(tc_tmem_C_wdata),
.tensor_tmem_C_mask(tc_tmem_C_mask),
.tensor_smem_B_if (tc_p2_bus_if),
`endif
.sim_ebreak (sim_ebreak),
.sim_wb_value (sim_wb_value),
.busy (busy),
.downstream_mem_busy(downstream_mem_busy),
.acc_read_in (acc_read_in),
.acc_write_out (acc_write_out),
.acc_write_en (acc_write_en)
);
// VX_dcache_req_if #(
// .NUM_REQS (`DCACHE_NUM_REQS),
// .WORD_SIZE (`DCACHE_WORD_SIZE),
// .TAG_WIDTH (`DCACHE_CORE_TAG_WIDTH)
// ) dcache_req_if();
// VX_dcache_rsp_if #(
// .NUM_REQS (`DCACHE_NUM_REQS),
// .WORD_SIZE (`DCACHE_WORD_SIZE),
// .TAG_WIDTH (`DCACHE_CORE_TAG_WIDTH)
// ) dcache_rsp_if();
//
// VX_icache_req_if #(
// .WORD_SIZE (`ICACHE_WORD_SIZE),
// .TAG_WIDTH (`ICACHE_CORE_TAG_WIDTH)
// ) icache_req_if();
// VX_icache_rsp_if #(
// .WORD_SIZE (`ICACHE_WORD_SIZE),
// .TAG_WIDTH (`ICACHE_CORE_TAG_WIDTH)
// ) icache_rsp_if();
// VX_pipeline #(
// .CORE_ID(CORE_ID)
// ) pipeline (
// `SCOPE_BIND_VX_core_pipeline
// `ifdef PERF_ENABLE
// .perf_memsys_if (perf_memsys_if),
// `endif
// .clk(clock),
// .reset(reset || intr_reset),
// .irq(1'b0/*intr_reset*/),
// // Dcache core request
// .dcache_req_valid (dcache_req_if.valid),
// .dcache_req_rw (dcache_req_if.rw),
// .dcache_req_byteen (dcache_req_if.byteen),
// .dcache_req_addr (dcache_req_if.addr),
// .dcache_req_data (dcache_req_if.data),
// .dcache_req_tag (dcache_req_if.tag),
// .dcache_req_ready (dcache_req_if.ready),
// // Dcache core response
// .dcache_rsp_valid (dcache_rsp_if.valid),
// .dcache_rsp_tmask (dcache_rsp_if.tmask),
// .dcache_rsp_data (dcache_rsp_if.data),
// .dcache_rsp_tag (dcache_rsp_if.tag),
// .dcache_rsp_ready (dcache_rsp_if.ready),
// // Icache core request
// .icache_req_valid (icache_req_if.valid),
// .icache_req_addr (icache_req_if.addr),
// .icache_req_tag (icache_req_if.tag),
// .icache_req_ready (icache_req_if.ready),
// // Icache core response
// .icache_rsp_valid (icache_rsp_if.valid),
// .icache_rsp_data (icache_rsp_if.data),
// .icache_rsp_tag (icache_rsp_if.tag),
// .icache_rsp_ready (icache_rsp_if.ready),
// // Status
// .busy(busy)
// );
logic [31:0] finish_counter;
always @(posedge clock) begin
if (reset) begin
finish_counter <= 32'd0;
end else begin
if (finished) begin
finish_counter <= finish_counter + 32'd1;
end
end
end
// give slack for other cores to finish
wire all_cores_finished = (finish_counter > 32'd10000);
`ifdef SIMULATION
always @(posedge clock) begin
if (!reset) begin
if ((CORE_ID == '0) && all_cores_finished) begin
$display("simulation has probably ended. exiting");
$finish();
end
if (busy_prev && !busy) begin
$display("---------------- core%2d has no more active warps ----------------", CORE_ID);
// TODO: lane assumed to be 4
// `ifndef SYNTHESIS
// for (integer j = 0; j < `NUM_WARPS; j++) begin
// $display("warp %2d", j);
// for (integer k = 0; k < `NUM_REGS; k += 1)
// $display("x%2d: %08x %08x %08x %08x", k,
// pipeline.issue.gpr_stage.iports[/*thread*/0].dp_ram1.not_out_reg.reg_dump.ram[j * `NUM_REGS + k],
// pipeline.issue.gpr_stage.iports[/*thread*/1].dp_ram1.not_out_reg.reg_dump.ram[j * `NUM_REGS + k],
// pipeline.issue.gpr_stage.iports[/*thread*/2].dp_ram1.not_out_reg.reg_dump.ram[j * `NUM_REGS + k],
// pipeline.issue.gpr_stage.iports[/*thread*/3].dp_ram1.not_out_reg.reg_dump.ram[j * `NUM_REGS + k]);
// end
// `endif
// @(posedge clock) $finish();
end
end
end
`endif
endmodule : Vortex


@@ -1,265 +0,0 @@
`include "VX_define.vh"
module VX_csr_data #(
parameter CORE_ID = 0
) (
input wire clk,
input wire reset,
`ifdef PERF_ENABLE
`ifdef EXT_TEX_ENABLE
VX_perf_tex_if.slave perf_tex_if,
`endif
VX_perf_memsys_if.slave perf_memsys_if,
VX_perf_pipeline_if.slave perf_pipeline_if,
`endif
VX_cmt_to_csr_if.slave cmt_to_csr_if,
VX_fetch_to_csr_if.slave fetch_to_csr_if,
`ifdef EXT_F_ENABLE
VX_fpu_to_csr_if.slave fpu_to_csr_if,
`endif
`ifdef EXT_TEX_ENABLE
VX_tex_csr_if.master tex_csr_if,
`endif
input wire read_enable,
input wire [`UUID_BITS-1:0] read_uuid,
input wire[`CSR_ADDR_BITS-1:0] read_addr,
input wire[`NW_BITS-1:0] read_wid,
output wire[31:0] read_data,
input wire write_enable,
input wire [`UUID_BITS-1:0] write_uuid,
input wire[`CSR_ADDR_BITS-1:0] write_addr,
input wire[`NW_BITS-1:0] write_wid,
input wire[31:0] write_data,
input wire busy
);
import fpu_types::*;
reg [`CSR_WIDTH-1:0] csr_satp;
reg [`CSR_WIDTH-1:0] csr_mstatus;
reg [`CSR_WIDTH-1:0] csr_medeleg;
reg [`CSR_WIDTH-1:0] csr_mideleg;
reg [`CSR_WIDTH-1:0] csr_mie;
reg [`CSR_WIDTH-1:0] csr_mtvec;
reg [`CSR_WIDTH-1:0] csr_mepc;
reg [`CSR_WIDTH-1:0] csr_pmpcfg [0:0];
reg [`CSR_WIDTH-1:0] csr_pmpaddr [0:0];
reg [63:0] csr_cycle;
reg [63:0] csr_instret;
reg [`NUM_WARPS-1:0][`INST_FRM_BITS+`FFLAGS_BITS-1:0] fcsr;
always @(posedge clk) begin
if (reset) begin
fcsr <= '0;
end else begin
`ifdef EXT_F_ENABLE
if (fpu_to_csr_if.write_enable) begin
fcsr[fpu_to_csr_if.write_wid][`FFLAGS_BITS-1:0] <= fcsr[fpu_to_csr_if.write_wid][`FFLAGS_BITS-1:0]
| fpu_to_csr_if.write_fflags;
end
`endif
if (write_enable) begin
case (write_addr)
`CSR_FFLAGS: fcsr[write_wid][`FFLAGS_BITS-1:0] <= write_data[`FFLAGS_BITS-1:0];
`CSR_FRM: fcsr[write_wid][`INST_FRM_BITS+`FFLAGS_BITS-1:`FFLAGS_BITS] <= write_data[`INST_FRM_BITS-1:0];
`CSR_FCSR: fcsr[write_wid] <= write_data[`FFLAGS_BITS+`INST_FRM_BITS-1:0];
`CSR_SATP: csr_satp <= write_data[`CSR_WIDTH-1:0];
`CSR_MSTATUS: csr_mstatus <= write_data[`CSR_WIDTH-1:0];
`CSR_MEDELEG: csr_medeleg <= write_data[`CSR_WIDTH-1:0];
`CSR_MIDELEG: csr_mideleg <= write_data[`CSR_WIDTH-1:0];
`CSR_MIE: csr_mie <= write_data[`CSR_WIDTH-1:0];
`CSR_MTVEC: csr_mtvec <= write_data[`CSR_WIDTH-1:0];
`CSR_MEPC: csr_mepc <= write_data[`CSR_WIDTH-1:0];
`CSR_PMPCFG0: csr_pmpcfg[0] <= write_data[`CSR_WIDTH-1:0];
`CSR_PMPADDR0: csr_pmpaddr[0] <= write_data[`CSR_WIDTH-1:0];
default: begin
`ifdef EXT_TEX_ENABLE
`ASSERT((write_addr == `CSR_TEX_UNIT)
|| (write_addr >= `CSR_TEX_STATE_BEGIN
&& write_addr < `CSR_TEX_STATE_END),
("%t: *** invalid CSR write address: %0h (#%0d)", $time, write_addr, write_uuid));
`else
`ASSERT(~write_enable, ("%t: *** invalid CSR write address: %0h (#%0d)", $time, write_addr, write_uuid));
`endif
end
endcase
end
end
end
`UNUSED_VAR (write_data)
// TEX CSRs
`ifdef EXT_TEX_ENABLE
assign tex_csr_if.write_enable = write_enable;
assign tex_csr_if.write_addr = write_addr;
assign tex_csr_if.write_data = write_data;
assign tex_csr_if.write_uuid = write_uuid;
`endif
always @(posedge clk) begin
if (reset) begin
csr_cycle <= 0;
csr_instret <= 0;
end else begin
if (busy) begin
csr_cycle <= csr_cycle + 1;
end
if (cmt_to_csr_if.valid) begin
csr_instret <= csr_instret + 64'(cmt_to_csr_if.commit_size);
end
end
end
reg [31:0] read_data_r;
reg read_addr_valid_r;
always @(*) begin
read_data_r = 'x;
read_addr_valid_r = 1;
case (read_addr)
`CSR_FFLAGS : read_data_r = 32'(fcsr[read_wid][`FFLAGS_BITS-1:0]);
`CSR_FRM : read_data_r = 32'(fcsr[read_wid][`INST_FRM_BITS+`FFLAGS_BITS-1:`FFLAGS_BITS]);
`CSR_FCSR : read_data_r = 32'(fcsr[read_wid]);
`CSR_WTID ,
`CSR_LTID ,
`CSR_LWID : read_data_r = 32'(read_wid);
`CSR_GTID ,
/*`CSR_MHARTID ,*/
`CSR_GWID : read_data_r = CORE_ID * `NUM_WARPS + 32'(read_wid);
`CSR_GCID : read_data_r = CORE_ID;
`CSR_TMASK : read_data_r = 32'(fetch_to_csr_if.thread_masks[read_wid]);
`CSR_NT : read_data_r = `NUM_THREADS;
`CSR_NW : read_data_r = `NUM_WARPS;
`CSR_NC : read_data_r = `NUM_CORES * `NUM_CLUSTERS;
`CSR_MCYCLE : read_data_r = csr_cycle[31:0];
`CSR_MCYCLE_H : read_data_r = 32'(csr_cycle[`PERF_CTR_BITS-1:32]);
`CSR_MINSTRET : read_data_r = csr_instret[31:0];
`CSR_MINSTRET_H : read_data_r = 32'(csr_instret[`PERF_CTR_BITS-1:32]);
`ifdef PERF_ENABLE
// PERF: pipeline
`CSR_MPM_IBUF_ST : read_data_r = perf_pipeline_if.ibf_stalls[31:0];
`CSR_MPM_IBUF_ST_H : read_data_r = 32'(perf_pipeline_if.ibf_stalls[`PERF_CTR_BITS-1:32]);
`CSR_MPM_SCRB_ST : read_data_r = perf_pipeline_if.scb_stalls[31:0];
`CSR_MPM_SCRB_ST_H : read_data_r = 32'(perf_pipeline_if.scb_stalls[`PERF_CTR_BITS-1:32]);
`CSR_MPM_ALU_ST : read_data_r = perf_pipeline_if.alu_stalls[31:0];
`CSR_MPM_ALU_ST_H : read_data_r = 32'(perf_pipeline_if.alu_stalls[`PERF_CTR_BITS-1:32]);
`CSR_MPM_LSU_ST : read_data_r = perf_pipeline_if.lsu_stalls[31:0];
`CSR_MPM_LSU_ST_H : read_data_r = 32'(perf_pipeline_if.lsu_stalls[`PERF_CTR_BITS-1:32]);
`CSR_MPM_CSR_ST : read_data_r = perf_pipeline_if.csr_stalls[31:0];
`CSR_MPM_CSR_ST_H : read_data_r = 32'(perf_pipeline_if.csr_stalls[`PERF_CTR_BITS-1:32]);
`ifdef EXT_F_ENABLE
`CSR_MPM_FPU_ST : read_data_r = perf_pipeline_if.fpu_stalls[31:0];
`CSR_MPM_FPU_ST_H : read_data_r = 32'(perf_pipeline_if.fpu_stalls[`PERF_CTR_BITS-1:32]);
`else
`CSR_MPM_FPU_ST : read_data_r = '0;
`CSR_MPM_FPU_ST_H : read_data_r = '0;
`endif
`CSR_MPM_GPU_ST : read_data_r = perf_pipeline_if.gpu_stalls[31:0];
`CSR_MPM_GPU_ST_H : read_data_r = 32'(perf_pipeline_if.gpu_stalls[`PERF_CTR_BITS-1:32]);
// PERF: decode
`CSR_MPM_LOADS : read_data_r = perf_pipeline_if.loads[31:0];
`CSR_MPM_LOADS_H : read_data_r = 32'(perf_pipeline_if.loads[`PERF_CTR_BITS-1:32]);
`CSR_MPM_STORES : read_data_r = perf_pipeline_if.stores[31:0];
`CSR_MPM_STORES_H : read_data_r = 32'(perf_pipeline_if.stores[`PERF_CTR_BITS-1:32]);
`CSR_MPM_BRANCHES : read_data_r = perf_pipeline_if.branches[31:0];
`CSR_MPM_BRANCHES_H : read_data_r = 32'(perf_pipeline_if.branches[`PERF_CTR_BITS-1:32]);
// PERF: icache
`CSR_MPM_ICACHE_READS : read_data_r = perf_memsys_if.icache_reads[31:0];
`CSR_MPM_ICACHE_READS_H : read_data_r = 32'(perf_memsys_if.icache_reads[`PERF_CTR_BITS-1:32]);
`CSR_MPM_ICACHE_MISS_R : read_data_r = perf_memsys_if.icache_read_misses[31:0];
`CSR_MPM_ICACHE_MISS_R_H : read_data_r = 32'(perf_memsys_if.icache_read_misses[`PERF_CTR_BITS-1:32]);
// PERF: dcache
`CSR_MPM_DCACHE_READS : read_data_r = perf_memsys_if.dcache_reads[31:0];
`CSR_MPM_DCACHE_READS_H : read_data_r = 32'(perf_memsys_if.dcache_reads[`PERF_CTR_BITS-1:32]);
`CSR_MPM_DCACHE_WRITES : read_data_r = perf_memsys_if.dcache_writes[31:0];
`CSR_MPM_DCACHE_WRITES_H : read_data_r = 32'(perf_memsys_if.dcache_writes[`PERF_CTR_BITS-1:32]);
`CSR_MPM_DCACHE_MISS_R : read_data_r = perf_memsys_if.dcache_read_misses[31:0];
`CSR_MPM_DCACHE_MISS_R_H : read_data_r = 32'(perf_memsys_if.dcache_read_misses[`PERF_CTR_BITS-1:32]);
`CSR_MPM_DCACHE_MISS_W : read_data_r = perf_memsys_if.dcache_write_misses[31:0];
`CSR_MPM_DCACHE_MISS_W_H : read_data_r = 32'(perf_memsys_if.dcache_write_misses[`PERF_CTR_BITS-1:32]);
`CSR_MPM_DCACHE_BANK_ST : read_data_r = perf_memsys_if.dcache_bank_stalls[31:0];
`CSR_MPM_DCACHE_BANK_ST_H : read_data_r = 32'(perf_memsys_if.dcache_bank_stalls[`PERF_CTR_BITS-1:32]);
`CSR_MPM_DCACHE_MSHR_ST : read_data_r = perf_memsys_if.dcache_mshr_stalls[31:0];
`CSR_MPM_DCACHE_MSHR_ST_H : read_data_r = 32'(perf_memsys_if.dcache_mshr_stalls[`PERF_CTR_BITS-1:32]);
// PERF: smem
`CSR_MPM_SMEM_READS : read_data_r = perf_memsys_if.smem_reads[31:0];
`CSR_MPM_SMEM_READS_H : read_data_r = 32'(perf_memsys_if.smem_reads[`PERF_CTR_BITS-1:32]);
`CSR_MPM_SMEM_WRITES : read_data_r = perf_memsys_if.smem_writes[31:0];
`CSR_MPM_SMEM_WRITES_H : read_data_r = 32'(perf_memsys_if.smem_writes[`PERF_CTR_BITS-1:32]);
`CSR_MPM_SMEM_BANK_ST : read_data_r = perf_memsys_if.smem_bank_stalls[31:0];
`CSR_MPM_SMEM_BANK_ST_H : read_data_r = 32'(perf_memsys_if.smem_bank_stalls[`PERF_CTR_BITS-1:32]);
// PERF: memory
`CSR_MPM_MEM_READS : read_data_r = perf_memsys_if.mem_reads[31:0];
`CSR_MPM_MEM_READS_H : read_data_r = 32'(perf_memsys_if.mem_reads[`PERF_CTR_BITS-1:32]);
`CSR_MPM_MEM_WRITES : read_data_r = perf_memsys_if.mem_writes[31:0];
`CSR_MPM_MEM_WRITES_H : read_data_r = 32'(perf_memsys_if.mem_writes[`PERF_CTR_BITS-1:32]);
`CSR_MPM_MEM_LAT : read_data_r = perf_memsys_if.mem_latency[31:0];
`CSR_MPM_MEM_LAT_H : read_data_r = 32'(perf_memsys_if.mem_latency[`PERF_CTR_BITS-1:32]);
`ifdef EXT_TEX_ENABLE
// PERF: texunit
`CSR_MPM_TEX_READS : read_data_r = perf_tex_if.mem_reads[31:0];
`CSR_MPM_TEX_READS_H : read_data_r = 32'(perf_tex_if.mem_reads[`PERF_CTR_BITS-1:32]);
`CSR_MPM_TEX_LAT : read_data_r = perf_tex_if.mem_latency[31:0];
`CSR_MPM_TEX_LAT_H : read_data_r = 32'(perf_tex_if.mem_latency[`PERF_CTR_BITS-1:32]);
`endif
// PERF: reserved
`CSR_MPM_RESERVED : read_data_r = '0;
`CSR_MPM_RESERVED_H : read_data_r = '0;
`endif
`CSR_SATP : read_data_r = 32'(csr_satp);
`CSR_MSTATUS : read_data_r = 32'(csr_mstatus);
`CSR_MISA : read_data_r = `ISA_CODE;
`CSR_MEDELEG : read_data_r = 32'(csr_medeleg);
`CSR_MIDELEG : read_data_r = 32'(csr_mideleg);
`CSR_MIE : read_data_r = 32'(csr_mie);
`CSR_MTVEC : read_data_r = 32'(csr_mtvec);
`CSR_MEPC : read_data_r = 32'(csr_mepc);
`CSR_PMPCFG0 : read_data_r = 32'(csr_pmpcfg[0]);
`CSR_PMPADDR0 : read_data_r = 32'(csr_pmpaddr[0]);
`CSR_MVENDORID : read_data_r = `VENDOR_ID;
`CSR_MARCHID : read_data_r = `ARCHITECTURE_ID;
`CSR_MIMPID : read_data_r = `IMPLEMENTATION_ID;
default: begin
if ((read_addr >= `CSR_MPM_BASE && read_addr < (`CSR_MPM_BASE + 32))
|| (read_addr >= `CSR_MPM_BASE_H && read_addr < (`CSR_MPM_BASE_H + 32))) begin
read_addr_valid_r = 1;
end else
`ifdef EXT_TEX_ENABLE
if ((read_addr == `CSR_TEX_UNIT)
|| (read_addr >= `CSR_TEX_STATE_BEGIN
&& read_addr < `CSR_TEX_STATE_END)) begin
read_addr_valid_r = 1;
end else
`endif
read_addr_valid_r = 0;
end
endcase
end
`RUNTIME_ASSERT(~read_enable || read_addr_valid_r, ("%t: *** invalid CSR read address: %0h (#%0d)", $time, read_addr, read_uuid))
assign read_data = read_data_r;
`ifdef EXT_F_ENABLE
assign fpu_to_csr_if.read_frm = fcsr[fpu_to_csr_if.read_wid][`INST_FRM_BITS+`FFLAGS_BITS-1:`FFLAGS_BITS];
`endif
endmodule


@@ -1,151 +0,0 @@
`include "VX_define.vh"
module VX_csr_unit #(
parameter CORE_ID = 0
) (
input wire clk,
input wire reset,
`ifdef PERF_ENABLE
`ifdef EXT_TEX_ENABLE
VX_perf_tex_if.slave perf_tex_if,
`endif
VX_perf_memsys_if.slave perf_memsys_if,
VX_perf_pipeline_if.slave perf_pipeline_if,
`endif
VX_cmt_to_csr_if.slave cmt_to_csr_if,
VX_fetch_to_csr_if.slave fetch_to_csr_if,
VX_csr_req_if.slave csr_req_if,
VX_commit_if.master csr_commit_if,
`ifdef EXT_F_ENABLE
VX_fpu_to_csr_if.slave fpu_to_csr_if,
input wire[`NUM_WARPS-1:0] fpu_pending,
`endif
`ifdef EXT_TEX_ENABLE
VX_tex_csr_if.master tex_csr_if,
`endif
output wire[`NUM_WARPS-1:0] pending,
input wire busy
);
wire csr_we_s1;
wire [`CSR_ADDR_BITS-1:0] csr_addr_s1;
wire [31:0] csr_read_data;
wire [31:0] csr_read_data_s1;
wire [31:0] csr_updated_data_s1;
wire write_enable = csr_commit_if.valid && csr_we_s1;
wire [31:0] csr_req_data = csr_req_if.use_imm ? 32'(csr_req_if.imm) : csr_req_if.rs1_data;
VX_csr_data #(
.CORE_ID(CORE_ID)
) csr_data (
.clk (clk),
.reset (reset),
`ifdef PERF_ENABLE
`ifdef EXT_TEX_ENABLE
.perf_tex_if (perf_tex_if),
`endif
.perf_memsys_if (perf_memsys_if),
.perf_pipeline_if(perf_pipeline_if),
`endif
.cmt_to_csr_if (cmt_to_csr_if),
.fetch_to_csr_if(fetch_to_csr_if),
`ifdef EXT_F_ENABLE
.fpu_to_csr_if (fpu_to_csr_if),
`endif
`ifdef EXT_TEX_ENABLE
.tex_csr_if (tex_csr_if),
`endif
.read_enable (csr_req_if.valid),
.read_uuid (csr_req_if.uuid),
.read_addr (csr_req_if.addr),
.read_wid (csr_req_if.wid),
.read_data (csr_read_data),
.write_enable (write_enable),
.write_uuid (csr_commit_if.uuid),
.write_addr (csr_addr_s1),
.write_wid (csr_commit_if.wid),
.write_data (csr_updated_data_s1),
.busy (busy)
);
wire write_hazard = (csr_addr_s1 == csr_req_if.addr)
&& (csr_commit_if.wid == csr_req_if.wid)
&& csr_commit_if.valid;
wire [31:0] csr_read_data_qual = write_hazard ? csr_updated_data_s1 : csr_read_data;
reg [31:0] csr_updated_data;
reg csr_we_s0_unqual;
always @(*) begin
csr_we_s0_unqual = (csr_req_data != 0);
case (csr_req_if.op_type)
`INST_CSR_RW: begin
csr_updated_data = csr_req_data;
csr_we_s0_unqual = 1;
end
`INST_CSR_RS: begin
csr_updated_data = csr_read_data_qual | csr_req_data;
end
//`INST_CSR_RC
default: begin
csr_updated_data = csr_read_data_qual & ~csr_req_data;
end
endcase
end
`ifdef EXT_F_ENABLE
wire stall_in = fpu_pending[csr_req_if.wid];
`else
wire stall_in = 0;
`endif
wire csr_req_valid = csr_req_if.valid && !stall_in;
wire stall_out = ~csr_commit_if.ready && csr_commit_if.valid;
VX_pipe_register #(
.DATAW (1 + `UUID_BITS + `NW_BITS + `NUM_THREADS + 32 + `NR_BITS + 1 + 1 + `CSR_ADDR_BITS + 32 + 32),
.RESETW (1)
) pipe_reg (
.clk (clk),
.reset (reset),
.enable (!stall_out),
.data_in ({csr_req_valid, csr_req_if.uuid, csr_req_if.wid, csr_req_if.tmask, csr_req_if.PC, csr_req_if.rd, csr_req_if.wb, csr_we_s0_unqual, csr_req_if.addr, csr_read_data_qual, csr_updated_data}),
.data_out ({csr_commit_if.valid, csr_commit_if.uuid, csr_commit_if.wid, csr_commit_if.tmask, csr_commit_if.PC, csr_commit_if.rd, csr_commit_if.wb, csr_we_s1, csr_addr_s1, csr_read_data_s1, csr_updated_data_s1})
);
for (genvar i = 0; i < `NUM_THREADS; i++) begin
assign csr_commit_if.data[i] = (csr_addr_s1 == `CSR_WTID) ? i :
(csr_addr_s1 == `CSR_LTID
|| csr_addr_s1 == `CSR_GTID) ? (csr_read_data_s1 * `NUM_THREADS + i) :
csr_read_data_s1;
end
assign csr_commit_if.eop = 1'b1;
// can accept new request?
assign csr_req_if.ready = ~(stall_out || stall_in);
// pending request
reg [`NUM_WARPS-1:0] pending_r;
always @(posedge clk) begin
if (reset) begin
pending_r <= 0;
end else begin
if (csr_commit_if.valid && csr_commit_if.ready) begin
pending_r[csr_commit_if.wid] <= 0;
end
if (csr_req_if.valid && csr_req_if.ready) begin
pending_r[csr_req_if.wid] <= 1;
end
end
end
assign pending = pending_r;
endmodule


@@ -1,495 +0,0 @@
`include "VX_define.vh"
`ifdef DBG_TRACE_CORE_PIPELINE
`include "VX_trace_instr.vh"
`endif
`ifdef EXT_F_ENABLE
`define USED_IREG(r) \
r``_r = {1'b0, ``r}
`define USED_FREG(r) \
r``_r = {1'b1, ``r}
`else
`define USED_IREG(r) \
r``_r = ``r
`endif
module VX_decode #(
parameter CORE_ID = 0
) (
input wire clk,
input wire reset,
`ifdef PERF_ENABLE
VX_perf_pipeline_if.decode perf_decode_if,
`endif
// inputs
VX_ifetch_rsp_if.slave ifetch_rsp_if,
// outputs
VX_decode_if.master decode_if,
VX_wstall_if.master wstall_if,
VX_join_if.master join_if
);
`UNUSED_PARAM (CORE_ID)
`UNUSED_VAR (clk)
`UNUSED_VAR (reset)
reg [`EX_BITS-1:0] ex_type;
reg [`INST_OP_BITS-1:0] op_type;
reg [`INST_MOD_BITS-1:0] op_mod;
reg [`NR_BITS-1:0] rd_r, rs1_r, rs2_r, rs3_r;
reg [31:0] imm;
reg use_rd, use_PC, use_imm;
reg is_join, is_wstall;
wire [31:0] instr = ifetch_rsp_if.data;
wire [6:0] opcode = instr[6:0];
wire [1:0] func2 = instr[26:25];
wire [2:0] func3 = instr[14:12];
wire [6:0] func7 = instr[31:25];
wire [11:0] u_12 = instr[31:20];
wire [4:0] rd = instr[11:7];
wire [4:0] rs1 = instr[19:15];
wire [4:0] rs2 = instr[24:20];
wire [4:0] rs3 = instr[31:27];
wire [19:0] upper_imm = {func7, rs2, rs1, func3};
wire [11:0] alu_imm = (func3[0] && ~func3[1]) ? {{7{1'b0}}, rs2} : u_12;
wire [11:0] s_imm = {func7, rd};
wire [12:0] b_imm = {instr[31], instr[7], instr[30:25], instr[11:8], 1'b0};
wire [20:0] jal_imm = {instr[31], instr[19:12], instr[20], instr[30:21], 1'b0};
`UNUSED_VAR (rs3)
always @(*) begin
ex_type = 0;
op_type = 'x;
op_mod = 0;
rd_r = 0;
rs1_r = 0;
rs2_r = 0;
rs3_r = 0;
imm = 'x;
use_imm = 0;
use_PC = 0;
use_rd = 0;
is_join = 0;
is_wstall = 0;
case (opcode)
`INST_I: begin
ex_type = `EX_ALU;
case (func3)
3'h0: op_type = `INST_OP_BITS'(`INST_ALU_ADD);
3'h1: op_type = `INST_OP_BITS'(`INST_ALU_SLL);
3'h2: op_type = `INST_OP_BITS'(`INST_ALU_SLT);
3'h3: op_type = `INST_OP_BITS'(`INST_ALU_SLTU);
3'h4: op_type = `INST_OP_BITS'(`INST_ALU_XOR);
3'h5: op_type = (func7[5]) ? `INST_OP_BITS'(`INST_ALU_SRA) : `INST_OP_BITS'(`INST_ALU_SRL);
3'h6: op_type = `INST_OP_BITS'(`INST_ALU_OR);
3'h7: op_type = `INST_OP_BITS'(`INST_ALU_AND);
default:;
endcase
use_rd = 1;
use_imm = 1;
imm = {{20{alu_imm[11]}}, alu_imm};
`USED_IREG (rd);
`USED_IREG (rs1);
end
`INST_R: begin
ex_type = `EX_ALU;
`ifdef EXT_F_ENABLE
if (func7[0]) begin
case (func3)
3'h0: op_type = `INST_OP_BITS'(`INST_MUL_MUL);
3'h1: op_type = `INST_OP_BITS'(`INST_MUL_MULH);
3'h2: op_type = `INST_OP_BITS'(`INST_MUL_MULHSU);
3'h3: op_type = `INST_OP_BITS'(`INST_MUL_MULHU);
3'h4: op_type = `INST_OP_BITS'(`INST_MUL_DIV);
3'h5: op_type = `INST_OP_BITS'(`INST_MUL_DIVU);
3'h6: op_type = `INST_OP_BITS'(`INST_MUL_REM);
3'h7: op_type = `INST_OP_BITS'(`INST_MUL_REMU);
default:;
endcase
op_mod = 2;
end else
`endif
begin
case (func3)
3'h0: op_type = (func7[5]) ? `INST_OP_BITS'(`INST_ALU_SUB) : `INST_OP_BITS'(`INST_ALU_ADD);
3'h1: op_type = `INST_OP_BITS'(`INST_ALU_SLL);
3'h2: op_type = `INST_OP_BITS'(`INST_ALU_SLT);
3'h3: op_type = `INST_OP_BITS'(`INST_ALU_SLTU);
3'h4: op_type = `INST_OP_BITS'(`INST_ALU_XOR);
3'h5: op_type = (func7[5]) ? `INST_OP_BITS'(`INST_ALU_SRA) : `INST_OP_BITS'(`INST_ALU_SRL);
3'h6: op_type = `INST_OP_BITS'(`INST_ALU_OR);
3'h7: op_type = `INST_OP_BITS'(`INST_ALU_AND);
default:;
endcase
end
use_rd = 1;
`USED_IREG (rd);
`USED_IREG (rs1);
`USED_IREG (rs2);
end
`INST_LUI: begin
ex_type = `EX_ALU;
op_type = `INST_OP_BITS'(`INST_ALU_LUI);
use_rd = 1;
use_imm = 1;
imm = {upper_imm, 12'(0)};
`USED_IREG (rd);
rs1_r = 0;
end
`INST_AUIPC: begin
ex_type = `EX_ALU;
op_type = `INST_OP_BITS'(`INST_ALU_AUIPC);
use_rd = 1;
use_imm = 1;
use_PC = 1;
imm = {upper_imm, 12'(0)};
`USED_IREG (rd);
end
`INST_JAL: begin
ex_type = `EX_ALU;
op_type = `INST_OP_BITS'(`INST_BR_JAL);
op_mod = 1;
use_rd = 1;
use_imm = 1;
use_PC = 1;
is_wstall = 1;
imm = {{11{jal_imm[20]}}, jal_imm};
`USED_IREG (rd);
end
`INST_JALR: begin
ex_type = `EX_ALU;
op_type = `INST_OP_BITS'(`INST_BR_JALR);
op_mod = 1;
use_rd = 1;
use_imm = 1;
is_wstall = 1;
imm = {{20{u_12[11]}}, u_12};
`USED_IREG (rd);
`USED_IREG (rs1);
end
`INST_B: begin
ex_type = `EX_ALU;
case (func3)
3'h0: op_type = `INST_OP_BITS'(`INST_BR_EQ);
3'h1: op_type = `INST_OP_BITS'(`INST_BR_NE);
3'h4: op_type = `INST_OP_BITS'(`INST_BR_LT);
3'h5: op_type = `INST_OP_BITS'(`INST_BR_GE);
3'h6: op_type = `INST_OP_BITS'(`INST_BR_LTU);
3'h7: op_type = `INST_OP_BITS'(`INST_BR_GEU);
default:;
endcase
op_mod = 1;
use_imm = 1;
use_PC = 1;
is_wstall = 1;
imm = {{19{b_imm[12]}}, b_imm};
`USED_IREG (rs1);
`USED_IREG (rs2);
end
`INST_FENCE: begin
ex_type = `EX_LSU;
op_mod = `INST_MOD_BITS'(1);
end
`INST_SYS : begin
if (func3[1:0] != 0) begin
ex_type = `EX_CSR;
op_type = `INST_OP_BITS'(func3[1:0]);
use_rd = 1;
use_imm = func3[2];
imm[`CSR_ADDR_BITS-1:0] = u_12; // addr
`USED_IREG (rd);
if (func3[2]) begin
imm[`CSR_ADDR_BITS +: `NRI_BITS] = rs1; // imm
end else begin
`USED_IREG (rs1);
end
end else begin
ex_type = `EX_ALU;
case (u_12)
12'h000: op_type = `INST_OP_BITS'(`INST_BR_ECALL);
12'h001: op_type = `INST_OP_BITS'(`INST_BR_EBREAK);
12'h002: op_type = `INST_OP_BITS'(`INST_BR_URET);
12'h102: op_type = `INST_OP_BITS'(`INST_BR_SRET);
12'h302: op_type = `INST_OP_BITS'(`INST_BR_MRET);
default:;
endcase
op_mod = 1;
use_rd = 1;
use_imm = 1;
use_PC = 1;
is_wstall = 1;
imm = 32'd4;
`USED_IREG (rd);
end
end
`ifdef EXT_F_ENABLE
`INST_FL,
`endif
`INST_L: begin
ex_type = `EX_LSU;
op_type = `INST_OP_BITS'({1'b0, func3});
use_rd = 1;
imm = {{20{u_12[11]}}, u_12};
`ifdef EXT_F_ENABLE
if (opcode[2]) begin
`USED_FREG (rd);
end else
`endif
`USED_IREG (rd);
`USED_IREG (rs1);
end
`ifdef EXT_F_ENABLE
`INST_FS,
`endif
`INST_S: begin
ex_type = `EX_LSU;
op_type = `INST_OP_BITS'({1'b1, func3});
imm = {{20{s_imm[11]}}, s_imm};
`USED_IREG (rs1);
`ifdef EXT_F_ENABLE
if (opcode[2]) begin
`USED_FREG (rs2);
end else
`endif
`USED_IREG (rs2);
end
`ifdef EXT_F_ENABLE
`INST_FMADD,
`INST_FMSUB,
`INST_FNMSUB,
`INST_FNMADD: begin
ex_type = `EX_FPU;
op_type = `INST_OP_BITS'(opcode[3:0]);
op_mod = func3;
use_rd = 1;
`USED_FREG (rd);
`USED_FREG (rs1);
`USED_FREG (rs2);
`USED_FREG (rs3);
end
`INST_FCI: begin
ex_type = `EX_FPU;
op_mod = func3;
use_rd = 1;
case (func7)
7'h00, // FADD
7'h04, // FSUB
7'h08, // FMUL
7'h0C: begin // FDIV
op_type = `INST_OP_BITS'(func7[3:0]);
`USED_FREG (rd);
`USED_FREG (rs1);
`USED_FREG (rs2);
end
7'h2C: begin
op_type = `INST_OP_BITS'(`INST_FPU_SQRT);
`USED_FREG (rd);
`USED_FREG (rs1);
end
7'h50: begin
op_type = `INST_OP_BITS'(`INST_FPU_CMP);
`USED_IREG (rd);
`USED_FREG (rs1);
`USED_FREG (rs2);
end
7'h60: begin
op_type = (instr[20]) ? `INST_OP_BITS'(`INST_FPU_CVTWUS) : `INST_OP_BITS'(`INST_FPU_CVTWS);
`USED_IREG (rd);
`USED_FREG (rs1);
end
7'h68: begin
op_type = (instr[20]) ? `INST_OP_BITS'(`INST_FPU_CVTSWU) : `INST_OP_BITS'(`INST_FPU_CVTSW);
`USED_FREG (rd);
`USED_IREG (rs1);
end
7'h10: begin
// FSGNJ=0, FSGNJN=1, FSGNJX=2
op_type = `INST_OP_BITS'(`INST_FPU_MISC);
op_mod = {1'b0, func3[1:0]};
`USED_FREG (rd);
`USED_FREG (rs1);
`USED_FREG (rs2);
end
7'h14: begin
// FMIN=3, FMAX=4
op_type = `INST_OP_BITS'(`INST_FPU_MISC);
op_mod = func3[0] ? 4 : 3;
`USED_FREG (rd);
`USED_FREG (rs1);
`USED_FREG (rs2);
end
7'h70: begin
if (func3[0]) begin
// FCLASS
op_type = `INST_OP_BITS'(`INST_FPU_CLASS);
end else begin
// FMV.X.W=5
op_type = `INST_OP_BITS'(`INST_FPU_MISC);
op_mod = 5;
end
`USED_IREG (rd);
`USED_FREG (rs1);
end
7'h78: begin
// FMV.W.X=6
op_type = `INST_OP_BITS'(`INST_FPU_MISC);
op_mod = 6;
`USED_FREG (rd);
`USED_IREG (rs1);
end
default:;
endcase
end
`endif
`INST_GPGPU: begin
ex_type = `EX_GPU;
case (func3)
3'h0: begin
op_type = rs2[0] ? `INST_OP_BITS'(`INST_GPU_PRED) : `INST_OP_BITS'(`INST_GPU_TMC);
is_wstall = 1;
`USED_IREG (rs1);
end
3'h1: begin
op_type = `INST_OP_BITS'(`INST_GPU_WSPAWN);
`USED_IREG (rs1);
`USED_IREG (rs2);
end
3'h2: begin
op_type = `INST_OP_BITS'(`INST_GPU_SPLIT);
is_wstall = 1;
`USED_IREG (rs1);
end
3'h3: begin
op_type = `INST_OP_BITS'(`INST_GPU_JOIN);
is_join = 1;
end
3'h4: begin
op_type = `INST_OP_BITS'(`INST_GPU_BAR);
is_wstall = 1;
`USED_IREG (rs1);
`USED_IREG (rs2);
end
3'h5: begin
ex_type = `EX_LSU;
op_type = `INST_OP_BITS'(`INST_LSU_LW);
op_mod = `INST_MOD_BITS'(2);
`USED_IREG (rs1);
end
default:;
endcase
end
`INST_GPU: begin
case (func3)
`ifdef EXT_TEX_ENABLE
3'h0: begin
ex_type = `EX_GPU;
op_type = `INST_OP_BITS'(`INST_GPU_TEX);
op_mod = `INST_MOD_BITS'(func2);
use_rd = 1;
`USED_IREG (rd);
`USED_IREG (rs1);
`USED_IREG (rs2);
`USED_IREG (rs3);
end
`endif
default:;
endcase
end
default:;
endcase
end
`UNUSED_VAR (func2)
// disable write to integer register r0
wire wb = use_rd && (| rd_r);
assign decode_if.valid = ifetch_rsp_if.valid;
assign decode_if.uuid = ifetch_rsp_if.uuid;
assign decode_if.wid = ifetch_rsp_if.wid;
assign decode_if.tmask = ifetch_rsp_if.tmask;
assign decode_if.PC = ifetch_rsp_if.PC;
assign decode_if.ex_type = ex_type;
assign decode_if.op_type = op_type;
assign decode_if.op_mod = op_mod;
assign decode_if.wb = wb;
assign decode_if.rd = rd_r;
assign decode_if.rs1 = rs1_r;
assign decode_if.rs2 = rs2_r;
assign decode_if.rs3 = rs3_r;
assign decode_if.imm = imm;
assign decode_if.use_PC = use_PC;
assign decode_if.use_imm = use_imm;
///////////////////////////////////////////////////////////////////////////
wire ifetch_rsp_fire = ifetch_rsp_if.valid && ifetch_rsp_if.ready;
assign join_if.valid = ifetch_rsp_fire && is_join;
assign join_if.wid = ifetch_rsp_if.wid;
assign wstall_if.valid = ifetch_rsp_fire;
assign wstall_if.wid = ifetch_rsp_if.wid;
assign wstall_if.stalled = is_wstall;
assign ifetch_rsp_if.ready = decode_if.ready;
`ifdef PERF_ENABLE
wire [$clog2(`NUM_THREADS+1)-1:0] perf_loads_per_cycle;
wire [$clog2(`NUM_THREADS+1)-1:0] perf_stores_per_cycle;
wire [$clog2(`NUM_THREADS+1)-1:0] perf_branches_per_cycle;
wire [`NUM_THREADS-1:0] perf_loads_per_mask = decode_if.tmask & {`NUM_THREADS{decode_if.ex_type == `EX_LSU && `INST_LSU_IS_MEM(decode_if.op_mod) && decode_if.wb}};
wire [`NUM_THREADS-1:0] perf_stores_per_mask = decode_if.tmask & {`NUM_THREADS{decode_if.ex_type == `EX_LSU && `INST_LSU_IS_MEM(decode_if.op_mod) && ~decode_if.wb}};
wire [`NUM_THREADS-1:0] perf_branches_per_mask = decode_if.tmask & {`NUM_THREADS{decode_if.ex_type == `EX_ALU && `INST_ALU_IS_BR(decode_if.op_mod)}};
`POP_COUNT(perf_loads_per_cycle, perf_loads_per_mask);
`POP_COUNT(perf_stores_per_cycle, perf_stores_per_mask);
`POP_COUNT(perf_branches_per_cycle, perf_branches_per_mask);
reg [`PERF_CTR_BITS-1:0] perf_loads;
reg [`PERF_CTR_BITS-1:0] perf_stores;
reg [`PERF_CTR_BITS-1:0] perf_branches;
always @(posedge clk) begin
if (reset) begin
perf_loads <= 0;
perf_stores <= 0;
perf_branches <= 0;
end else begin
if (decode_if.valid && decode_if.ready) begin
perf_loads <= perf_loads + `PERF_CTR_BITS'(perf_loads_per_cycle);
perf_stores <= perf_stores + `PERF_CTR_BITS'(perf_stores_per_cycle);
perf_branches <= perf_branches + `PERF_CTR_BITS'(perf_branches_per_cycle);
end
end
end
assign perf_decode_if.loads = perf_loads;
assign perf_decode_if.stores = perf_stores;
assign perf_decode_if.branches = perf_branches;
`endif
`ifdef DBG_TRACE_CORE_PIPELINE
always @(posedge clk) begin
if (decode_if.valid && decode_if.ready) begin
dpi_trace("%d: core%0d-decode: wid=%0d, PC=%0h, ex=", $time, CORE_ID, decode_if.wid, decode_if.PC);
trace_ex_type(decode_if.ex_type);
dpi_trace(", op=");
trace_ex_op(decode_if.ex_type, decode_if.op_type, decode_if.op_mod);
dpi_trace(", mod=%0d, tmask=%b, wb=%b, rd=%0d, rs1=%0d, rs2=%0d, rs3=%0d, imm=%0h, use_pc=%b, use_imm=%b (#%0d)\n",
decode_if.op_mod, decode_if.tmask, decode_if.wb, decode_if.rd, decode_if.rs1, decode_if.rs2, decode_if.rs3, decode_if.imm, decode_if.use_PC, decode_if.use_imm, decode_if.uuid);
end
end
`endif
endmodule


@@ -1,24 +1,40 @@
`ifndef VX_DEFINE
`define VX_DEFINE
// Copyright © 2019-2023
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
`ifndef VX_DEFINE_VH
`define VX_DEFINE_VH
`include "VX_platform.vh"
`include "VX_config.vh"
`include "VX_types.vh"
///////////////////////////////////////////////////////////////////////////////
`define NW_BITS `LOG2UP(`NUM_WARPS)
`define NW_BITS `CLOG2(`NUM_WARPS)
`define NC_WIDTH `UP(`NC_BITS)
`define NT_BITS `LOG2UP(`NUM_THREADS)
`define NT_BITS `CLOG2(`NUM_THREADS)
`define NW_WIDTH `UP(`NW_BITS)
`define NC_BITS `LOG2UP(`NUM_CORES)
`define NC_BITS `CLOG2(`NUM_CORES)
`define NT_WIDTH `UP(`NT_BITS)
`define NB_BITS `LOG2UP(`NUM_BARRIERS)
`define NB_BITS `CLOG2(`NUM_BARRIERS)
`define NB_WIDTH `UP(`NB_BITS)
`define NUM_IREGS 32
`define NRI_BITS `LOG2UP(`NUM_IREGS)
`define NTEX_BITS `LOG2UP(`NUM_TEX_UNITS)
`define NRI_BITS `CLOG2(`NUM_IREGS)
`ifdef EXT_F_ENABLE
`define NUM_REGS (2 * `NUM_IREGS)
@@ -26,25 +42,34 @@
`define NUM_REGS `NUM_IREGS
`endif
`define NR_BITS `LOG2UP(`NUM_REGS)
`define CSR_ADDR_BITS 12
`define CSR_WIDTH 12
`define NR_BITS `CLOG2(`NUM_REGS)
`define PERF_CTR_BITS 44
`define UUID_BITS 44
`ifndef NDEBUG
`define UUID_WIDTH 44
`else
`define UUID_WIDTH 1
`endif
///////////////////////////////////////////////////////////////////////////////
`define EX_NOP 3'h0
`define EX_ALU 3'h1
`define EX_LSU 3'h2
`define EX_CSR 3'h3
`define EX_FPU 3'h4
`define EX_GPU 3'h5
`define EX_BITS 3
`define EX_ALU 0
`define EX_LSU 1
`define EX_SFU 2
`define EX_FPU (`EX_SFU + `EXT_F_ENABLED)
`define EX_TENSOR (`EX_FPU + `EXT_T_ENABLED)
`define NUM_EX_UNITS (3 + `EXT_F_ENABLED + `EXT_T_ENABLED)
`define EX_BITS `CLOG2(`NUM_EX_UNITS)
`define EX_WIDTH `UP(`EX_BITS)
`define SFU_CSRS 0
`define SFU_WCTL 1
`define NUM_SFU_UNITS (2)
`define SFU_BITS `CLOG2(`NUM_SFU_UNITS)
`define SFU_WIDTH `UP(`SFU_BITS)
///////////////////////////////////////////////////////////////////////////////
@@ -60,6 +85,10 @@
`define INST_FENCE 7'b0001111 // Fence instructions
`define INST_SYS 7'b1110011 // system instructions
// RV64I instruction specific opcodes (for any W instruction)
`define INST_I_W 7'b0011011 // W type immediate instructions
`define INST_R_W 7'b0111011 // W type register instructions
`define INST_FL 7'b0000111 // float load instruction
`define INST_FS 7'b0100111 // float store instruction
`define INST_FMADD 7'b1000011
@@ -68,10 +97,11 @@
`define INST_FNMADD 7'b1001111
`define INST_FCI 7'b1010011 // float common instructions
`define INST_GPGPU 7'b1101011
`define INST_GPU 7'b1011011
`define INST_TEX 7'b0101011
// Custom extension opcodes
`define INST_EXT1 7'b0001011 // 0x0B
`define INST_EXT2 7'b0101011 // 0x2B
`define INST_EXT3 7'b1011011 // 0x5B
`define INST_EXT4 7'b1111011 // 0x7B
///////////////////////////////////////////////////////////////////////////////
@@ -86,7 +116,8 @@
///////////////////////////////////////////////////////////////////////////////
`define INST_OP_BITS 4
`define INST_MOD_BITS 3
`define INST_MOD_BITS 4
`define INST_FMT_BITS 2
///////////////////////////////////////////////////////////////////////////////
@@ -95,20 +126,22 @@
`define INST_ALU_AUIPC 4'b0011
`define INST_ALU_SLTU 4'b0100
`define INST_ALU_SLT 4'b0101
`define INST_ALU_SUB 4'b0111
`define INST_ALU_SRL 4'b1000
`define INST_ALU_SRA 4'b1001
`define INST_ALU_SUB 4'b1011
`define INST_ALU_AND 4'b1100
`define INST_ALU_OR 4'b1101
`define INST_ALU_XOR 4'b1110
`define INST_ALU_SLL 4'b1111
`define INST_ALU_OTHER 4'b0111
`define INST_ALU_BITS 4
`define INST_ALU_OP(x) x[`INST_ALU_BITS-1:0]
`define INST_ALU_OP_CLASS(x) x[3:2]
`define INST_ALU_SIGNED(x) x[0]
`define INST_ALU_IS_BR(x) x[0]
`define INST_ALU_IS_MUL(x) x[1]
`define INST_ALU_CLASS(op) op[3:2]
`define INST_ALU_SIGNED(op) op[0]
`define INST_ALU_IS_SUB(op) op[1]
`define INST_ALU_IS_BR(mod) mod[0]
`define INST_ALU_IS_M(mod) mod[1]
`define INST_ALU_IS_W(mod) mod[2]
`define INST_ALU_IS_RED(mod) mod[3]
`define INST_BR_EQ 4'b0000
`define INST_BR_NE 4'b0010
@@ -125,292 +158,307 @@
`define INST_BR_MRET 4'b1110
`define INST_BR_OTHER 4'b1111
`define INST_BR_BITS 4
`define INST_BR_NEG(x) x[1]
`define INST_BR_LESS(x) x[2]
`define INST_BR_STATIC(x) x[3]
`define INST_BR_CLASS(op) {1'b0, ~op[3]}
`define INST_BR_IS_NEG(op) op[1]
`define INST_BR_IS_LESS(op) op[2]
`define INST_BR_IS_STATIC(op) op[3]
`define INST_MUL_MUL 3'h0
`define INST_MUL_MULH 3'h1
`define INST_MUL_MULHSU 3'h2
`define INST_MUL_MULHU 3'h3
`define INST_MUL_DIV 3'h4
`define INST_MUL_DIVU 3'h5
`define INST_MUL_REM 3'h6
`define INST_MUL_REMU 3'h7
`define INST_MUL_BITS 3
`define INST_MUL_IS_DIV(x) x[2]
`define INST_M_MUL 3'b000
`define INST_M_MULHU 3'b001
`define INST_M_MULH 3'b010
`define INST_M_MULHSU 3'b011
`define INST_M_DIV 3'b100
`define INST_M_DIVU 3'b101
`define INST_M_REM 3'b110
`define INST_M_REMU 3'b111
`define INST_M_BITS 3
`define INST_M_SIGNED(op) (~op[0])
`define INST_M_IS_MULX(op) (~op[2])
`define INST_M_IS_MULH(op) (op[1:0] != 0)
`define INST_M_SIGNED_A(op) (op[1:0] != 1)
`define INST_M_IS_REM(op) op[1]
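// Worked example (illustrative, derived only from the INST_M_* encodings above):
// value of each helper macro per op.
//   op      bits  SIGNED  IS_MULX  IS_MULH  SIGNED_A  IS_REM
//   MUL     000     1        1        0        1        0
//   MULHU   001     0        1        1        0        0
//   MULH    010     1        1        1        1        1
//   MULHSU  011     0        1        1        1        1
//   DIV     100     1        0        0        1        0
//   DIVU    101     0        0        1        0        0
//   REM     110     1        0        1        1        1
//   REMU    111     0        0        1        1        1
// IS_MULH is presumably only meaningful when IS_MULX is 1, and IS_REM only
// when it is 0. For MULHSU (signed rs1 times unsigned rs2 in the RISC-V M
// extension) SIGNED_A = 1 and SIGNED = 0, which suggests SIGNED_A describes
// the first operand; that reading is an assumption, not stated in the source.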
`define INST_RED_ADD 4'b0000
`define INST_RED_ADDU 4'b1000
`define INST_RED_MIN 4'b0001
`define INST_RED_MINU 4'b1001
`define INST_RED_MAX 4'b0010
`define INST_RED_MAXU 4'b1010
`define INST_RED_AND 4'b0011
`define INST_RED_OR 4'b0100
`define INST_RED_XOR 4'b0101
`define INST_RED_BITS 4
`define INST_FMT_B 3'b000
`define INST_FMT_H 3'b001
`define INST_FMT_W 3'b010
`define INST_FMT_D 3'b011
`define INST_FMT_BU 3'b100
`define INST_FMT_HU 3'b101
`define INST_FMT_WU 3'b110
`define INST_LSU_LB 4'b0000
`define INST_LSU_LH 4'b0001
`define INST_LSU_LW 4'b0010
`define INST_LSU_LD 4'b0011 // new for RV64I LD
`define INST_LSU_LBU 4'b0100
`define INST_LSU_LHU 4'b0101
`define INST_LSU_LWU 4'b0110 // new for RV64I LWU
`define INST_LSU_SB 4'b1000
`define INST_LSU_SH 4'b1001
`define INST_LSU_SW 4'b1010
`define INST_LSU_SD 4'b1011 // new for RV64I SD
`define INST_LSU_FENCE 4'b1111
`define INST_LSU_BITS 4
`define INST_LSU_FMT(x) x[2:0]
`define INST_LSU_WSIZE(x) x[1:0]
`define INST_LSU_IS_MEM(x) (3'h0 == x)
`define INST_LSU_IS_FENCE(x) (3'h1 == x)
`define INST_LSU_IS_PREFETCH(x) (3'h2 == x)
`define INST_LSU_FMT(op) op[2:0]
`define INST_LSU_WSIZE(op) op[1:0]
`define INST_LSU_IS_FENCE(op) (op[3:2] == 3)
`define INST_FENCE_BITS 1
`define INST_FENCE_D 1'h0
`define INST_FENCE_I 1'h1
`define INST_CSR_RW 2'h1
`define INST_CSR_RS 2'h2
`define INST_CSR_RC 2'h3
`define INST_CSR_OTHER 2'h0
`define INST_CSR_BITS 2
`define INST_FPU_ADD 4'h0
`define INST_FPU_SUB 4'h4
`define INST_FPU_MUL 4'h8
`define INST_FPU_DIV 4'hC
`define INST_FPU_CVTWS 4'h1 // FCVT.W.S
`define INST_FPU_CVTWUS 4'h5 // FCVT.WU.S
`define INST_FPU_CVTSW 4'h9 // FCVT.S.W
`define INST_FPU_CVTSWU 4'hD // FCVT.S.WU
`define INST_FPU_SQRT 4'h2
`define INST_FPU_CLASS 4'h6
`define INST_FPU_CMP 4'hA
`define INST_FPU_MISC 4'hE // SGNJ, SGNJN, SGNJX, FMIN, FMAX, MVXW, MVWX
`define INST_FPU_MADD 4'h3
`define INST_FPU_MSUB 4'h7
`define INST_FPU_NMSUB 4'hB
`define INST_FPU_NMADD 4'hF
`define INST_FPU_ADD 4'b0000
`define INST_FPU_SUB 4'b0001
`define INST_FPU_MUL 4'b0010
`define INST_FPU_DIV 4'b0011
`define INST_FPU_SQRT 4'b0100
`define INST_FPU_CMP 4'b0101 // mod: LE=0, LT=1, EQ=2
`define INST_FPU_F2F 4'b0110
`define INST_FPU_MISC 4'b0111 // mod: SGNJ=0, SGNJN=1, SGNJX=2, CLASS=3, MVXW=4, MVWX=5, FMIN=6, FMAX=7
`define INST_FPU_F2I 4'b1000
`define INST_FPU_F2U 4'b1001
`define INST_FPU_I2F 4'b1010
`define INST_FPU_U2F 4'b1011
`define INST_FPU_MADD 4'b1100
`define INST_FPU_MSUB 4'b1101
`define INST_FPU_NMSUB 4'b1110
`define INST_FPU_NMADD 4'b1111
`define INST_FPU_BITS 4
`define INST_FPU_IS_W(mod) (mod[4])
`define INST_FPU_IS_CLASS(op, mod) (op == `INST_FPU_MISC && mod == 3)
`define INST_FPU_IS_MVXW(op, mod) (op == `INST_FPU_MISC && mod == 4)
`define INST_GPU_TMC 4'h0
`define INST_GPU_WSPAWN 4'h1
`define INST_GPU_SPLIT 4'h2
`define INST_GPU_JOIN 4'h3
`define INST_GPU_BAR 4'h4
`define INST_GPU_PRED 4'h5
`define INST_GPU_TEX 4'h6
`define INST_GPU_BITS 4
`define INST_SFU_TMC 4'h0
`define INST_SFU_WSPAWN 4'h1
`define INST_SFU_SPLIT 4'h2
`define INST_SFU_JOIN 4'h3
`define INST_SFU_BAR 4'h4
`define INST_SFU_PRED 4'h5
`define INST_SFU_CSRRW 4'h6
`define INST_SFU_CSRRS 4'h7
`define INST_SFU_CSRRC 4'h8
`define INST_SFU_CMOV 4'h9
`define INST_SFU_BAR_MASK 4'ha
`define INST_SFU_BITS 4
`define INST_SFU_CSR(f3) (4'h6 + 4'(f3) - 4'h1)
`define INST_SFU_IS_WCTL(op) ((op <= 5) || (op == `INST_SFU_BAR_MASK))
`define INST_SFU_IS_CSR(op) (op >= 6 && op <= 8)
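// Illustrative self-check (a sketch, not part of the original change). It
// exercises the SFU op macros defined just above. The module name and the
// VX_DEFINE_SELFCHECK guard macro are hypothetical; the block stays inert
// unless that macro is defined for a simulation build.
`ifdef VX_DEFINE_SELFCHECK
module VX_define_selfcheck;
    initial begin
        // funct3[1:0] is 1/2/3 for CSRRW/CSRRS/CSRRC in the RISC-V Zicsr encoding,
        // so INST_SFU_CSR should land on 4'h6, 4'h7 and 4'h8 respectively.
        if (`INST_SFU_CSR(2'd1) != `INST_SFU_CSRRW) $error("CSRRW mapping mismatch");
        if (`INST_SFU_CSR(2'd2) != `INST_SFU_CSRRS) $error("CSRRS mapping mismatch");
        if (`INST_SFU_CSR(2'd3) != `INST_SFU_CSRRC) $error("CSRRC mapping mismatch");
        // CSR ops occupy 6..8; warp-control ops are 0..5 plus BAR_MASK (4'ha).
        if (!`INST_SFU_IS_CSR(`INST_SFU_CSRRC))     $error("CSRRC not classified as CSR");
        if (`INST_SFU_IS_WCTL(`INST_SFU_CSRRW))     $error("CSRRW classified as WCTL");
        if (!`INST_SFU_IS_WCTL(`INST_SFU_BAR_MASK)) $error("BAR_MASK not classified as WCTL");
        // Under this encoding CMOV (4'h9) falls in neither class.
        if (`INST_SFU_IS_CSR(`INST_SFU_CMOV) || `INST_SFU_IS_WCTL(`INST_SFU_CMOV))
            $error("CMOV unexpectedly classified");
    end
endmodule
`endif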
///////////////////////////////////////////////////////////////////////////////
`define INST_TENSOR_HMMA 4'b0000
// Hopper WGMMA-style asynchronous op
`define INST_TENSOR_HGMMA 4'b0001
`define INST_TENSOR_HGMMA_WAIT 4'b0010
`define INST_TENSOR_TCGEN05_CP 4'b0011
`define INST_TENSOR_TCGEN05_CP_WAIT 4'b0100
`define INST_TENSOR_BWGMMA 4'b0101
`define INST_TENSOR_BWGMMA_WAIT 4'b0110
`define INST_TENSOR_TCGEN05_LD 4'b0111
`define INST_TENSOR_TCGEN05_ST 4'b1000
`define INST_TENSOR_TCGEN05_CB 4'b1001
`ifdef EXT_M_ENABLE
`define ISA_EXT_M (1 << 12)
`else
`define ISA_EXT_M 0
`ifdef EXT_T_HOPPER
`define EXT_T_ASYNC
`elsif EXT_T_BLACKWELL
`define EXT_T_ASYNC
`endif
`ifdef EXT_F_ENABLE
`define ISA_EXT_F (1 << 5)
`else
`define ISA_EXT_F 0
`endif
`define ISA_CODE (0 << 0) // A - Atomic Instructions extension \
| (0 << 1) // B - Tentatively reserved for Bit operations extension \
| (0 << 2) // C - Compressed extension \
| (0 << 3) // D - Double precision floating-point extension \
| (0 << 4) // E - RV32E base ISA \
|`ISA_EXT_F // F - Single precision floating-point extension \
| (0 << 6) // G - Additional standard extensions present \
| (0 << 7) // H - Hypervisor mode implemented \
| (1 << 8) // I - RV32I/64I/128I base ISA \
| (0 << 9) // J - Reserved \
| (0 << 10) // K - Reserved \
| (0 << 11) // L - Tentatively reserved for Decimal Floating-Point extension \
|`ISA_EXT_M // M - Integer Multiply/Divide extension \
| (0 << 13) // N - User level interrupts supported \
| (0 << 14) // O - Reserved \
| (0 << 15) // P - Tentatively reserved for Packed-SIMD extension \
| (0 << 16) // Q - Quad-precision floating-point extension \
| (0 << 17) // R - Reserved \
| (0 << 18) // S - Supervisor mode implemented \
| (0 << 19) // T - Tentatively reserved for Transactional Memory extension \
| (1 << 20) // U - User mode implemented \
| (0 << 21) // V - Tentatively reserved for Vector extension \
| (0 << 22) // W - Reserved \
| (1 << 23) // X - Non-standard extensions present \
| (0 << 24) // Y - Reserved \
| (0 << 25) // Z - Reserved
///////////////////////////////////////////////////////////////////////////////
// non-cacheable tag bits
`define NC_TAG_BIT 1
// texture tag bits
`define TEX_TAG_BIT 1
`define NC_TAG_BITS 1
// cache address type bits
`define CACHE_ADDR_TYPE_BITS (`NC_TAG_BIT + `SM_ENABLE)
////////////////////////// Icache Configurable Knobs //////////////////////////
// Cache ID
`define ICACHE_ID (32'(`L3_ENABLE) + 32'(`L2_ENABLE) * `NUM_CLUSTERS + CORE_ID * 3 + 0)
// Word size in bytes
`define ICACHE_WORD_SIZE 4
// Block size in bytes
`define ICACHE_LINE_SIZE `L1_BLOCK_SIZE
// TAG sharing enable
`define ICACHE_CORE_TAG_ID_BITS `NW_BITS
// Core request tag bits
`define ICACHE_CORE_TAG_WIDTH (`UUID_BITS + `ICACHE_CORE_TAG_ID_BITS)
// Memory request data bits
`define ICACHE_MEM_DATA_WIDTH (`ICACHE_LINE_SIZE * 8)
// Memory request address bits
`define ICACHE_MEM_ADDR_WIDTH (32 - `CLOG2(`ICACHE_LINE_SIZE))
// Memory request tag bits
`define ICACHE_MEM_TAG_WIDTH `CLOG2(`ICACHE_MSHR_SIZE)
////////////////////////// Dcache Configurable Knobs //////////////////////////
// Cache ID
`define DCACHE_ID (32'(`L3_ENABLE) + 32'(`L2_ENABLE) * `NUM_CLUSTERS + CORE_ID * 3 + 1)
// Word size in bytes
`define DCACHE_WORD_SIZE 4
// Block size in bytes
`define DCACHE_LINE_SIZE `L1_BLOCK_SIZE
// Core request tag bits
`define LSUQ_ADDR_BITS `LOG2UP(`LSUQ_SIZE)
`ifdef EXT_TEX_ENABLE
`define LSU_TAG_ID_BITS `MAX(`LSUQ_ADDR_BITS, 2)
`define LSU_TEX_DCACHE_TAG_BITS (`UUID_BITS + `LSU_TAG_ID_BITS + `CACHE_ADDR_TYPE_BITS)
`define DCACHE_CORE_TAG_ID_BITS (`LSU_TAG_ID_BITS + `CACHE_ADDR_TYPE_BITS + `TEX_TAG_BIT)
`else
`define LSU_TAG_ID_BITS `LSUQ_ADDR_BITS
`define DCACHE_CORE_TAG_ID_BITS (`LSU_TAG_ID_BITS + `CACHE_ADDR_TYPE_BITS)
`ifdef SM_ENABLE
`define CACHE_ADDR_TYPE_BITS (`NC_TAG_BITS + 1)
`else
`define CACHE_ADDR_TYPE_BITS `NC_TAG_BITS
`endif
`define DCACHE_CORE_TAG_WIDTH (`UUID_BITS + `DCACHE_CORE_TAG_ID_BITS)
// Memory request data bits
`define DCACHE_MEM_DATA_WIDTH (`DCACHE_LINE_SIZE * 8)
// Memory request address bits
`define DCACHE_MEM_ADDR_WIDTH (32 - `CLOG2(`DCACHE_LINE_SIZE))
// Memory byte enable bits
`define DCACHE_MEM_BYTEEN_WIDTH `DCACHE_LINE_SIZE
// Input request size
`define DCACHE_NUM_REQS `NUM_THREADS
// Memory request tag bits
`define _DMEM_ADDR_RATIO_W $clog2(`DCACHE_LINE_SIZE / `DCACHE_WORD_SIZE)
`define _DNC_MEM_TAG_WIDTH ($clog2(`DCACHE_NUM_REQS) + `_DMEM_ADDR_RATIO_W + `DCACHE_CORE_TAG_WIDTH)
`define DCACHE_MEM_TAG_WIDTH `MAX((`CLOG2(`DCACHE_NUM_BANKS) + `CLOG2(`DCACHE_MSHR_SIZE) + `NC_TAG_BIT), `_DNC_MEM_TAG_WIDTH)
// Merged D-cache/I-cache memory tag
`define L1_MEM_TAG_WIDTH (`MAX(`ICACHE_MEM_TAG_WIDTH, `DCACHE_MEM_TAG_WIDTH) + `CLOG2(2))
////////////////////////// SM Configurable Knobs //////////////////////////////
// Cache ID
`define SMEM_ID (32'(`L3_ENABLE) + 32'(`L2_ENABLE) * `NUM_CLUSTERS + CORE_ID * 3 + 2)
// Word size in bytes
`define SMEM_WORD_SIZE 4
// bank address offset
`define SMEM_BANK_ADDR_OFFSET `CLOG2(`STACK_SIZE / `SMEM_WORD_SIZE)
// Input request size
`define SMEM_NUM_REQS `NUM_THREADS
////////////////////////// L2cache Configurable Knobs /////////////////////////
// Cache ID
`define L2_CACHE_ID (32'(`L3_ENABLE) + CLUSTER_ID)
// Word size in bytes
`define L2_WORD_SIZE `DCACHE_LINE_SIZE
// Block size in bytes
`define L2_CACHE_LINE_SIZE ((`L2_ENABLE) ? `MEM_BLOCK_SIZE : `L2_WORD_SIZE)
// Input request tag bits
`define L2_CORE_TAG_WIDTH (`DCACHE_CORE_TAG_WIDTH + `CLOG2(`NUM_CORES))
// Memory request data bits
`define L2_MEM_DATA_WIDTH (`L2_CACHE_LINE_SIZE * 8)
// Memory request address bits
`define L2_MEM_ADDR_WIDTH (32 - `CLOG2(`L2_CACHE_LINE_SIZE))
// Memory byte enable bits
`define L2_MEM_BYTEEN_WIDTH `L2_CACHE_LINE_SIZE
// Input request size
`define L2_NUM_REQS `NUM_CORES
// Memory request tag bits
`define _L2_MEM_ADDR_RATIO_W $clog2(`L2_CACHE_LINE_SIZE / `L2_WORD_SIZE)
`define _L2_NC_MEM_TAG_WIDTH ($clog2(`L2_NUM_REQS) + `_L2_MEM_ADDR_RATIO_W + `L1_MEM_TAG_WIDTH)
`define _L2_MEM_TAG_WIDTH `MAX((`CLOG2(`L2_NUM_BANKS) + `CLOG2(`L2_MSHR_SIZE) + `NC_TAG_BIT), `_L2_NC_MEM_TAG_WIDTH)
`define L2_MEM_TAG_WIDTH ((`L2_ENABLE) ? `_L2_MEM_TAG_WIDTH : (`L1_MEM_TAG_WIDTH + `CLOG2(`L2_NUM_REQS)))
////////////////////////// L3cache Configurable Knobs /////////////////////////
// Cache ID
`define L3_CACHE_ID 0
// Word size in bytes
`define L3_WORD_SIZE `L2_CACHE_LINE_SIZE
// Block size in bytes
`define L3_CACHE_LINE_SIZE ((`L3_ENABLE) ? `MEM_BLOCK_SIZE : `L3_WORD_SIZE)
// Input request tag bits
`define L3_CORE_TAG_WIDTH (`L2_CORE_TAG_WIDTH + `CLOG2(`NUM_CLUSTERS))
// Memory request data bits
`define L3_MEM_DATA_WIDTH (`L3_CACHE_LINE_SIZE * 8)
// Memory request address bits
`define L3_MEM_ADDR_WIDTH (32 - `CLOG2(`L3_CACHE_LINE_SIZE))
// Memory byte enable bits
`define L3_MEM_BYTEEN_WIDTH `L3_CACHE_LINE_SIZE
// Input request size
`define L3_NUM_REQS `NUM_CLUSTERS
// Memory request tag bits
`define _L3_MEM_ADDR_RATIO_W $clog2(`L3_CACHE_LINE_SIZE / `L3_WORD_SIZE)
`define _L3_NC_MEM_TAG_WIDTH ($clog2(`L3_NUM_REQS) + `_L3_MEM_ADDR_RATIO_W + `L2_MEM_TAG_WIDTH)
`define _L3_MEM_TAG_WIDTH `MAX((`CLOG2(`L3_NUM_BANKS) + `CLOG2(`L3_MSHR_SIZE) + `NC_TAG_BIT), `_L3_NC_MEM_TAG_WIDTH)
`define L3_MEM_TAG_WIDTH ((`L3_ENABLE) ? `_L3_MEM_TAG_WIDTH : (`L2_MEM_TAG_WIDTH + `CLOG2(`L3_NUM_REQS)))
`define ARB_SEL_BITS(I, O) ((I > O) ? `CLOG2((I + O - 1) / O) : 0)
///////////////////////////////////////////////////////////////////////////////
`define VX_MEM_BYTEEN_WIDTH `L3_MEM_BYTEEN_WIDTH
`define VX_MEM_ADDR_WIDTH `L3_MEM_ADDR_WIDTH
`define VX_MEM_DATA_WIDTH `L3_MEM_DATA_WIDTH
`define VX_MEM_TAG_WIDTH `L3_MEM_TAG_WIDTH
`define VX_CORE_TAG_WIDTH `L3_CORE_TAG_WIDTH
`define VX_CSR_ID_WIDTH `LOG2UP(`NUM_CLUSTERS * `NUM_CORES)
`define CACHE_MEM_TAG_WIDTH(mshr_size, num_banks) \
(`CLOG2(mshr_size) + `CLOG2(num_banks) + `NC_TAG_BITS)
`define CACHE_NC_BYPASS_TAG_WIDTH(num_reqs, line_size, word_size, tag_width) \
(`CLOG2(num_reqs) + `CLOG2(line_size / word_size) + tag_width)
`define TO_FULL_ADDR(x) {x, (32-$bits(x))'(0)}
`define CACHE_BYPASS_TAG_WIDTH(num_reqs, line_size, word_size, tag_width) \
(`CACHE_NC_BYPASS_TAG_WIDTH(num_reqs, line_size, word_size, tag_width) + `NC_TAG_BITS)
`define CACHE_NC_MEM_TAG_WIDTH(mshr_size, num_banks, num_reqs, line_size, word_size, tag_width) \
`MAX(`CACHE_MEM_TAG_WIDTH(mshr_size, num_banks), `CACHE_NC_BYPASS_TAG_WIDTH(num_reqs, line_size, word_size, tag_width))
///////////////////////////////////////////////////////////////////////////////
`include "VX_fpu_types.vh"
`include "VX_gpu_types.vh"
`define CACHE_CLUSTER_CORE_ARB_TAG(tag_width, num_inputs, num_caches) \
(tag_width + `ARB_SEL_BITS(num_inputs, `UP(num_caches)))
`define CACHE_CLUSTER_MEM_ARB_TAG(tag_width, num_caches) \
(tag_width + `ARB_SEL_BITS(`UP(num_caches), 1))
`define CACHE_CLUSTER_MEM_TAG_WIDTH(mshr_size, num_banks, num_caches) \
`CACHE_CLUSTER_MEM_ARB_TAG(`CACHE_MEM_TAG_WIDTH(mshr_size, num_banks), num_caches)
`define CACHE_CLUSTER_NC_BYPASS_TAG_WIDTH(num_reqs, line_size, word_size, tag_width, num_inputs, num_caches) \
`CACHE_CLUSTER_MEM_ARB_TAG((`CLOG2(num_reqs) + `CLOG2(line_size / word_size) + `CACHE_CLUSTER_CORE_ARB_TAG(tag_width, num_inputs, num_caches)), num_caches)
`define CACHE_CLUSTER_BYPASS_TAG_WIDTH(num_reqs, line_size, word_size, tag_width, num_inputs, num_caches) \
`CACHE_CLUSTER_MEM_ARB_TAG((`CACHE_NC_BYPASS_TAG_WIDTH(num_reqs, line_size, word_size, `CACHE_CLUSTER_CORE_ARB_TAG(tag_width, num_inputs, num_caches)) + `NC_TAG_BITS), num_caches)
`define CACHE_CLUSTER_NC_MEM_TAG_WIDTH(mshr_size, num_banks, num_reqs, line_size, word_size, tag_width, num_inputs, num_caches) \
`CACHE_CLUSTER_MEM_ARB_TAG(`MAX(`CACHE_MEM_TAG_WIDTH(mshr_size, num_banks), `CACHE_NC_BYPASS_TAG_WIDTH(num_reqs, line_size, word_size, `CACHE_CLUSTER_CORE_ARB_TAG(tag_width, num_inputs, num_caches))), num_caches)
///////////////////////////////////////////////////////////////////////////////
`ifdef ICACHE_ENABLE
`define L1_ENABLE
`endif
`ifdef DCACHE_ENABLE
`define L1_ENABLE
`endif
`define VX_MEM_BYTEEN_WIDTH `L3_LINE_SIZE
`define VX_MEM_ADDR_WIDTH (`MEM_ADDR_WIDTH - `CLOG2(`L3_LINE_SIZE))
`define VX_MEM_DATA_WIDTH (`L3_LINE_SIZE * 8)
`define VX_MEM_TAG_WIDTH L3_MEM_TAG_WIDTH
`define VX_DCR_ADDR_WIDTH `VX_DCR_ADDR_BITS
`define VX_DCR_DATA_WIDTH 32
`define TO_FULL_ADDR(x) {x, (`MEM_ADDR_WIDTH-$bits(x))'(0)}
///////////////////////////////////////////////////////////////////////////////
`define BUFFER_EX(dst, src, ena, latency) \
VX_pipe_register #( \
.DATAW ($bits(dst)), \
.RESETW ($bits(dst)), \
.DEPTH (latency) \
) __``dst ( \
.clk (clk), \
.reset (reset), \
.enable (ena), \
.data_in (src), \
.data_out (dst) \
)
`define BUFFER(dst, src) `BUFFER_EX(dst, src, 1'b1, 1)
`define POP_COUNT_EX(out, in, model) \
VX_popcount #( \
.N ($bits(in)), \
.MODEL (model) \
) __``out ( \
.data_in (in), \
.data_out (out) \
)
`define POP_COUNT(out, in) `POP_COUNT_EX(out, in, 1)
`define ASSIGN_VX_MEM_BUS_IF(dst, src) \
assign dst.req_valid = src.req_valid; \
assign dst.req_data = src.req_data; \
assign src.req_ready = dst.req_ready; \
assign src.rsp_valid = dst.rsp_valid; \
assign src.rsp_data = dst.rsp_data; \
assign dst.rsp_ready = src.rsp_ready
`define ASSIGN_VX_MEM_BUS_IF_X(dst, src, TD, TS) \
assign dst.req_valid = src.req_valid; \
assign dst.req_data.rw = src.req_data.rw; \
assign dst.req_data.byteen = src.req_data.byteen; \
assign dst.req_data.addr = src.req_data.addr; \
assign dst.req_data.data = src.req_data.data; \
if (TD != TS) \
assign dst.req_data.tag = {src.req_data.tag, {(TD-TS){1'b0}}}; \
else \
assign dst.req_data.tag = src.req_data.tag; \
assign src.req_ready = dst.req_ready; \
assign src.rsp_valid = dst.rsp_valid; \
assign src.rsp_data.data = dst.rsp_data.data; \
assign src.rsp_data.tag = dst.rsp_data.tag[TD-1 -: TS]; \
assign dst.rsp_ready = src.rsp_ready
`define BUFFER_DCR_BUS_IF(dst, src, enable) \
logic [(1 + `VX_DCR_ADDR_WIDTH + `VX_DCR_DATA_WIDTH)-1:0] __``dst; \
if (enable) begin \
always @(posedge clk) begin \
__``dst <= {src.write_valid, src.write_addr, src.write_data}; \
end \
end else begin \
assign __``dst = {src.write_valid, src.write_addr, src.write_data}; \
end \
VX_dcr_bus_if dst(); \
assign {dst.write_valid, dst.write_addr, dst.write_data} = __``dst
`define PERF_COUNTER_ADD(dst, src, field, width, dst_count, src_count, reg_enable) \
for (genvar __d = 0; __d < dst_count; ++__d) begin \
localparam __count = ((src_count > dst_count) ? ((src_count + dst_count - 1) / dst_count) : 1); \
wire [__count-1:0][width-1:0] __reduce_add_i_``src``field; \
wire [width-1:0] __reduce_add_o_``dst``field; \
for (genvar __i = 0; __i < __count; ++__i) begin \
assign __reduce_add_i_``src``field[__i] = ``src[__d * __count + __i].``field; \
end \
VX_reduce #(.DATAW_IN(width), .N(__count), .OP("+")) __reduce_add_``dst``field ( \
__reduce_add_i_``src``field, \
__reduce_add_o_``dst``field \
); \
if (reg_enable) begin \
reg [width-1:0] __reduce_add_r_``dst``field; \
always @(posedge clk) begin \
if (reset) begin \
__reduce_add_r_``dst``field <= '0; \
end else begin \
__reduce_add_r_``dst``field <= __reduce_add_o_``dst``field; \
end \
end \
assign ``dst[__d].``field = __reduce_add_r_``dst``field; \
end else begin \
assign ``dst[__d].``field = __reduce_add_o_``dst``field; \
end \
end
`define ASSIGN_BLOCKED_WID(dst, src, block_idx, block_size) \
if (block_size != 1) begin \
if (block_size != `NUM_WARPS) begin \
assign dst = {src[`NW_WIDTH-1:`CLOG2(block_size)], `CLOG2(block_size)'(block_idx)}; \
end else begin \
assign dst = `NW_WIDTH'(block_idx); \
end \
end else begin \
assign dst = src; \
end
`define TO_DISPATCH_DATA(data, tid) { \
data.uuid, \
data.wis, \
data.tmask, \
data.op_type, \
data.op_mod, \
data.wb, \
data.use_PC, \
data.use_imm, \
data.PC, \
data.imm, \
data.rd, \
tid, \
data.rs1_data, \
data.rs2_data, \
data.rs3_data}
///////////////////////////////////////////////////////////////////////////////
`endif // VX_DEFINE_VH


@@ -1,159 +0,0 @@
`include "VX_define.vh"
module VX_dispatch (
input wire clk,
input wire reset,
// inputs
VX_ibuffer_if.slave ibuffer_if,
VX_gpr_rsp_if.slave gpr_rsp_if,
// outputs
VX_alu_req_if.master alu_req_if,
VX_lsu_req_if.master lsu_req_if,
VX_csr_req_if.master csr_req_if,
`ifdef EXT_F_ENABLE
VX_fpu_req_if.master fpu_req_if,
`endif
VX_gpu_req_if.master gpu_req_if
);
wire [`NT_BITS-1:0] tid;
wire alu_req_ready;
wire lsu_req_ready;
wire csr_req_ready;
`ifdef EXT_F_ENABLE
wire fpu_req_ready;
`endif
wire gpu_req_ready;
VX_lzc #(
.N (`NUM_THREADS)
) tid_select (
.in_i (ibuffer_if.tmask),
.cnt_o (tid),
`UNUSED_PIN (valid_o)
);
wire [31:0] next_PC = ibuffer_if.PC + 4;
// ALU unit
wire alu_req_valid = ibuffer_if.valid && (ibuffer_if.ex_type == `EX_ALU);
wire [`INST_ALU_BITS-1:0] alu_op_type = `INST_ALU_BITS'(ibuffer_if.op_type);
VX_skid_buffer #(
.DATAW (`UUID_BITS + `NW_BITS + `NUM_THREADS + 32 + 32 + `INST_ALU_BITS + `INST_MOD_BITS + 32 + 1 + 1 + `NR_BITS + 1 + `NT_BITS + (2 * `NUM_THREADS * 32)),
.OUT_REG (1)
) alu_buffer (
.clk (clk),
.reset (reset),
.valid_in (alu_req_valid),
.ready_in (alu_req_ready),
.data_in ({ibuffer_if.uuid, ibuffer_if.wid, ibuffer_if.tmask, ibuffer_if.PC, next_PC, alu_op_type, ibuffer_if.op_mod, ibuffer_if.imm, ibuffer_if.use_PC, ibuffer_if.use_imm, ibuffer_if.rd, ibuffer_if.wb, tid, gpr_rsp_if.rs1_data, gpr_rsp_if.rs2_data}),
.data_out ({alu_req_if.uuid, alu_req_if.wid, alu_req_if.tmask, alu_req_if.PC, alu_req_if.next_PC, alu_req_if.op_type, alu_req_if.op_mod, alu_req_if.imm, alu_req_if.use_PC, alu_req_if.use_imm, alu_req_if.rd, alu_req_if.wb, alu_req_if.tid, alu_req_if.rs1_data, alu_req_if.rs2_data}),
.valid_out (alu_req_if.valid),
.ready_out (alu_req_if.ready)
);
// lsu unit
wire lsu_req_valid = ibuffer_if.valid && (ibuffer_if.ex_type == `EX_LSU);
wire [`INST_LSU_BITS-1:0] lsu_op_type = `INST_LSU_BITS'(ibuffer_if.op_type);
wire lsu_is_fence = `INST_LSU_IS_FENCE(ibuffer_if.op_mod);
wire lsu_is_prefetch = `INST_LSU_IS_PREFETCH(ibuffer_if.op_mod);
VX_skid_buffer #(
.DATAW (`UUID_BITS + `NW_BITS + `NUM_THREADS + 32 + `INST_LSU_BITS + 1 + 32 + `NR_BITS + 1 + (2 * `NUM_THREADS * 32) + 1),
.OUT_REG (1)
) lsu_buffer (
.clk (clk),
.reset (reset),
.valid_in (lsu_req_valid),
.ready_in (lsu_req_ready),
.data_in ({ibuffer_if.uuid, ibuffer_if.wid, ibuffer_if.tmask, ibuffer_if.PC, lsu_op_type, lsu_is_fence, ibuffer_if.imm, ibuffer_if.rd, ibuffer_if.wb, gpr_rsp_if.rs1_data, gpr_rsp_if.rs2_data, lsu_is_prefetch}),
.data_out ({lsu_req_if.uuid, lsu_req_if.wid, lsu_req_if.tmask, lsu_req_if.PC, lsu_req_if.op_type, lsu_req_if.is_fence, lsu_req_if.offset, lsu_req_if.rd, lsu_req_if.wb, lsu_req_if.base_addr, lsu_req_if.store_data, lsu_req_if.is_prefetch}),
.valid_out (lsu_req_if.valid),
.ready_out (lsu_req_if.ready)
);
// csr unit
wire csr_req_valid = ibuffer_if.valid && (ibuffer_if.ex_type == `EX_CSR);
wire [`INST_CSR_BITS-1:0] csr_op_type = `INST_CSR_BITS'(ibuffer_if.op_type);
wire [`CSR_ADDR_BITS-1:0] csr_addr = ibuffer_if.imm[`CSR_ADDR_BITS-1:0];
wire [`NRI_BITS-1:0] csr_imm = ibuffer_if.imm[`CSR_ADDR_BITS +: `NRI_BITS];
wire [31:0] csr_rs1_data = gpr_rsp_if.rs1_data[tid];
VX_skid_buffer #(
.DATAW (`UUID_BITS + `NW_BITS + `NUM_THREADS + 32 + `INST_CSR_BITS + `CSR_ADDR_BITS + `NR_BITS + 1 + 1 + `NRI_BITS + 32),
.OUT_REG (1)
) csr_buffer (
.clk (clk),
.reset (reset),
.valid_in (csr_req_valid),
.ready_in (csr_req_ready),
.data_in ({ibuffer_if.uuid, ibuffer_if.wid, ibuffer_if.tmask, ibuffer_if.PC, csr_op_type, csr_addr, ibuffer_if.rd, ibuffer_if.wb, ibuffer_if.use_imm, csr_imm, csr_rs1_data}),
.data_out ({csr_req_if.uuid, csr_req_if.wid, csr_req_if.tmask, csr_req_if.PC, csr_req_if.op_type, csr_req_if.addr, csr_req_if.rd, csr_req_if.wb, csr_req_if.use_imm, csr_req_if.imm, csr_req_if.rs1_data}),
.valid_out (csr_req_if.valid),
.ready_out (csr_req_if.ready)
);
// fpu unit
`ifdef EXT_F_ENABLE
wire fpu_req_valid = ibuffer_if.valid && (ibuffer_if.ex_type == `EX_FPU);
wire [`INST_FPU_BITS-1:0] fpu_op_type = `INST_FPU_BITS'(ibuffer_if.op_type);
VX_skid_buffer #(
.DATAW (`UUID_BITS + `NW_BITS + `NUM_THREADS + 32 + `INST_FPU_BITS + `INST_MOD_BITS + `NR_BITS + 1 + (3 * `NUM_THREADS * 32)),
.OUT_REG (1)
) fpu_buffer (
.clk (clk),
.reset (reset),
.valid_in (fpu_req_valid),
.ready_in (fpu_req_ready),
.data_in ({ibuffer_if.uuid, ibuffer_if.wid, ibuffer_if.tmask, ibuffer_if.PC, fpu_op_type, ibuffer_if.op_mod, ibuffer_if.rd, ibuffer_if.wb, gpr_rsp_if.rs1_data, gpr_rsp_if.rs2_data, gpr_rsp_if.rs3_data}),
.data_out ({fpu_req_if.uuid, fpu_req_if.wid, fpu_req_if.tmask, fpu_req_if.PC, fpu_req_if.op_type, fpu_req_if.op_mod, fpu_req_if.rd, fpu_req_if.wb, fpu_req_if.rs1_data, fpu_req_if.rs2_data, fpu_req_if.rs3_data}),
.valid_out (fpu_req_if.valid),
.ready_out (fpu_req_if.ready)
);
`else
`UNUSED_VAR (gpr_rsp_if.rs3_data)
`endif
// gpu unit
wire gpu_req_valid = ibuffer_if.valid && (ibuffer_if.ex_type == `EX_GPU);
wire [`INST_GPU_BITS-1:0] gpu_op_type = `INST_GPU_BITS'(ibuffer_if.op_type);
VX_skid_buffer #(
.DATAW (`UUID_BITS + `NW_BITS + `NUM_THREADS + 32 + 32 + `INST_GPU_BITS + `INST_MOD_BITS + `NR_BITS + 1 + `NT_BITS + (3 * `NUM_THREADS * 32)),
.OUT_REG (1)
) gpu_buffer (
.clk (clk),
.reset (reset),
.valid_in (gpu_req_valid),
.ready_in (gpu_req_ready),
.data_in ({ibuffer_if.uuid, ibuffer_if.wid, ibuffer_if.tmask, ibuffer_if.PC, next_PC, gpu_op_type, ibuffer_if.op_mod, ibuffer_if.rd, ibuffer_if.wb, tid, gpr_rsp_if.rs1_data, gpr_rsp_if.rs2_data, gpr_rsp_if.rs3_data}),
.data_out ({gpu_req_if.uuid, gpu_req_if.wid, gpu_req_if.tmask, gpu_req_if.PC, gpu_req_if.next_PC, gpu_req_if.op_type, gpu_req_if.op_mod, gpu_req_if.rd, gpu_req_if.wb, gpu_req_if.tid, gpu_req_if.rs1_data, gpu_req_if.rs2_data, gpu_req_if.rs3_data}),
.valid_out (gpu_req_if.valid),
.ready_out (gpu_req_if.ready)
);
// can take next request?
reg ready_r;
always @(*) begin
case (ibuffer_if.ex_type)
`EX_ALU: ready_r = alu_req_ready;
`EX_LSU: ready_r = lsu_req_ready;
`EX_CSR: ready_r = csr_req_ready;
`ifdef EXT_F_ENABLE
`EX_FPU: ready_r = fpu_req_ready;
`endif
`EX_GPU: ready_r = gpu_req_ready;
default: ready_r = 1'b1; // ignore NOPs
endcase
end
assign ibuffer_if.ready = ready_r;
endmodule


@@ -1,237 +0,0 @@
`include "VX_define.vh"
module VX_execute #(
parameter CORE_ID = 0
) (
`SCOPE_IO_VX_execute
input wire clk,
input wire reset,
// Dcache interface
VX_dcache_req_if.master dcache_req_if,
VX_dcache_rsp_if.slave dcache_rsp_if,
// commit interface
VX_cmt_to_csr_if.slave cmt_to_csr_if,
// fetch interface
VX_fetch_to_csr_if.slave fetch_to_csr_if,
`ifdef PERF_ENABLE
VX_perf_memsys_if.slave perf_memsys_if,
VX_perf_pipeline_if.slave perf_pipeline_if,
`endif
// inputs
VX_alu_req_if.slave alu_req_if,
VX_lsu_req_if.slave lsu_req_if,
VX_csr_req_if.slave csr_req_if,
`ifdef EXT_F_ENABLE
VX_fpu_req_if.slave fpu_req_if,
`endif
VX_gpu_req_if.slave gpu_req_if,
// outputs
VX_branch_ctl_if.master branch_ctl_if,
VX_warp_ctl_if.master warp_ctl_if,
VX_commit_if.master alu_commit_if,
VX_commit_if.master ld_commit_if,
VX_commit_if.master st_commit_if,
VX_commit_if.master csr_commit_if,
`ifdef EXT_F_ENABLE
VX_commit_if.master fpu_commit_if,
`endif
VX_commit_if.master gpu_commit_if,
input wire busy
);
`ifdef EXT_TEX_ENABLE
VX_dcache_req_if #(
.NUM_REQS (`NUM_THREADS),
.WORD_SIZE (4),
.TAG_WIDTH (`LSU_TEX_DCACHE_TAG_BITS)
) lsu_dcache_req_if();
VX_dcache_rsp_if #(
.NUM_REQS (`NUM_THREADS),
.WORD_SIZE (4),
.TAG_WIDTH (`LSU_TEX_DCACHE_TAG_BITS)
) lsu_dcache_rsp_if();
VX_dcache_req_if #(
.NUM_REQS (`NUM_THREADS),
.WORD_SIZE (4),
.TAG_WIDTH (`LSU_TEX_DCACHE_TAG_BITS)
) tex_dcache_req_if();
VX_dcache_rsp_if #(
.NUM_REQS (`NUM_THREADS),
.WORD_SIZE (4),
.TAG_WIDTH (`LSU_TEX_DCACHE_TAG_BITS)
) tex_dcache_rsp_if();
VX_tex_csr_if tex_csr_if();
`ifdef PERF_ENABLE
VX_perf_tex_if perf_tex_if();
`endif
VX_cache_arb #(
.NUM_REQS (2),
.LANES (`NUM_THREADS),
.DATA_SIZE (4),
.TAG_IN_WIDTH (`LSU_TEX_DCACHE_TAG_BITS),
.TAG_SEL_IDX (`NC_TAG_BIT + `SM_ENABLE)
) tex_lsu_arb (
.clk (clk),
.reset (reset),
// Tex/LSU request
.req_valid_in ({tex_dcache_req_if.valid, lsu_dcache_req_if.valid}),
.req_rw_in ({tex_dcache_req_if.rw, lsu_dcache_req_if.rw}),
.req_byteen_in ({tex_dcache_req_if.byteen, lsu_dcache_req_if.byteen}),
.req_addr_in ({tex_dcache_req_if.addr, lsu_dcache_req_if.addr}),
.req_data_in ({tex_dcache_req_if.data, lsu_dcache_req_if.data}),
.req_tag_in ({tex_dcache_req_if.tag, lsu_dcache_req_if.tag}),
.req_ready_in ({tex_dcache_req_if.ready, lsu_dcache_req_if.ready}),
// Dcache request
.req_valid_out (dcache_req_if.valid),
.req_rw_out (dcache_req_if.rw),
.req_byteen_out (dcache_req_if.byteen),
.req_addr_out (dcache_req_if.addr),
.req_data_out (dcache_req_if.data),
.req_tag_out (dcache_req_if.tag),
.req_ready_out (dcache_req_if.ready),
// Dcache response
.rsp_valid_in (dcache_rsp_if.valid),
.rsp_tmask_in (dcache_rsp_if.tmask),
.rsp_tag_in (dcache_rsp_if.tag),
.rsp_data_in (dcache_rsp_if.data),
.rsp_ready_in (dcache_rsp_if.ready),
// Tex/LSU response
.rsp_valid_out ({tex_dcache_rsp_if.valid, lsu_dcache_rsp_if.valid}),
.rsp_tmask_out ({tex_dcache_rsp_if.tmask, lsu_dcache_rsp_if.tmask}),
.rsp_data_out ({tex_dcache_rsp_if.data, lsu_dcache_rsp_if.data}),
.rsp_tag_out ({tex_dcache_rsp_if.tag, lsu_dcache_rsp_if.tag}),
.rsp_ready_out ({tex_dcache_rsp_if.ready, lsu_dcache_rsp_if.ready})
);
`endif
`ifdef EXT_F_ENABLE
wire [`NUM_WARPS-1:0] csr_pending;
wire [`NUM_WARPS-1:0] fpu_pending;
VX_fpu_to_csr_if fpu_to_csr_if();
`endif
`RESET_RELAY (alu_reset);
`RESET_RELAY (lsu_reset);
`RESET_RELAY (csr_reset);
`RESET_RELAY (gpu_reset);
VX_alu_unit #(
.CORE_ID(CORE_ID)
) alu_unit (
.clk (clk),
.reset (alu_reset),
.alu_req_if (alu_req_if),
.branch_ctl_if (branch_ctl_if),
.alu_commit_if (alu_commit_if)
);
VX_lsu_unit #(
.CORE_ID(CORE_ID)
) lsu_unit (
`SCOPE_BIND_VX_execute_lsu_unit
.clk (clk),
.reset (lsu_reset),
`ifdef EXT_TEX_ENABLE
.dcache_req_if (lsu_dcache_req_if),
.dcache_rsp_if (lsu_dcache_rsp_if),
`else
.dcache_req_if (dcache_req_if),
.dcache_rsp_if (dcache_rsp_if),
`endif
.lsu_req_if (lsu_req_if),
.ld_commit_if (ld_commit_if),
.st_commit_if (st_commit_if)
);
VX_csr_unit #(
.CORE_ID(CORE_ID)
) csr_unit (
.clk (clk),
.reset (csr_reset),
`ifdef PERF_ENABLE
`ifdef EXT_TEX_ENABLE
.perf_tex_if (perf_tex_if),
`endif
.perf_memsys_if (perf_memsys_if),
.perf_pipeline_if(perf_pipeline_if),
`endif
.cmt_to_csr_if (cmt_to_csr_if),
.fetch_to_csr_if(fetch_to_csr_if),
.csr_req_if (csr_req_if),
.csr_commit_if (csr_commit_if),
`ifdef EXT_F_ENABLE
.fpu_to_csr_if (fpu_to_csr_if),
.fpu_pending (fpu_pending),
.pending (csr_pending),
`else
`UNUSED_PIN (pending),
`endif
`ifdef EXT_TEX_ENABLE
.tex_csr_if (tex_csr_if),
`endif
.busy (busy)
);
`ifdef EXT_F_ENABLE
`RESET_RELAY (fpu_reset);
VX_fpu_unit #(
.CORE_ID(CORE_ID)
) fpu_unit (
.clk (clk),
.reset (fpu_reset),
.fpu_req_if (fpu_req_if),
.fpu_to_csr_if (fpu_to_csr_if),
.fpu_commit_if (fpu_commit_if),
.csr_pending (csr_pending),
.pending (fpu_pending)
);
`endif
VX_gpu_unit #(
.CORE_ID(CORE_ID)
) gpu_unit (
`SCOPE_BIND_VX_execute_gpu_unit
.clk (clk),
.reset (gpu_reset),
.gpu_req_if (gpu_req_if),
`ifdef EXT_TEX_ENABLE
`ifdef PERF_ENABLE
.perf_tex_if (perf_tex_if),
`endif
.tex_csr_if (tex_csr_if),
.dcache_req_if (tex_dcache_req_if),
.dcache_rsp_if (tex_dcache_rsp_if),
`endif
.warp_ctl_if (warp_ctl_if),
.gpu_commit_if (gpu_commit_if)
);
// special workaround to get RISC-V tests Pass/Fail status
wire ebreak /* verilator public */;
assign ebreak = alu_req_if.valid && alu_req_if.ready
&& `INST_ALU_IS_BR(alu_req_if.op_mod)
&& (`INST_BR_BITS'(alu_req_if.op_type) == `INST_BR_EBREAK
|| `INST_BR_BITS'(alu_req_if.op_type) == `INST_BR_ECALL);
endmodule


@@ -1,68 +0,0 @@
`include "VX_define.vh"
module VX_fetch #(
parameter CORE_ID = 0
) (
`SCOPE_IO_VX_fetch
input wire clk,
input wire reset,
// Icache interface
VX_icache_req_if.master icache_req_if,
VX_icache_rsp_if.slave icache_rsp_if,
// inputs
VX_wstall_if.slave wstall_if,
VX_join_if.slave join_if,
VX_branch_ctl_if.slave branch_ctl_if,
VX_warp_ctl_if.slave warp_ctl_if,
// outputs
VX_ifetch_rsp_if.master ifetch_rsp_if,
// csr interface
VX_fetch_to_csr_if.master fetch_to_csr_if,
// busy status
output wire busy
);
VX_ifetch_req_if ifetch_req_if();
VX_warp_sched #(
.CORE_ID(CORE_ID)
) warp_sched (
`SCOPE_BIND_VX_fetch_warp_sched
.clk (clk),
.reset (reset),
.warp_ctl_if (warp_ctl_if),
.wstall_if (wstall_if),
.join_if (join_if),
.branch_ctl_if (branch_ctl_if),
.ifetch_req_if (ifetch_req_if),
.fetch_to_csr_if (fetch_to_csr_if),
.busy (busy)
);
VX_icache_stage #(
.CORE_ID(CORE_ID)
) icache_stage (
`SCOPE_BIND_VX_fetch_icache_stage
.clk (clk),
.reset (reset),
.icache_rsp_if (icache_rsp_if),
.icache_req_if (icache_req_if),
.ifetch_req_if (ifetch_req_if),
.ifetch_rsp_if (ifetch_rsp_if)
);
endmodule


@@ -1,219 +0,0 @@
`include "VX_define.vh"
module VX_fpu_unit #(
parameter CORE_ID = 0
) (
input wire clk,
input wire reset,
VX_fpu_req_if.slave fpu_req_if,
VX_fpu_to_csr_if.master fpu_to_csr_if,
VX_commit_if.master fpu_commit_if,
input wire[`NUM_WARPS-1:0] csr_pending,
output wire[`NUM_WARPS-1:0] pending
);
import fpu_types::*;
`UNUSED_PARAM (CORE_ID)
localparam FPUQ_BITS = `LOG2UP(`FPUQ_SIZE);
wire ready_in;
wire valid_out;
wire ready_out;
wire [`UUID_BITS-1:0] rsp_uuid;
wire [`NW_BITS-1:0] rsp_wid;
wire [`NUM_THREADS-1:0] rsp_tmask;
wire [31:0] rsp_PC;
wire [`NR_BITS-1:0] rsp_rd;
wire rsp_wb;
wire has_fflags;
fflags_t [`NUM_THREADS-1:0] fflags;
wire [`NUM_THREADS-1:0][31:0] result;
wire [FPUQ_BITS-1:0] tag_in, tag_out;
wire fpuq_full;
wire fpuq_push = fpu_req_if.valid && fpu_req_if.ready;
wire fpuq_pop = valid_out && ready_out;
VX_index_buffer #(
.DATAW (`UUID_BITS + `NW_BITS + `NUM_THREADS + 32 + `NR_BITS + 1),
.SIZE (`FPUQ_SIZE)
) req_metadata (
.clk (clk),
.reset (reset),
.acquire_slot (fpuq_push),
.write_addr (tag_in),
.read_addr (tag_out),
.release_addr (tag_out),
.write_data ({fpu_req_if.uuid, fpu_req_if.wid, fpu_req_if.tmask, fpu_req_if.PC, fpu_req_if.rd, fpu_req_if.wb}),
.read_data ({rsp_uuid, rsp_wid, rsp_tmask, rsp_PC, rsp_rd, rsp_wb}),
.release_slot (fpuq_pop),
.full (fpuq_full),
`UNUSED_PIN (empty)
);
// can accept new request?
assign fpu_req_if.ready = ready_in && ~fpuq_full && !csr_pending[fpu_req_if.wid];
wire valid_in = fpu_req_if.valid && ~fpuq_full && !csr_pending[fpu_req_if.wid];
// resolve dynamic FRM from CSR
assign fpu_to_csr_if.read_wid = fpu_req_if.wid;
wire [`INST_FRM_BITS-1:0] fpu_frm = (fpu_req_if.op_mod == `INST_FRM_DYN) ? fpu_to_csr_if.read_frm : fpu_req_if.op_mod;
`ifdef FPU_DPI
VX_fpu_dpi #(
.TAGW (FPUQ_BITS)
) fpu_dpi (
.clk (clk),
.reset (reset),
.valid_in (valid_in),
.ready_in (ready_in),
.tag_in (tag_in),
.op_type (fpu_req_if.op_type),
.frm (fpu_frm),
.dataa (fpu_req_if.rs1_data),
.datab (fpu_req_if.rs2_data),
.datac (fpu_req_if.rs3_data),
.result (result),
.has_fflags (has_fflags),
.fflags (fflags),
.tag_out (tag_out),
.ready_out (ready_out),
.valid_out (valid_out)
);
`elsif FPU_FPNEW
VX_fpu_fpnew #(
.FMULADD (1),
.FDIVSQRT (1),
.FNONCOMP (1),
.FCONV (1),
.TAGW (FPUQ_BITS)
) fpu_fpnew (
.clk (clk),
.reset (reset),
.valid_in (valid_in),
.ready_in (ready_in),
.tag_in (tag_in),
.op_type (fpu_req_if.op_type),
.frm (fpu_frm),
.dataa (fpu_req_if.rs1_data),
.datab (fpu_req_if.rs2_data),
.datac (fpu_req_if.rs3_data),
.result (result),
.has_fflags (has_fflags),
.fflags (fflags),
.tag_out (tag_out),
.ready_out (ready_out),
.valid_out (valid_out)
);
`else
VX_fpu_fpga #(
.TAGW (FPUQ_BITS)
) fpu_fpga (
.clk (clk),
.reset (reset),
.valid_in (valid_in),
.ready_in (ready_in),
.tag_in (tag_in),
.op_type (fpu_req_if.op_type),
.frm (fpu_frm),
.dataa (fpu_req_if.rs1_data),
.datab (fpu_req_if.rs2_data),
.datac (fpu_req_if.rs3_data),
.result (result),
.has_fflags (has_fflags),
.fflags (fflags),
.tag_out (tag_out),
.ready_out (ready_out),
.valid_out (valid_out)
);
`endif
reg has_fflags_r;
fflags_t fflags_r;
fflags_t rsp_fflags;
always @(*) begin
rsp_fflags = '0;
for (integer i = 0; i < `NUM_THREADS; i++) begin
if (rsp_tmask[i]) begin
rsp_fflags.NX |= fflags[i].NX;
rsp_fflags.UF |= fflags[i].UF;
rsp_fflags.OF |= fflags[i].OF;
rsp_fflags.DZ |= fflags[i].DZ;
rsp_fflags.NV |= fflags[i].NV;
end
end
end
wire stall_out = ~fpu_commit_if.ready && fpu_commit_if.valid;
VX_pipe_register #(
.DATAW (1 + `UUID_BITS + `NW_BITS + `NUM_THREADS + 32 + `NR_BITS + 1 + (`NUM_THREADS * 32) + 1 + `FFLAGS_BITS),
.RESETW (1)
) pipe_reg (
.clk (clk),
.reset (reset),
.enable (!stall_out),
.data_in ({valid_out, rsp_uuid, rsp_wid, rsp_tmask, rsp_PC, rsp_rd, rsp_wb, result, has_fflags, rsp_fflags}),
.data_out ({fpu_commit_if.valid, fpu_commit_if.uuid, fpu_commit_if.wid, fpu_commit_if.tmask, fpu_commit_if.PC, fpu_commit_if.rd, fpu_commit_if.wb, fpu_commit_if.data, has_fflags_r, fflags_r})
);
assign fpu_commit_if.eop = 1'b1;
assign ready_out = ~stall_out;
// CSR fflags Update
assign fpu_to_csr_if.write_enable = fpu_commit_if.valid && fpu_commit_if.ready && has_fflags_r;
assign fpu_to_csr_if.write_wid = fpu_commit_if.wid;
assign fpu_to_csr_if.write_fflags = fflags_r;
// pending request
reg [`NUM_WARPS-1:0] pending_r;
always @(posedge clk) begin
if (reset) begin
pending_r <= 0;
end else begin
if (fpu_commit_if.valid && fpu_commit_if.ready) begin
pending_r[fpu_commit_if.wid] <= 0;
end
if (fpu_req_if.valid && fpu_req_if.ready) begin
pending_r[fpu_req_if.wid] <= 1;
end
end
end
assign pending = pending_r;
endmodule


@@ -1,91 +0,0 @@
`include "VX_define.vh"
module VX_gpr_stage #(
parameter CORE_ID = 0
) (
input wire clk,
input wire reset,
// inputs
VX_writeback_if.slave writeback_if,
VX_gpr_req_if.slave gpr_req_if,
// outputs
VX_gpr_rsp_if.master gpr_rsp_if
);
`UNUSED_PARAM (CORE_ID)
`UNUSED_VAR (reset)
localparam RAM_SIZE = `NUM_WARPS * `NUM_REGS;
// ensure r0 never gets written, which can happen before the reset
wire write_enable = writeback_if.valid && (writeback_if.rd != 0);
wire [`NUM_THREADS-1:0] wren;
for (genvar i = 0; i < `NUM_THREADS; ++i) begin
assign wren[i] = write_enable && writeback_if.tmask[i];
end
wire [$clog2(RAM_SIZE)-1:0] waddr, raddr1, raddr2;
assign waddr = {writeback_if.wid, writeback_if.rd};
assign raddr1 = {gpr_req_if.wid, gpr_req_if.rs1};
assign raddr2 = {gpr_req_if.wid, gpr_req_if.rs2};
for (genvar i = 0; i < `NUM_THREADS; ++i) begin
VX_dp_ram #(
.DATAW (32),
.SIZE (RAM_SIZE),
.INIT_ENABLE (1),
.INIT_VALUE (0)
) dp_ram1 (
.clk (clk),
.wren (wren[i]),
.waddr (waddr),
.wdata (writeback_if.data[i]),
.raddr (raddr1),
.rdata (gpr_rsp_if.rs1_data[i])
);
VX_dp_ram #(
.DATAW (32),
.SIZE (RAM_SIZE),
.INIT_ENABLE (1),
.INIT_VALUE (0)
) dp_ram2 (
.clk (clk),
.wren (wren[i]),
.waddr (waddr),
.wdata (writeback_if.data[i]),
.raddr (raddr2),
.rdata (gpr_rsp_if.rs2_data[i])
);
end
`ifdef EXT_F_ENABLE
wire [$clog2(RAM_SIZE)-1:0] raddr3;
assign raddr3 = {gpr_req_if.wid, gpr_req_if.rs3};
for (genvar i = 0; i < `NUM_THREADS; ++i) begin
VX_dp_ram #(
.DATAW (32),
.SIZE (RAM_SIZE),
.INIT_ENABLE (1),
.INIT_VALUE (0)
) dp_ram3 (
.clk (clk),
.wren (wren[i]),
.waddr (waddr),
.wdata (writeback_if.data[i]),
.raddr (raddr3),
.rdata (gpr_rsp_if.rs3_data[i])
);
end
`else
`UNUSED_VAR (gpr_req_if.rs3)
assign gpr_rsp_if.rs3_data = 'x;
`endif
assign writeback_if.ready = 1'b1;
endmodule

248
hw/rtl/VX_gpu_pkg.sv Normal file

@@ -0,0 +1,248 @@
// Copyright © 2019-2023
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
`ifndef VX_GPU_PKG_VH
`define VX_GPU_PKG_VH
`include "VX_define.vh"
package VX_gpu_pkg;
typedef struct packed {
logic valid;
logic [`NUM_THREADS-1:0] tmask;
} tmc_t;
typedef struct packed {
logic valid;
logic [`NUM_WARPS-1:0] wmask;
logic [`XLEN-1:0] pc;
} wspawn_t;
typedef struct packed {
logic valid;
logic is_dvg;
logic [`NUM_THREADS-1:0] then_tmask;
logic [`NUM_THREADS-1:0] else_tmask;
logic [`XLEN-1:0] next_pc;
} split_t;
typedef struct packed {
logic valid;
logic is_dvg;
} join_t;
typedef struct packed {
logic valid;
logic [`NB_WIDTH-1:0] id;
logic is_global;
logic [1:0] domain;
logic [`NUM_WARPS-1:0] mask;
`ifdef GBAR_ENABLE
logic [`MAX(`NW_WIDTH, `NC_WIDTH)-1:0] size_m1;
`else
logic [`NW_WIDTH-1:0] size_m1;
`endif
} barrier_t;
localparam logic [1:0] BARRIER_ALL = 2'd0;
localparam logic [1:0] BARRIER_SCALAR = 2'd1;
localparam logic [1:0] BARRIER_TENSOR = 2'd2;
localparam logic [1:0] BARRIER_MASK = 2'd3;
localparam logic WU_DOMAIN_SCALAR = 1'b0;
localparam logic WU_DOMAIN_TENSOR = 1'b1;
typedef struct packed {
logic [`XLEN-1:0] startup_addr;
logic [7:0] mpm_class;
} base_dcrs_t;
typedef struct packed {
logic [`PERF_CTR_BITS-1:0] reads;
logic [`PERF_CTR_BITS-1:0] writes;
logic [`PERF_CTR_BITS-1:0] read_misses;
logic [`PERF_CTR_BITS-1:0] write_misses;
logic [`PERF_CTR_BITS-1:0] bank_stalls;
logic [`PERF_CTR_BITS-1:0] mshr_stalls;
logic [`PERF_CTR_BITS-1:0] mem_stalls;
logic [`PERF_CTR_BITS-1:0] crsp_stalls;
} cache_perf_t;
typedef struct packed {
logic [`PERF_CTR_BITS-1:0] reads;
logic [`PERF_CTR_BITS-1:0] writes;
logic [`PERF_CTR_BITS-1:0] latency;
} mem_perf_t;
/* verilator lint_off UNUSED */
////////////////////////// Icache Parameters //////////////////////////////
// Word size in bytes
localparam ICACHE_WORD_SIZE = 4;
localparam ICACHE_ADDR_WIDTH = (`MEM_ADDR_WIDTH - `CLOG2(ICACHE_WORD_SIZE));
// Block size in bytes
localparam ICACHE_LINE_SIZE = `L1_LINE_SIZE;
// Core request tag Id bits
localparam ICACHE_TAG_ID_BITS = `NW_WIDTH;
// Core request tag bits
localparam ICACHE_TAG_WIDTH = (1 + `UUID_WIDTH + ICACHE_TAG_ID_BITS);
// Memory request data bits
localparam ICACHE_MEM_DATA_WIDTH = (ICACHE_LINE_SIZE * 8);
// Memory request tag bits
`ifdef ICACHE_ENABLE
localparam ICACHE_MEM_TAG_WIDTH = `CACHE_CLUSTER_MEM_TAG_WIDTH(`ICACHE_MSHR_SIZE, 1, `NUM_ICACHES);
`else
localparam ICACHE_MEM_TAG_WIDTH = `CACHE_CLUSTER_BYPASS_TAG_WIDTH(1, ICACHE_LINE_SIZE, ICACHE_WORD_SIZE, ICACHE_TAG_WIDTH, `SOCKET_SIZE, `NUM_ICACHES);
`endif
////////////////////////// Dcache Parameters //////////////////////////////
// Word size in bytes
localparam DCACHE_WORD_SIZE = (`XLEN / 8);
localparam DCACHE_ADDR_WIDTH = (`MEM_ADDR_WIDTH - `CLOG2(DCACHE_WORD_SIZE));
// Block size in bytes
localparam DCACHE_LINE_SIZE = `L1_LINE_SIZE;
// Input request size
localparam DCACHE_NUM_REQS = `MAX(`DCACHE_NUM_BANKS, `SMEM_NUM_BANKS);
// Memory request size
localparam LSU_MEM_REQS = `NUM_LSU_LANES;
// Batch select bits
localparam DCACHE_NUM_BATCHES = ((LSU_MEM_REQS + DCACHE_NUM_REQS - 1) / DCACHE_NUM_REQS);
localparam DCACHE_BATCH_SEL_BITS = `CLOG2(DCACHE_NUM_BATCHES);
// Core request tag Id bits
localparam LSUQ_TAG_BITS = (`CLOG2(`LSUQ_SIZE) + DCACHE_BATCH_SEL_BITS);
localparam DCACHE_TAG_ID_BITS = (LSUQ_TAG_BITS + `CACHE_ADDR_TYPE_BITS);
// Core request tag bits
localparam DCACHE_TAG_WIDTH = (`UUID_WIDTH + DCACHE_TAG_ID_BITS);
localparam DCACHE_NOSM_TAG_WIDTH = (DCACHE_TAG_WIDTH - `SM_ENABLED);
// Memory request data bits
localparam DCACHE_MEM_DATA_WIDTH = (DCACHE_LINE_SIZE * 8);
// Memory request tag bits
`ifdef DCACHE_ENABLE
localparam DCACHE_MEM_TAG_WIDTH = `CACHE_CLUSTER_NC_MEM_TAG_WIDTH(`DCACHE_MSHR_SIZE, `DCACHE_NUM_BANKS, DCACHE_NUM_REQS, DCACHE_LINE_SIZE, DCACHE_WORD_SIZE, DCACHE_NOSM_TAG_WIDTH, `SOCKET_SIZE, `NUM_DCACHES);
`else
localparam DCACHE_MEM_TAG_WIDTH = `CACHE_CLUSTER_NC_BYPASS_TAG_WIDTH(DCACHE_NUM_REQS, DCACHE_LINE_SIZE, DCACHE_WORD_SIZE, DCACHE_NOSM_TAG_WIDTH, `SOCKET_SIZE, `NUM_DCACHES);
`endif
/////////////////////////////// L1 Parameters /////////////////////////////
localparam L1_MEM_TAG_WIDTH = `MAX(ICACHE_MEM_TAG_WIDTH, DCACHE_MEM_TAG_WIDTH);
localparam L1_MEM_ARB_TAG_WIDTH = (L1_MEM_TAG_WIDTH + `CLOG2(2));
/////////////////////////////// L2 Parameters /////////////////////////////
localparam ICACHE_MEM_ARB_IDX = 0;
localparam DCACHE_MEM_ARB_IDX = ICACHE_MEM_ARB_IDX + 1;
// Word size in bytes
localparam L2_WORD_SIZE = `L1_LINE_SIZE;
// Input request size
localparam L2_NUM_REQS = `NUM_SOCKETS;
// Core request tag bits
localparam L2_TAG_WIDTH = L1_MEM_ARB_TAG_WIDTH;
// Memory request data bits
localparam L2_MEM_DATA_WIDTH = (`L2_LINE_SIZE * 8);
// Memory request tag bits
`ifdef L2_ENABLE
localparam L2_MEM_TAG_WIDTH = `CACHE_NC_MEM_TAG_WIDTH(`L2_MSHR_SIZE, `L2_NUM_BANKS, L2_NUM_REQS, `L2_LINE_SIZE, L2_WORD_SIZE, L2_TAG_WIDTH);
`else
localparam L2_MEM_TAG_WIDTH = `CACHE_NC_BYPASS_TAG_WIDTH(L2_NUM_REQS, `L2_LINE_SIZE, L2_WORD_SIZE, L2_TAG_WIDTH);
`endif
/////////////////////////////// L3 Parameters /////////////////////////////
// Word size in bytes
localparam L3_WORD_SIZE = `L2_LINE_SIZE;
// Input request size
localparam L3_NUM_REQS = `NUM_CLUSTERS;
// Core request tag bits
localparam L3_TAG_WIDTH = L2_MEM_TAG_WIDTH;
// Memory request data bits
localparam L3_MEM_DATA_WIDTH = (`L3_LINE_SIZE * 8);
// Memory request tag bits
`ifdef L3_ENABLE
localparam L3_MEM_TAG_WIDTH = `CACHE_NC_MEM_TAG_WIDTH(`L3_MSHR_SIZE, `L3_NUM_BANKS, L3_NUM_REQS, `L3_LINE_SIZE, L3_WORD_SIZE, L3_TAG_WIDTH);
`else
localparam L3_MEM_TAG_WIDTH = `CACHE_NC_BYPASS_TAG_WIDTH(L3_NUM_REQS, `L3_LINE_SIZE, L3_WORD_SIZE, L3_TAG_WIDTH);
`endif
/* verilator lint_on UNUSED */
/////////////////////////////// Issue parameters //////////////////////////
localparam ISSUE_ISW = `CLOG2(`ISSUE_WIDTH);
localparam ISSUE_ISW_W = `UP(ISSUE_ISW);
localparam ISSUE_RATIO = `NUM_WARPS / `ISSUE_WIDTH;
localparam ISSUE_WIS = `CLOG2(ISSUE_RATIO);
localparam ISSUE_WIS_W = `UP(ISSUE_WIS);
`IGNORE_UNUSED_BEGIN
function logic [`NW_WIDTH-1:0] wis_to_wid(
input logic [ISSUE_WIS_W-1:0] wis,
input logic [ISSUE_ISW_W-1:0] isw
);
if (ISSUE_WIS == 0) begin
wis_to_wid = `NW_WIDTH'(isw);
end else if (ISSUE_ISW == 0) begin
wis_to_wid = `NW_WIDTH'(wis);
end else begin
wis_to_wid = `NW_WIDTH'({wis, isw});
end
endfunction
function logic [ISSUE_ISW_W-1:0] wid_to_isw(
input logic [`NW_WIDTH-1:0] wid
);
if (ISSUE_ISW != 0) begin
wid_to_isw = wid[ISSUE_ISW_W-1:0];
end else begin
wid_to_isw = 0;
end
endfunction
function logic [ISSUE_WIS_W-1:0] wid_to_wis(
input logic [`NW_WIDTH-1:0] wid
);
if (ISSUE_WIS != 0) begin
wid_to_wis = ISSUE_WIS_W'(wid >> ISSUE_ISW);
end else begin
wid_to_wis = 0;
end
endfunction
`IGNORE_UNUSED_END
endpackage
`endif // VX_GPU_PKG_VH
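// ---------------------------------------------------------------------------
// Illustrative sketch only, not part of this change. It restates the warp-id
// mapping implemented by wis_to_wid / wid_to_isw / wid_to_wis above for the
// case where both ISSUE_ISW and ISSUE_WIS are non-zero: wid = {wis, isw},
// isw = low ISSUE_ISW bits of wid, wis = wid >> ISSUE_ISW.
// The module name and the NUM_WARPS / ISSUE_WIDTH values below are
// assumptions chosen for the example; they are not taken from VX_config.vh.
module wid_mapping_sketch;
    localparam int NUM_WARPS   = 8;                               // assumed
    localparam int ISSUE_WIDTH = 2;                               // assumed
    localparam int NW_WIDTH    = $clog2(NUM_WARPS);               // 3 bits
    localparam int ISW         = $clog2(ISSUE_WIDTH);             // 1 bit
    localparam int WIS         = $clog2(NUM_WARPS / ISSUE_WIDTH); // 2 bits

    initial begin
        for (int wid = 0; wid < NUM_WARPS; wid++) begin
            logic [NW_WIDTH-1:0] w;
            logic [ISW-1:0]      isw;
            logic [WIS-1:0]      wis;
            w   = NW_WIDTH'(wid);
            isw = w[ISW-1:0];      // issue-slice index (mirrors wid_to_isw)
            wis = WIS'(w >> ISW);  // warp index within the slice (mirrors wid_to_wis)
            // Re-concatenating must give back the original warp id (mirrors wis_to_wid).
            assert ({wis, isw} == w)
                else $error("wid %0d does not round-trip", wid);
        end
        $finish;
    end
endmodule
// With these assumed values, warps 0,2,4,6 map to issue slice 0 and warps
// 1,3,5,7 map to issue slice 1, i.e. consecutive warp ids are interleaved
// across issue slices.
// ---------------------------------------------------------------------------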


@@ -1,43 +0,0 @@
`ifndef VX_GPU_TYPES
`define VX_GPU_TYPES
`include "VX_define.vh"
package gpu_types;
typedef struct packed {
logic valid;
logic [`NUM_THREADS-1:0] tmask;
} gpu_tmc_t;
`define GPU_TMC_BITS $bits(gpu_types::gpu_tmc_t)
typedef struct packed {
logic valid;
logic [`NUM_WARPS-1:0] wmask;
logic [31:0] pc;
} gpu_wspawn_t;
`define GPU_WSPAWN_BITS $bits(gpu_types::gpu_wspawn_t)
typedef struct packed {
logic valid;
logic diverged;
logic [`NUM_THREADS-1:0] then_tmask;
logic [`NUM_THREADS-1:0] else_tmask;
logic [31:0] pc;
} gpu_split_t;
`define GPU_SPLIT_BITS $bits(gpu_types::gpu_split_t)
typedef struct packed {
logic valid;
logic [`NB_BITS-1:0] id;
logic [`NW_BITS-1:0] size_m1;
} gpu_barrier_t;
`define GPU_BARRIER_BITS $bits(gpu_types::gpu_barrier_t)
endpackage
`endif


@@ -1,220 +0,0 @@
`include "VX_define.vh"
module VX_gpu_unit #(
parameter CORE_ID = 0
) (
`SCOPE_IO_VX_gpu_unit
input wire clk,
input wire reset,
// Inputs
VX_gpu_req_if.slave gpu_req_if,
`ifdef EXT_TEX_ENABLE
// PERF
`ifdef PERF_ENABLE
VX_perf_tex_if.master perf_tex_if,
`endif
VX_dcache_req_if.master dcache_req_if,
VX_dcache_rsp_if.slave dcache_rsp_if,
VX_tex_csr_if.slave tex_csr_if,
`endif
// Outputs
VX_warp_ctl_if.master warp_ctl_if,
VX_commit_if.master gpu_commit_if
);
import gpu_types::*;
`UNUSED_PARAM (CORE_ID)
localparam WCTL_DATAW = `GPU_TMC_BITS + `GPU_WSPAWN_BITS + `GPU_SPLIT_BITS + `GPU_BARRIER_BITS;
localparam RSP_DATAW = `MAX(`NUM_THREADS * 32, WCTL_DATAW);
wire rsp_valid;
wire [`UUID_BITS-1:0] rsp_uuid;
wire [`NW_BITS-1:0] rsp_wid;
wire [`NUM_THREADS-1:0] rsp_tmask;
wire [31:0] rsp_PC;
wire [`NR_BITS-1:0] rsp_rd;
wire rsp_wb;
wire [RSP_DATAW-1:0] rsp_data, rsp_data_r;
gpu_tmc_t tmc;
gpu_wspawn_t wspawn;
gpu_barrier_t barrier;
gpu_split_t split;
wire [WCTL_DATAW-1:0] warp_ctl_data;
wire is_warp_ctl;
wire stall_in, stall_out;
wire is_wspawn = (gpu_req_if.op_type == `INST_GPU_WSPAWN);
wire is_tmc = (gpu_req_if.op_type == `INST_GPU_TMC);
wire is_split = (gpu_req_if.op_type == `INST_GPU_SPLIT);
wire is_bar = (gpu_req_if.op_type == `INST_GPU_BAR);
wire is_pred = (gpu_req_if.op_type == `INST_GPU_PRED);
wire [31:0] rs1_data = gpu_req_if.rs1_data[gpu_req_if.tid];
wire [31:0] rs2_data = gpu_req_if.rs2_data[gpu_req_if.tid];
wire [`NUM_THREADS-1:0] taken_tmask;
wire [`NUM_THREADS-1:0] not_taken_tmask;
for (genvar i = 0; i < `NUM_THREADS; i++) begin
wire taken = (gpu_req_if.rs1_data[i] != 0);
assign taken_tmask[i] = gpu_req_if.tmask[i] & taken;
assign not_taken_tmask[i] = gpu_req_if.tmask[i] & ~taken;
end
// tmc
wire [`NUM_THREADS-1:0] pred_mask = (taken_tmask != 0) ? taken_tmask : gpu_req_if.tmask;
assign tmc.valid = is_tmc || is_pred;
assign tmc.tmask = is_pred ? pred_mask : rs1_data[`NUM_THREADS-1:0];
// wspawn
wire [31:0] wspawn_pc = rs2_data;
wire [`NUM_WARPS-1:0] wspawn_wmask;
for (genvar i = 0; i < `NUM_WARPS; i++) begin
assign wspawn_wmask[i] = (i < rs1_data);
end
assign wspawn.valid = is_wspawn;
assign wspawn.wmask = wspawn_wmask;
assign wspawn.pc = wspawn_pc;
// split
assign split.valid = is_split;
assign split.diverged = (| taken_tmask) && (| not_taken_tmask);
assign split.then_tmask = taken_tmask;
assign split.else_tmask = not_taken_tmask;
assign split.pc = gpu_req_if.next_PC;
// barrier
assign barrier.valid = is_bar;
assign barrier.id = rs1_data[`NB_BITS-1:0];
assign barrier.size_m1 = (`NW_BITS)'(rs2_data - 1);
// pack warp ctl result
assign warp_ctl_data = {tmc, wspawn, split, barrier};
// texture
`ifdef EXT_TEX_ENABLE
`UNUSED_VAR (gpu_req_if.op_mod)
VX_tex_req_if tex_req_if();
VX_tex_rsp_if tex_rsp_if();
wire is_tex = (gpu_req_if.op_type == `INST_GPU_TEX);
assign tex_req_if.valid = gpu_req_if.valid && is_tex;
assign tex_req_if.uuid = gpu_req_if.uuid;
assign tex_req_if.wid = gpu_req_if.wid;
assign tex_req_if.tmask = gpu_req_if.tmask;
assign tex_req_if.PC = gpu_req_if.PC;
assign tex_req_if.rd = gpu_req_if.rd;
assign tex_req_if.wb = gpu_req_if.wb;
assign tex_req_if.unit = gpu_req_if.op_mod[`NTEX_BITS-1:0];
assign tex_req_if.coords[0] = gpu_req_if.rs1_data;
assign tex_req_if.coords[1] = gpu_req_if.rs2_data;
assign tex_req_if.lod = gpu_req_if.rs3_data;
VX_tex_unit #(
.CORE_ID(CORE_ID)
) tex_unit (
.clk (clk),
.reset (reset),
`ifdef PERF_ENABLE
.perf_tex_if (perf_tex_if),
`endif
.tex_req_if (tex_req_if),
.tex_csr_if (tex_csr_if),
.tex_rsp_if (tex_rsp_if),
.dcache_req_if (dcache_req_if),
.dcache_rsp_if (dcache_rsp_if)
);
assign tex_rsp_if.ready = !stall_out;
assign stall_in = (is_tex && ~tex_req_if.ready)
|| (~is_tex && (tex_rsp_if.valid || stall_out));
assign is_warp_ctl = !(is_tex || tex_rsp_if.valid);
assign rsp_valid = tex_rsp_if.valid || (gpu_req_if.valid && ~is_tex);
assign rsp_uuid = tex_rsp_if.valid ? tex_rsp_if.uuid : gpu_req_if.uuid;
assign rsp_wid = tex_rsp_if.valid ? tex_rsp_if.wid : gpu_req_if.wid;
assign rsp_tmask = tex_rsp_if.valid ? tex_rsp_if.tmask : gpu_req_if.tmask;
assign rsp_PC = tex_rsp_if.valid ? tex_rsp_if.PC : gpu_req_if.PC;
assign rsp_rd = tex_rsp_if.rd;
assign rsp_wb = tex_rsp_if.valid && tex_rsp_if.wb;
assign rsp_data = tex_rsp_if.valid ? RSP_DATAW'(tex_rsp_if.data) : RSP_DATAW'(warp_ctl_data);
`else
`UNUSED_VAR (gpu_req_if.op_mod)
`UNUSED_VAR (gpu_req_if.rs3_data)
`UNUSED_VAR (gpu_req_if.wb)
`UNUSED_VAR (gpu_req_if.rd)
assign stall_in = stall_out;
assign is_warp_ctl = 1;
assign rsp_valid = gpu_req_if.valid;
assign rsp_uuid = gpu_req_if.uuid;
assign rsp_wid = gpu_req_if.wid;
assign rsp_tmask = gpu_req_if.tmask;
assign rsp_PC = gpu_req_if.PC;
assign rsp_rd = 0;
assign rsp_wb = 0;
assign rsp_data = RSP_DATAW'(warp_ctl_data);
`endif
wire is_warp_ctl_r;
// output
assign stall_out = ~gpu_commit_if.ready && gpu_commit_if.valid;
VX_pipe_register #(
.DATAW (1 + `UUID_BITS + `NW_BITS + `NUM_THREADS + 32 + `NR_BITS + 1 + RSP_DATAW + 1),
.RESETW (1)
) pipe_reg (
.clk (clk),
.reset (reset),
.enable (!stall_out),
.data_in ({rsp_valid, rsp_uuid, rsp_wid, rsp_tmask, rsp_PC, rsp_rd, rsp_wb, rsp_data, is_warp_ctl}),
.data_out ({gpu_commit_if.valid, gpu_commit_if.uuid, gpu_commit_if.wid, gpu_commit_if.tmask, gpu_commit_if.PC, gpu_commit_if.rd, gpu_commit_if.wb, rsp_data_r, is_warp_ctl_r})
);
assign gpu_commit_if.data = rsp_data_r[(`NUM_THREADS * 32)-1:0];
assign gpu_commit_if.eop = 1'b1;
// warp control response
assign {warp_ctl_if.tmc, warp_ctl_if.wspawn, warp_ctl_if.split, warp_ctl_if.barrier} = rsp_data_r[WCTL_DATAW-1:0];
assign warp_ctl_if.valid = gpu_commit_if.valid && gpu_commit_if.ready && is_warp_ctl_r;
assign warp_ctl_if.wid = gpu_commit_if.wid;
// can accept new request?
assign gpu_req_if.ready = ~stall_in;
`SCOPE_ASSIGN (gpu_rsp_valid, warp_ctl_if.valid);
`SCOPE_ASSIGN (gpu_rsp_uuid, gpu_commit_if.uuid);
`SCOPE_ASSIGN (gpu_rsp_tmc, warp_ctl_if.tmc.valid);
`SCOPE_ASSIGN (gpu_rsp_wspawn, warp_ctl_if.wspawn.valid);
`SCOPE_ASSIGN (gpu_rsp_split, warp_ctl_if.split.valid);
`SCOPE_ASSIGN (gpu_rsp_barrier, warp_ctl_if.barrier.valid);
endmodule


@@ -1,210 +0,0 @@
`include "VX_define.vh"
module VX_ibuffer #(
parameter CORE_ID = 0
) (
input wire clk,
input wire reset,
// inputs
VX_decode_if.slave decode_if,
// outputs
VX_ibuffer_if.master ibuffer_if
);
`UNUSED_PARAM (CORE_ID)
localparam DATAW = `UUID_BITS + `NUM_THREADS + 32 + `EX_BITS + `INST_OP_BITS + `INST_FRM_BITS + 1 + (`NR_BITS * 4) + 32 + 1 + 1;
localparam ADDRW = $clog2(`IBUF_SIZE+1);
localparam NWARPSW = $clog2(`NUM_WARPS+1);
reg [`NUM_WARPS-1:0][ADDRW-1:0] used_r;
reg [`NUM_WARPS-1:0] full_r, empty_r, alm_empty_r;
wire [`NUM_WARPS-1:0] q_full, q_empty, q_alm_empty;
wire [DATAW-1:0] q_data_in;
wire [`NUM_WARPS-1:0][DATAW-1:0] q_data_prev;
reg [`NUM_WARPS-1:0][DATAW-1:0] q_data_out;
wire enq_fire = decode_if.valid && decode_if.ready;
wire deq_fire = ibuffer_if.valid && ibuffer_if.ready;
for (genvar i = 0; i < `NUM_WARPS; ++i) begin
wire writing = enq_fire && (i == decode_if.wid);
wire reading = deq_fire && (i == ibuffer_if.wid);
wire going_empty = empty_r[i] || (alm_empty_r[i] && reading);
VX_elastic_buffer #(
.DATAW (DATAW),
.SIZE (`IBUF_SIZE),
.OUT_REG (1)
) queue (
.clk (clk),
.reset (reset),
.valid_in (writing && !going_empty),
.data_in (q_data_in),
.ready_out(reading),
.data_out (q_data_prev[i]),
`UNUSED_PIN (ready_in),
`UNUSED_PIN (valid_out)
);
always @(posedge clk) begin
if (reset) begin
used_r[i] <= 0;
full_r[i] <= 0;
empty_r[i] <= 1;
alm_empty_r[i] <= 1;
end else begin
if (writing) begin
if (!reading) begin
empty_r[i] <= 0;
if (used_r[i] == 1)
alm_empty_r[i] <= 0;
if (used_r[i] == ADDRW'(`IBUF_SIZE))
full_r[i] <= 1;
end
end else if (reading) begin
full_r[i] <= 0;
if (used_r[i] == ADDRW'(1))
empty_r[i] <= 1;
if (used_r[i] == ADDRW'(2))
alm_empty_r[i] <= 1;
end
used_r[i] <= used_r[i] + ADDRW'($signed(2'(writing) - 2'(reading)));
end
if (writing && going_empty) begin
q_data_out[i] <= q_data_in;
end else if (reading) begin
q_data_out[i] <= q_data_prev[i];
end
end
assign q_full[i] = full_r[i];
assign q_empty[i] = empty_r[i];
assign q_alm_empty[i] = alm_empty_r[i];
end
///////////////////////////////////////////////////////////////////////////
reg [`NUM_WARPS-1:0] valid_table, valid_table_n;
reg [`NW_BITS-1:0] deq_wid, deq_wid_n;
reg [`NW_BITS-1:0] deq_wid_rr, deq_wid_rr_n;
reg deq_valid, deq_valid_n;
reg [DATAW-1:0] deq_instr, deq_instr_n;
reg [NWARPSW-1:0] num_warps;
`UNUSED_VAR (deq_instr)
// calculate valid table
always @(*) begin
valid_table_n = valid_table;
if (deq_fire) begin
valid_table_n[deq_wid] = !q_alm_empty[deq_wid];
end
if (enq_fire) begin
valid_table_n[decode_if.wid] = 1;
end
end
// round-robin warp scheduling
VX_rr_arbiter #(
.NUM_REQS (`NUM_WARPS)
) rr_arbiter (
.clk (clk),
.reset (reset),
.requests (valid_table_n),
.grant_index (deq_wid_rr_n),
`UNUSED_PIN (grant_valid),
`UNUSED_PIN (grant_onehot),
`UNUSED_PIN (enable)
);
// schedule the next instruction to issue
always @(*) begin
if (num_warps > 1) begin
deq_valid_n = 1;
deq_wid_n = deq_wid_rr;
deq_instr_n = q_data_out[deq_wid_rr];
end else if (1 == num_warps && !(deq_fire && q_alm_empty[deq_wid])) begin
deq_valid_n = 1;
deq_wid_n = deq_wid;
deq_instr_n = deq_fire ? q_data_prev[deq_wid] : q_data_out[deq_wid];
end else begin
deq_valid_n = enq_fire;
deq_wid_n = decode_if.wid;
deq_instr_n = q_data_in;
end
end
wire warp_added = enq_fire && q_empty[decode_if.wid];
wire warp_removed = deq_fire && ~(enq_fire && decode_if.wid == deq_wid) && q_alm_empty[deq_wid];
always @(posedge clk) begin
if (reset) begin
valid_table <= 0;
deq_valid <= 0;
num_warps <= 0;
end else begin
valid_table <= valid_table_n;
deq_valid <= deq_valid_n;
if (warp_added && !warp_removed) begin
num_warps <= num_warps + NWARPSW'(1);
end else if (warp_removed && !warp_added) begin
num_warps <= num_warps - NWARPSW'(1);
end
end
deq_wid <= deq_wid_n;
deq_wid_rr <= deq_wid_rr_n;
deq_instr <= deq_instr_n;
end
assign decode_if.ready = ~q_full[decode_if.wid];
assign q_data_in = {decode_if.uuid,
decode_if.tmask,
decode_if.PC,
decode_if.ex_type,
decode_if.op_type,
decode_if.op_mod,
decode_if.wb,
decode_if.use_PC,
decode_if.use_imm,
decode_if.imm,
decode_if.rd,
decode_if.rs1,
decode_if.rs2,
decode_if.rs3};
assign ibuffer_if.valid = deq_valid;
assign ibuffer_if.wid = deq_wid;
assign {ibuffer_if.uuid,
ibuffer_if.tmask,
ibuffer_if.PC,
ibuffer_if.ex_type,
ibuffer_if.op_type,
ibuffer_if.op_mod,
ibuffer_if.wb,
ibuffer_if.use_PC,
ibuffer_if.use_imm,
ibuffer_if.imm,
ibuffer_if.rd,
ibuffer_if.rs1,
ibuffer_if.rs2,
ibuffer_if.rs3} = deq_instr;
// scoreboard forwarding
assign ibuffer_if.wid_n = deq_wid_n;
assign ibuffer_if.rd_n = deq_instr_n[3*`NR_BITS +: `NR_BITS];
assign ibuffer_if.rs1_n = deq_instr_n[2*`NR_BITS +: `NR_BITS];
assign ibuffer_if.rs2_n = deq_instr_n[1*`NR_BITS +: `NR_BITS];
assign ibuffer_if.rs3_n = deq_instr_n[0*`NR_BITS +: `NR_BITS];
endmodule


@@ -1,102 +0,0 @@
`include "VX_define.vh"
module VX_icache_stage #(
parameter CORE_ID = 0
) (
`SCOPE_IO_VX_icache_stage
input wire clk,
input wire reset,
// Icache interface
VX_icache_req_if.master icache_req_if,
VX_icache_rsp_if.slave icache_rsp_if,
// request
VX_ifetch_req_if.slave ifetch_req_if,
// response
VX_ifetch_rsp_if.master ifetch_rsp_if
);
`UNUSED_PARAM (CORE_ID)
`UNUSED_VAR (reset)
localparam OUT_REG = 0;
wire [`NW_BITS-1:0] req_tag, rsp_tag;
wire icache_req_fire = icache_req_if.valid && icache_req_if.ready;
assign req_tag = ifetch_req_if.wid;
assign rsp_tag = icache_rsp_if.tag[`NW_BITS-1:0];
wire [`UUID_BITS-1:0] rsp_uuid;
wire [31:0] rsp_PC;
wire [`NUM_THREADS-1:0] rsp_tmask;
VX_dp_ram #(
.DATAW (32 + `NUM_THREADS + `UUID_BITS),
.SIZE (`NUM_WARPS),
.LUTRAM (1)
) req_metadata (
.clk (clk),
.wren (icache_req_fire),
.waddr (req_tag),
.wdata ({ifetch_req_if.PC, ifetch_req_if.tmask, ifetch_req_if.uuid}),
.raddr (rsp_tag),
.rdata ({rsp_PC, rsp_tmask, rsp_uuid})
);
`RUNTIME_ASSERT((!ifetch_req_if.valid || ifetch_req_if.PC >= `STARTUP_ADDR),
("%t: *** invalid PC=%0h, wid=%0d, tmask=%b (#%0d)", $time, ifetch_req_if.PC, ifetch_req_if.wid, ifetch_req_if.tmask, ifetch_req_if.uuid))
// Icache Request
assign icache_req_if.valid = ifetch_req_if.valid;
assign icache_req_if.addr = ifetch_req_if.PC[31:2];
assign icache_req_if.tag = {ifetch_req_if.uuid, req_tag};
// Can accept new request?
assign ifetch_req_if.ready = icache_req_if.ready;
wire [`NW_BITS-1:0] rsp_wid = rsp_tag;
wire stall_out = ~ifetch_rsp_if.ready && (0 == OUT_REG && ifetch_rsp_if.valid);
VX_pipe_register #(
.DATAW (1 + `NW_BITS + `NUM_THREADS + 32 + 32 + `UUID_BITS),
.RESETW (1),
.DEPTH (OUT_REG)
) pipe_reg (
.clk (clk),
.reset (reset),
.enable (!stall_out),
.data_in ({icache_rsp_if.valid, rsp_wid, rsp_tmask, rsp_PC, icache_rsp_if.data, rsp_uuid}),
.data_out ({ifetch_rsp_if.valid, ifetch_rsp_if.wid, ifetch_rsp_if.tmask, ifetch_rsp_if.PC, ifetch_rsp_if.data, ifetch_rsp_if.uuid})
);
// Can accept new response?
assign icache_rsp_if.ready = ~stall_out;
`SCOPE_ASSIGN (icache_req_fire, icache_req_fire);
`SCOPE_ASSIGN (icache_req_uuid, ifetch_req_if.uuid);
`SCOPE_ASSIGN (icache_req_addr, {icache_req_if.addr, 2'b0});
`SCOPE_ASSIGN (icache_req_tag, req_tag);
`SCOPE_ASSIGN (icache_rsp_fire, icache_rsp_if.valid && icache_rsp_if.ready);
`SCOPE_ASSIGN (icache_rsp_uuid, rsp_uuid);
`SCOPE_ASSIGN (icache_rsp_data, icache_rsp_if.data);
`SCOPE_ASSIGN (icache_rsp_tag, rsp_tag);
`ifdef DBG_TRACE_CORE_ICACHE
always @(posedge clk) begin
if (icache_req_fire) begin
dpi_trace("%d: I$%0d req: wid=%0d, PC=%0h (#%0d)\n", $time, CORE_ID, ifetch_req_if.wid, ifetch_req_if.PC, ifetch_req_if.uuid);
end
if (ifetch_rsp_if.valid && ifetch_rsp_if.ready) begin
dpi_trace("%d: I$%0d rsp: wid=%0d, PC=%0h, data=%0h (#%0d)\n", $time, CORE_ID, ifetch_rsp_if.wid, ifetch_rsp_if.PC, ifetch_rsp_if.data, ifetch_rsp_if.uuid);
end
end
`endif
endmodule


@@ -1,68 +0,0 @@
`include "VX_platform.vh"
module VX_ipdom_stack #(
parameter WIDTH = 1,
parameter DEPTH = 1
) (
input wire clk,
input wire reset,
input wire pair,
input wire [WIDTH - 1:0] q1,
input wire [WIDTH - 1:0] q2,
output wire [WIDTH - 1:0] d,
input wire push,
input wire pop,
output wire index,
output wire empty,
output wire full
);
localparam ADDRW = $clog2(DEPTH);
reg is_part [DEPTH-1:0];
reg [ADDRW-1:0] rd_ptr, wr_ptr;
wire [WIDTH-1:0] d1, d2;
always @(posedge clk) begin
if (reset) begin
rd_ptr <= 0;
wr_ptr <= 0;
end else begin
if (push) begin
rd_ptr <= wr_ptr;
wr_ptr <= wr_ptr + ADDRW'(1);
end else if (pop) begin
wr_ptr <= wr_ptr - ADDRW'(is_part[rd_ptr]);
rd_ptr <= rd_ptr - ADDRW'(is_part[rd_ptr]);
end
end
end
VX_dp_ram #(
.DATAW (WIDTH * 2),
.SIZE (DEPTH),
.LUTRAM (1)
) store (
.clk (clk),
.wren (push),
.waddr (wr_ptr),
.wdata ({q2, q1}),
.raddr (rd_ptr),
.rdata ({d2, d1})
);
always @(posedge clk) begin
if (push) begin
is_part[wr_ptr] <= ~pair;
end else if (pop) begin
is_part[rd_ptr] <= 1;
end
end
assign index = is_part[rd_ptr];
assign d = index ? d1 : d2;
assign empty = (ADDRW'(0) == wr_ptr);
assign full = (ADDRW'(DEPTH-1) == wr_ptr);
endmodule


@@ -1,256 +0,0 @@
`include "VX_define.vh"
module VX_issue #(
parameter CORE_ID = 0
) (
`SCOPE_IO_VX_issue
input wire clk,
input wire reset,
`ifdef PERF_ENABLE
VX_perf_pipeline_if.issue perf_issue_if,
`endif
VX_decode_if.slave decode_if,
VX_writeback_if.slave writeback_if,
VX_alu_req_if.master alu_req_if,
VX_lsu_req_if.master lsu_req_if,
VX_csr_req_if.master csr_req_if,
`ifdef EXT_F_ENABLE
VX_fpu_req_if.master fpu_req_if,
`endif
VX_gpu_req_if.master gpu_req_if
);
VX_ibuffer_if ibuffer_if();
VX_gpr_req_if gpr_req_if();
VX_gpr_rsp_if gpr_rsp_if();
VX_writeback_if sboard_wb_if();
VX_ibuffer_if scoreboard_if();
VX_ibuffer_if dispatch_if();
// GPR request interface
assign gpr_req_if.wid = ibuffer_if.wid;
assign gpr_req_if.rs1 = ibuffer_if.rs1;
assign gpr_req_if.rs2 = ibuffer_if.rs2;
assign gpr_req_if.rs3 = ibuffer_if.rs3;
// scoreboard writeback interface
assign sboard_wb_if.valid = writeback_if.valid;
assign sboard_wb_if.uuid = writeback_if.uuid;
assign sboard_wb_if.wid = writeback_if.wid;
assign sboard_wb_if.PC = writeback_if.PC;
assign sboard_wb_if.rd = writeback_if.rd;
assign sboard_wb_if.eop = writeback_if.eop;
// scoreboard interface
assign scoreboard_if.valid = ibuffer_if.valid && dispatch_if.ready;
assign scoreboard_if.uuid = ibuffer_if.uuid;
assign scoreboard_if.wid = ibuffer_if.wid;
assign scoreboard_if.PC = ibuffer_if.PC;
assign scoreboard_if.wb = ibuffer_if.wb;
assign scoreboard_if.rd = ibuffer_if.rd;
assign scoreboard_if.rd_n = ibuffer_if.rd_n;
assign scoreboard_if.rs1_n = ibuffer_if.rs1_n;
assign scoreboard_if.rs2_n = ibuffer_if.rs2_n;
assign scoreboard_if.rs3_n = ibuffer_if.rs3_n;
assign scoreboard_if.wid_n = ibuffer_if.wid_n;
// dispatch interface
assign dispatch_if.valid = ibuffer_if.valid && scoreboard_if.ready;
assign dispatch_if.uuid = ibuffer_if.uuid;
assign dispatch_if.wid = ibuffer_if.wid;
assign dispatch_if.tmask = ibuffer_if.tmask;
assign dispatch_if.PC = ibuffer_if.PC;
assign dispatch_if.ex_type = ibuffer_if.ex_type;
assign dispatch_if.op_type = ibuffer_if.op_type;
assign dispatch_if.op_mod = ibuffer_if.op_mod;
assign dispatch_if.wb = ibuffer_if.wb;
assign dispatch_if.rd = ibuffer_if.rd;
assign dispatch_if.rs1 = ibuffer_if.rs1;
assign dispatch_if.imm = ibuffer_if.imm;
assign dispatch_if.use_PC = ibuffer_if.use_PC;
assign dispatch_if.use_imm = ibuffer_if.use_imm;
// issue the instruction
assign ibuffer_if.ready = scoreboard_if.ready && dispatch_if.ready;
`RESET_RELAY (ibuf_reset);
`RESET_RELAY (scoreboard_reset);
`RESET_RELAY (gpr_reset);
`RESET_RELAY (dispatch_reset);
VX_ibuffer #(
.CORE_ID(CORE_ID)
) ibuffer (
.clk (clk),
.reset (ibuf_reset),
.decode_if (decode_if),
.ibuffer_if (ibuffer_if)
);
VX_scoreboard #(
.CORE_ID(CORE_ID)
) scoreboard (
.clk (clk),
.reset (scoreboard_reset),
.writeback_if(sboard_wb_if),
.ibuffer_if (scoreboard_if)
);
VX_gpr_stage #(
.CORE_ID(CORE_ID)
) gpr_stage (
.clk (clk),
.reset (gpr_reset),
.writeback_if (writeback_if),
.gpr_req_if (gpr_req_if),
.gpr_rsp_if (gpr_rsp_if)
);
VX_dispatch dispatch (
.clk (clk),
.reset (dispatch_reset),
.ibuffer_if (dispatch_if),
.gpr_rsp_if (gpr_rsp_if),
.alu_req_if (alu_req_if),
.lsu_req_if (lsu_req_if),
.csr_req_if (csr_req_if),
`ifdef EXT_F_ENABLE
.fpu_req_if (fpu_req_if),
`endif
.gpu_req_if (gpu_req_if)
);
`SCOPE_ASSIGN (issue_fire, ibuffer_if.valid && ibuffer_if.ready);
`SCOPE_ASSIGN (issue_uuid, ibuffer_if.uuid);
`SCOPE_ASSIGN (issue_tmask, ibuffer_if.tmask);
`SCOPE_ASSIGN (issue_ex_type, ibuffer_if.ex_type);
`SCOPE_ASSIGN (issue_op_type, ibuffer_if.op_type);
`SCOPE_ASSIGN (issue_op_mod, ibuffer_if.op_mod);
`SCOPE_ASSIGN (issue_wb, ibuffer_if.wb);
`SCOPE_ASSIGN (issue_rd, ibuffer_if.rd);
`SCOPE_ASSIGN (issue_rs1, ibuffer_if.rs1);
`SCOPE_ASSIGN (issue_rs2, ibuffer_if.rs2);
`SCOPE_ASSIGN (issue_rs3, ibuffer_if.rs3);
`SCOPE_ASSIGN (issue_imm, ibuffer_if.imm);
`SCOPE_ASSIGN (issue_use_pc, ibuffer_if.use_PC);
`SCOPE_ASSIGN (issue_use_imm, ibuffer_if.use_imm);
`SCOPE_ASSIGN (scoreboard_delay, !scoreboard_if.ready);
`SCOPE_ASSIGN (dispatch_delay, !dispatch_if.ready);
`SCOPE_ASSIGN (gpr_rs1, gpr_rsp_if.rs1_data);
`SCOPE_ASSIGN (gpr_rs2, gpr_rsp_if.rs2_data);
`SCOPE_ASSIGN (gpr_rs3, gpr_rsp_if.rs3_data);
`SCOPE_ASSIGN (writeback_valid, writeback_if.valid);
`SCOPE_ASSIGN (writeback_uuid, writeback_if.uuid);
`SCOPE_ASSIGN (writeback_tmask, writeback_if.tmask);
`SCOPE_ASSIGN (writeback_rd, writeback_if.rd);
`SCOPE_ASSIGN (writeback_data, writeback_if.data);
`SCOPE_ASSIGN (writeback_eop, writeback_if.eop);
`ifdef PERF_ENABLE
reg [`PERF_CTR_BITS-1:0] perf_ibf_stalls;
reg [`PERF_CTR_BITS-1:0] perf_scb_stalls;
reg [`PERF_CTR_BITS-1:0] perf_alu_stalls;
reg [`PERF_CTR_BITS-1:0] perf_lsu_stalls;
reg [`PERF_CTR_BITS-1:0] perf_csr_stalls;
reg [`PERF_CTR_BITS-1:0] perf_gpu_stalls;
`ifdef EXT_F_ENABLE
reg [`PERF_CTR_BITS-1:0] perf_fpu_stalls;
`endif
always @(posedge clk) begin
if (reset) begin
perf_ibf_stalls <= 0;
perf_scb_stalls <= 0;
perf_alu_stalls <= 0;
perf_lsu_stalls <= 0;
perf_csr_stalls <= 0;
perf_gpu_stalls <= 0;
`ifdef EXT_F_ENABLE
perf_fpu_stalls <= 0;
`endif
end else begin
if (decode_if.valid & ~decode_if.ready) begin
perf_ibf_stalls <= perf_ibf_stalls + `PERF_CTR_BITS'd1;
end
if (scoreboard_if.valid & ~scoreboard_if.ready) begin
perf_scb_stalls <= perf_scb_stalls + `PERF_CTR_BITS'd1;
end
if (dispatch_if.valid & ~dispatch_if.ready) begin
case (dispatch_if.ex_type)
`EX_ALU: perf_alu_stalls <= perf_alu_stalls + `PERF_CTR_BITS'd1;
`ifdef EXT_F_ENABLE
`EX_FPU: perf_fpu_stalls <= perf_fpu_stalls + `PERF_CTR_BITS'd1;
`endif
`EX_LSU: perf_lsu_stalls <= perf_lsu_stalls + `PERF_CTR_BITS'd1;
`EX_CSR: perf_csr_stalls <= perf_csr_stalls + `PERF_CTR_BITS'd1;
//`EX_GPU:
default: perf_gpu_stalls <= perf_gpu_stalls + `PERF_CTR_BITS'd1;
endcase
end
end
end
assign perf_issue_if.ibf_stalls = perf_ibf_stalls;
assign perf_issue_if.scb_stalls = perf_scb_stalls;
assign perf_issue_if.alu_stalls = perf_alu_stalls;
assign perf_issue_if.lsu_stalls = perf_lsu_stalls;
assign perf_issue_if.csr_stalls = perf_csr_stalls;
assign perf_issue_if.gpu_stalls = perf_gpu_stalls;
`ifdef EXT_F_ENABLE
assign perf_issue_if.fpu_stalls = perf_fpu_stalls;
`endif
`endif
`ifdef DBG_TRACE_CORE_PIPELINE
always @(posedge clk) begin
if (alu_req_if.valid && alu_req_if.ready) begin
dpi_trace("%d: core%0d-issue: wid=%0d, PC=%0h, ex=ALU, tmask=%b, rd=%0d, rs1_data=",
$time, CORE_ID, alu_req_if.wid, alu_req_if.PC, alu_req_if.tmask, alu_req_if.rd);
`TRACE_ARRAY1D(alu_req_if.rs1_data, `NUM_THREADS);
dpi_trace(", rs2_data=");
`TRACE_ARRAY1D(alu_req_if.rs2_data, `NUM_THREADS);
dpi_trace(" (#%0d)\n", alu_req_if.uuid);
end
if (lsu_req_if.valid && lsu_req_if.ready) begin
dpi_trace("%d: core%0d-issue: wid=%0d, PC=%0h, ex=LSU, tmask=%b, rd=%0d, offset=%0h, addr=",
$time, CORE_ID, lsu_req_if.wid, lsu_req_if.PC, lsu_req_if.tmask, lsu_req_if.rd, lsu_req_if.offset);
`TRACE_ARRAY1D(lsu_req_if.base_addr, `NUM_THREADS);
dpi_trace(", data=");
`TRACE_ARRAY1D(lsu_req_if.store_data, `NUM_THREADS);
dpi_trace(" (#%0d)\n", lsu_req_if.uuid);
end
if (csr_req_if.valid && csr_req_if.ready) begin
dpi_trace("%d: core%0d-issue: wid=%0d, PC=%0h, ex=CSR, tmask=%b, rd=%0d, addr=%0h, rs1_data=",
$time, CORE_ID, csr_req_if.wid, csr_req_if.PC, csr_req_if.tmask, csr_req_if.rd, csr_req_if.addr);
`TRACE_ARRAY1D(csr_req_if.rs1_data, `NUM_THREADS);
dpi_trace(" (#%0d)\n", csr_req_if.uuid);
end
`ifdef EXT_F_ENABLE
if (fpu_req_if.valid && fpu_req_if.ready) begin
dpi_trace("%d: core%0d-issue: wid=%0d, PC=%0h, ex=FPU, tmask=%b, rd=%0d, rs1_data=",
$time, CORE_ID, fpu_req_if.wid, fpu_req_if.PC, fpu_req_if.tmask, fpu_req_if.rd);
`TRACE_ARRAY1D(fpu_req_if.rs1_data, `NUM_THREADS);
dpi_trace(", rs2_data=");
`TRACE_ARRAY1D(fpu_req_if.rs2_data, `NUM_THREADS);
dpi_trace(", rs3_data=");
`TRACE_ARRAY1D(fpu_req_if.rs3_data, `NUM_THREADS);
dpi_trace(" (#%0d)\n", fpu_req_if.uuid);
end
`endif
if (gpu_req_if.valid && gpu_req_if.ready) begin
dpi_trace("%d: core%0d-issue: wid=%0d, PC=%0h, ex=GPU, tmask=%b, rd=%0d, rs1_data=",
$time, CORE_ID, gpu_req_if.wid, gpu_req_if.PC, gpu_req_if.tmask, gpu_req_if.rd);
`TRACE_ARRAY1D(gpu_req_if.rs1_data, `NUM_THREADS);
dpi_trace(", rs2_data=");
`TRACE_ARRAY1D(gpu_req_if.rs2_data, `NUM_THREADS);
dpi_trace(", rs3_data=");
`TRACE_ARRAY1D(gpu_req_if.rs3_data, `NUM_THREADS);
dpi_trace(" (#%0d)\n", gpu_req_if.uuid);
end
end
`endif
endmodule
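
The issue logic above gates each consumer's valid on the other consumer's ready (`scoreboard_if.valid` on `dispatch_if.ready`, and `dispatch_if.valid` on `scoreboard_if.ready`), so an instruction is accepted by both the scoreboard and dispatch in the same cycle or by neither. The following is a minimal sketch of that join, assuming plain valid/ready signals; the module `issue_join`, its testbench, and all port names are hypothetical and do not appear in the source.

```systemverilog
// Illustrative only: issue_join is a hypothetical stand-in for the
// ibuffer -> {scoreboard, dispatch} handshake shown in the issue stage.
module issue_join (
    input  wire src_valid,   // plays the role of ibuffer_if.valid
    output wire src_ready,   // plays the role of ibuffer_if.ready
    output wire a_valid,     // plays the role of scoreboard_if.valid
    input  wire a_ready,     // plays the role of scoreboard_if.ready
    output wire b_valid,     // plays the role of dispatch_if.valid
    input  wire b_ready      // plays the role of dispatch_if.ready
);
    // Each consumer only sees valid when the other consumer is ready,
    // so both handshakes complete together or not at all.
    assign a_valid   = src_valid && b_ready;
    assign b_valid   = src_valid && a_ready;
    assign src_ready = a_ready && b_ready;
endmodule

module issue_join_tb;
    reg  src_valid, a_ready, b_ready;
    wire src_ready, a_valid, b_valid;
    integer i, errors;

    issue_join dut (
        .src_valid (src_valid),
        .src_ready (src_ready),
        .a_valid   (a_valid),
        .a_ready   (a_ready),
        .b_valid   (b_valid),
        .b_ready   (b_ready)
    );

    initial begin
        errors = 0;
        // Exhaustively try every combination of the three inputs.
        for (i = 0; i < 8; i = i + 1) begin
            {src_valid, a_ready, b_ready} = i[2:0];
            #1;
            // Both consumers must fire together ...
            if ((a_valid && a_ready) !== (b_valid && b_ready)) errors = errors + 1;
            // ... and exactly when the source handshake fires.
            if ((src_valid && src_ready) !== (a_valid && a_ready)) errors = errors + 1;
        end
        if (errors == 0) $display("issue_join: all 8 input combinations consistent");
        else             $display("issue_join: %0d mismatches", errors);
        $finish;
    end
endmodule
```

One consequence of this style is that valid depends combinationally on ready, so consumers whose ready depends on their own valid could close a combinational loop; whether that matters here depends on the scoreboard and dispatch implementations, which are not shown in this file.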


@@ -1,372 +0,0 @@
`include "VX_define.vh"
module VX_lsu_unit #(
parameter CORE_ID = 0
) (
`SCOPE_IO_VX_lsu_unit
input wire clk,
input wire reset,
// Dcache interface
VX_dcache_req_if.master dcache_req_if,
VX_dcache_rsp_if.slave dcache_rsp_if,
// inputs
VX_lsu_req_if.slave lsu_req_if,
// outputs
VX_commit_if.master ld_commit_if,
VX_commit_if.master st_commit_if
);
localparam MEM_ASHIFT = `CLOG2(`MEM_BLOCK_SIZE);
localparam MEM_ADDRW = 32 - MEM_ASHIFT;
localparam REQ_ASHIFT = `CLOG2(`DCACHE_WORD_SIZE);
`STATIC_ASSERT(0 == (`IO_BASE_ADDR % `MEM_BLOCK_SIZE), ("invalid parameter"))
`STATIC_ASSERT(0 == (`SMEM_BASE_ADDR % `MEM_BLOCK_SIZE), ("invalid parameter"))
`STATIC_ASSERT(`SMEM_SIZE == `MEM_BLOCK_SIZE * (`SMEM_SIZE / `MEM_BLOCK_SIZE), ("invalid parameter"))
wire req_valid;
wire [`UUID_BITS-1:0] req_uuid;
wire [`NUM_THREADS-1:0] req_tmask;
wire [`NUM_THREADS-1:0][31:0] req_addr;
wire [`INST_LSU_BITS-1:0] req_type;
wire [`NUM_THREADS-1:0][31:0] req_data;
wire [`NR_BITS-1:0] req_rd;
wire req_wb;
wire [`NW_BITS-1:0] req_wid;
wire [31:0] req_pc;
wire req_is_dup;
wire req_is_prefetch;
wire mbuf_empty;
wire [`NUM_THREADS-1:0][`CACHE_ADDR_TYPE_BITS-1:0] lsu_addr_type, req_addr_type;
// full address calculation
wire [`NUM_THREADS-1:0][31:0] full_addr;
for (genvar i = 0; i < `NUM_THREADS; i++) begin
assign full_addr[i] = lsu_req_if.base_addr[i] + lsu_req_if.offset;
end
// detect duplicate addresses
wire [`NUM_THREADS-2:0] addr_matches;
for (genvar i = 0; i < (`NUM_THREADS-1); i++) begin
assign addr_matches[i] = (lsu_req_if.base_addr[i+1] == lsu_req_if.base_addr[0]) || ~lsu_req_if.tmask[i+1];
end
wire lsu_is_dup = lsu_req_if.tmask[0] && (& addr_matches);
for (genvar i = 0; i < `NUM_THREADS; i++) begin
// is non-cacheable address
wire is_addr_nc = (full_addr[i][MEM_ASHIFT +: MEM_ADDRW] >= MEM_ADDRW'(`IO_BASE_ADDR >> MEM_ASHIFT));
if (`SM_ENABLE) begin
// is shared memory address
wire is_addr_sm = (full_addr[i][MEM_ASHIFT +: MEM_ADDRW] >= MEM_ADDRW'((`SMEM_BASE_ADDR - `SMEM_SIZE) >> MEM_ASHIFT))
& (full_addr[i][MEM_ASHIFT +: MEM_ADDRW] < MEM_ADDRW'(`SMEM_BASE_ADDR >> MEM_ASHIFT));
assign lsu_addr_type[i] = {is_addr_nc, is_addr_sm};
end else begin
assign lsu_addr_type[i] = is_addr_nc;
end
end
// fence stalls the pipeline until all pending requests are sent
wire fence_wait = lsu_req_if.is_fence && (req_valid || !mbuf_empty);
wire ready_in;
wire stall_in = ~ready_in && req_valid;
wire lsu_valid = lsu_req_if.valid && ~fence_wait;
wire lsu_wb = lsu_req_if.wb | lsu_req_if.is_prefetch;
VX_pipe_register #(
.DATAW (1 + 1 + 1 + `UUID_BITS + `NW_BITS + `NUM_THREADS + 32 + (`NUM_THREADS * 32) + (`NUM_THREADS * `CACHE_ADDR_TYPE_BITS) + `INST_LSU_BITS + `NR_BITS + 1 + (`NUM_THREADS * 32)),
.RESETW (1)
) req_pipe_reg (
.clk (clk),
.reset (reset),
.enable (!stall_in),
.data_in ({lsu_valid, lsu_is_dup, lsu_req_if.is_prefetch, lsu_req_if.uuid, lsu_req_if.wid, lsu_req_if.tmask, lsu_req_if.PC, full_addr, lsu_addr_type, lsu_req_if.op_type, lsu_req_if.rd, lsu_wb, lsu_req_if.store_data}),
.data_out ({req_valid, req_is_dup, req_is_prefetch, req_uuid, req_wid, req_tmask, req_pc, req_addr, req_addr_type, req_type, req_rd, req_wb, req_data})
);
// Can accept new request?
assign lsu_req_if.ready = ~stall_in && ~fence_wait;
wire [`UUID_BITS-1:0] rsp_uuid;
wire [`NW_BITS-1:0] rsp_wid;
wire [31:0] rsp_pc;
wire [`NR_BITS-1:0] rsp_rd;
wire rsp_wb;
wire [`INST_LSU_BITS-1:0] rsp_type;
wire rsp_is_dup;
wire rsp_is_prefetch;
reg [`LSUQ_SIZE-1:0][`NUM_THREADS-1:0] rsp_rem_mask;
wire [`NUM_THREADS-1:0] rsp_rem_mask_n;
wire [`NUM_THREADS-1:0] rsp_tmask;
reg [`NUM_THREADS-1:0] req_sent_mask;
reg is_req_start;
wire [`LSUQ_ADDR_BITS-1:0] mbuf_waddr, mbuf_raddr;
wire mbuf_full;
`UNUSED_VAR (rsp_type)
`UNUSED_VAR (rsp_is_prefetch)
wire [`NUM_THREADS-1:0][REQ_ASHIFT-1:0] req_offset, rsp_offset;
for (genvar i = 0; i < `NUM_THREADS; i++) begin
assign req_offset[i] = req_addr[i][1:0];
end
wire [`NUM_THREADS-1:0] dcache_req_fire = dcache_req_if.valid & dcache_req_if.ready;
wire dcache_rsp_fire = dcache_rsp_if.valid && dcache_rsp_if.ready;
wire [`NUM_THREADS-1:0] req_tmask_dup = req_tmask & {{(`NUM_THREADS-1){~req_is_dup}}, 1'b1};
wire mbuf_push = ~mbuf_full
&& (| ({`NUM_THREADS{req_valid}} & req_tmask_dup & dcache_req_if.ready))
&& is_req_start // first submission only
&& req_wb; // loads only
wire mbuf_pop = dcache_rsp_fire && (0 == rsp_rem_mask_n);
assign mbuf_raddr = dcache_rsp_if.tag[`CACHE_ADDR_TYPE_BITS +: `LSUQ_ADDR_BITS];
`UNUSED_VAR (dcache_rsp_if.tag)
// do not writeback from software prefetch
wire req_wb2 = req_wb && ~req_is_prefetch;
VX_index_buffer #(
.DATAW (`UUID_BITS + `NW_BITS + 32 + `NUM_THREADS + `NR_BITS + 1 + `INST_LSU_BITS + (`NUM_THREADS * REQ_ASHIFT) + 1 + 1),
.SIZE (`LSUQ_SIZE)
) req_metadata (
.clk (clk),
.reset (reset),
.write_addr (mbuf_waddr),
.acquire_slot (mbuf_push),
.read_addr (mbuf_raddr),
.write_data ({req_uuid, req_wid, req_pc, req_tmask, req_rd, req_wb2, req_type, req_offset, req_is_dup, req_is_prefetch}),
.read_data ({rsp_uuid, rsp_wid, rsp_pc, rsp_tmask, rsp_rd, rsp_wb, rsp_type, rsp_offset, rsp_is_dup, rsp_is_prefetch}),
.release_addr (mbuf_raddr),
.release_slot (mbuf_pop),
.full (mbuf_full),
.empty (mbuf_empty)
);
wire dcache_req_ready = &(dcache_req_if.ready | req_sent_mask | ~req_tmask_dup);
wire [`NUM_THREADS-1:0] req_sent_mask_n = req_sent_mask | dcache_req_fire;
always @(posedge clk) begin
if (reset) begin
req_sent_mask <= 0;
is_req_start <= 1;
end else begin
if (dcache_req_ready) begin
req_sent_mask <= 0;
is_req_start <= 1;
end else begin
req_sent_mask <= req_sent_mask_n;
is_req_start <= (0 == req_sent_mask_n);
end
end
end
// need to hold the acquired tag index until the full request is submitted
reg [`LSUQ_ADDR_BITS-1:0] req_tag_hold;
wire [`LSUQ_ADDR_BITS-1:0] req_tag = is_req_start ? mbuf_waddr : req_tag_hold;
always @(posedge clk) begin
if (mbuf_push) begin
req_tag_hold <= mbuf_waddr;
end
end
assign rsp_rem_mask_n = rsp_rem_mask[mbuf_raddr] & ~dcache_rsp_if.tmask;
always @(posedge clk) begin
if (mbuf_push) begin
rsp_rem_mask[mbuf_waddr] <= req_tmask_dup;
end
if (dcache_rsp_fire) begin
rsp_rem_mask[mbuf_raddr] <= rsp_rem_mask_n;
end
end
// ensure all dependencies for the requests are resolved
wire req_dep_ready = (req_wb && ~(mbuf_full && is_req_start))
|| (~req_wb && st_commit_if.ready);
// DCache Request
for (genvar i = 0; i < `NUM_THREADS; i++) begin
reg [3:0] mem_req_byteen;
reg [31:0] mem_req_data;
always @(*) begin
mem_req_byteen = {4{req_wb}};
case (`INST_LSU_WSIZE(req_type))
0: mem_req_byteen[req_offset[i]] = 1;
1: begin
mem_req_byteen[req_offset[i]] = 1;
mem_req_byteen[{req_offset[i][1], 1'b1}] = 1;
end
default : mem_req_byteen = {4{1'b1}};
endcase
end
always @(*) begin
mem_req_data = req_data[i];
case (req_offset[i])
1: mem_req_data[31:8] = req_data[i][23:0];
2: mem_req_data[31:16] = req_data[i][15:0];
3: mem_req_data[31:24] = req_data[i][7:0];
default:;
endcase
end
assign dcache_req_if.valid[i] = req_valid && req_dep_ready && req_tmask_dup[i] && !req_sent_mask[i];
assign dcache_req_if.rw[i] = ~req_wb;
assign dcache_req_if.addr[i] = req_addr[i][31:2];
assign dcache_req_if.byteen[i] = mem_req_byteen;
assign dcache_req_if.data[i] = mem_req_data;
assign dcache_req_if.tag[i] = {req_uuid, `LSU_TAG_ID_BITS'(req_tag), req_addr_type[i]};
end
assign ready_in = req_dep_ready && dcache_req_ready;
// send store commit
wire is_store_rsp = req_valid && ~req_wb && dcache_req_ready;
assign st_commit_if.valid = is_store_rsp;
assign st_commit_if.uuid = req_uuid;
assign st_commit_if.wid = req_wid;
assign st_commit_if.tmask = req_tmask;
assign st_commit_if.PC = req_pc;
assign st_commit_if.rd = 0;
assign st_commit_if.wb = 0;
assign st_commit_if.eop = 1'b1;
assign st_commit_if.data = 0;
// load response formatting
reg [`NUM_THREADS-1:0][31:0] rsp_data;
wire [`NUM_THREADS-1:0] rsp_tmask_qual;
for (genvar i = 0; i < `NUM_THREADS; i++) begin
wire [31:0] rsp_data32 = (i == 0 || rsp_is_dup) ? dcache_rsp_if.data[0] : dcache_rsp_if.data[i];
wire [15:0] rsp_data16 = rsp_offset[i][1] ? rsp_data32[31:16] : rsp_data32[15:0];
wire [7:0] rsp_data8 = rsp_offset[i][0] ? rsp_data16[15:8] : rsp_data16[7:0];
always @(*) begin
case (`INST_LSU_FMT(rsp_type))
`INST_FMT_B: rsp_data[i] = 32'(signed'(rsp_data8));
`INST_FMT_H: rsp_data[i] = 32'(signed'(rsp_data16));
`INST_FMT_BU: rsp_data[i] = 32'(unsigned'(rsp_data8));
`INST_FMT_HU: rsp_data[i] = 32'(unsigned'(rsp_data16));
default: rsp_data[i] = rsp_data32;
endcase
end
end
assign rsp_tmask_qual = rsp_is_dup ? rsp_tmask : dcache_rsp_if.tmask;
// send load commit
wire load_rsp_stall = ~ld_commit_if.ready && ld_commit_if.valid;
VX_pipe_register #(
.DATAW (1 + `UUID_BITS + `NW_BITS + `NUM_THREADS + 32 + `NR_BITS + 1 + (`NUM_THREADS * 32) + 1),
.RESETW (1)
) rsp_pipe_reg (
.clk (clk),
.reset (reset),
.enable (!load_rsp_stall),
.data_in ({dcache_rsp_if.valid, rsp_uuid, rsp_wid, rsp_tmask_qual, rsp_pc, rsp_rd, rsp_wb, rsp_data, mbuf_pop}),
.data_out ({ld_commit_if.valid, ld_commit_if.uuid, ld_commit_if.wid, ld_commit_if.tmask, ld_commit_if.PC, ld_commit_if.rd, ld_commit_if.wb, ld_commit_if.data, ld_commit_if.eop})
);
// Can accept new cache response?
assign dcache_rsp_if.ready = ~load_rsp_stall;
// scope registration
`SCOPE_ASSIGN (dcache_req_fire, dcache_req_fire);
`SCOPE_ASSIGN (dcache_req_uuid, req_uuid);
`SCOPE_ASSIGN (dcache_req_addr, req_addr);
`SCOPE_ASSIGN (dcache_req_rw, ~req_wb);
`SCOPE_ASSIGN (dcache_req_byteen,dcache_req_if.byteen);
`SCOPE_ASSIGN (dcache_req_data, dcache_req_if.data);
`SCOPE_ASSIGN (dcache_req_tag, req_tag);
`SCOPE_ASSIGN (dcache_rsp_fire, dcache_rsp_if.tmask & {`NUM_THREADS{dcache_rsp_fire}});
`SCOPE_ASSIGN (dcache_rsp_uuid, rsp_uuid);
`SCOPE_ASSIGN (dcache_rsp_data, dcache_rsp_if.data);
`SCOPE_ASSIGN (dcache_rsp_tag, mbuf_raddr);
`ifndef SYNTHESIS
reg [`LSUQ_SIZE-1:0][(`NW_BITS + 32 + `NR_BITS + `UUID_BITS + 64 + 1)-1:0] pending_reqs;
wire [63:0] delay_timeout = 10000 * (1 ** (`L2_ENABLE + `L3_ENABLE));
always @(posedge clk) begin
if (reset) begin
pending_reqs <= '0;
end else begin
if (mbuf_push) begin
pending_reqs[mbuf_waddr] <= {req_wid, req_pc, req_rd, req_uuid, $time, 1'b1};
end
if (mbuf_pop) begin
pending_reqs[mbuf_raddr] <= '0;
end
end
for (integer i = 0; i < `LSUQ_SIZE; ++i) begin
if (pending_reqs[i][0]) begin
`ASSERT(($time - pending_reqs[i][1 +: 64]) < delay_timeout,
("%t: *** D$%0d response timeout: remaining=%b, wid=%0d, PC=%0h, rd=%0d (#%0d)",
$time, CORE_ID, rsp_rem_mask[i], pending_reqs[i][1+64+`UUID_BITS+`NR_BITS+32 +: `NW_BITS],
pending_reqs[i][1+64+`UUID_BITS+`NR_BITS +: 32],
pending_reqs[i][1+64+`UUID_BITS +: `NR_BITS],
pending_reqs[i][1+64 +: `UUID_BITS]));
end
end
end
`endif
`ifdef DBG_TRACE_CORE_DCACHE
wire dcache_req_fire_any = (| dcache_req_fire);
always @(posedge clk) begin
if (lsu_req_if.valid && fence_wait) begin
dpi_trace("%d: *** D$%0d fence wait\n", $time, CORE_ID);
end
if (dcache_req_fire_any) begin
if (dcache_req_if.rw[0]) begin
dpi_trace("%d: D$%0d Wr Req: wid=%0d, PC=%0h, tmask=%b, addr=", $time, CORE_ID, req_wid, req_pc, dcache_req_fire);
`TRACE_ARRAY1D(req_addr, `NUM_THREADS);
dpi_trace(", tag=%0h, byteen=%0h, type=", req_tag, dcache_req_if.byteen);
`TRACE_ARRAY1D(req_addr_type, `NUM_THREADS);
dpi_trace(", data=");
`TRACE_ARRAY1D(dcache_req_if.data, `NUM_THREADS);
dpi_trace(", (#%0d)\n", req_uuid);
end else begin
dpi_trace("%d: D$%0d Rd Req: prefetch=%b, wid=%0d, PC=%0h, tmask=%b, addr=", $time, CORE_ID, req_is_prefetch, req_wid, req_pc, dcache_req_fire);
`TRACE_ARRAY1D(req_addr, `NUM_THREADS);
dpi_trace(", tag=%0h, byteen=%0h, type=", req_tag, dcache_req_if.byteen);
`TRACE_ARRAY1D(req_addr_type, `NUM_THREADS);
dpi_trace(", rd=%0d, is_dup=%b (#%0d)\n", req_rd, req_is_dup, req_uuid);
end
end
if (dcache_rsp_fire) begin
dpi_trace("%d: D$%0d Rsp: prefetch=%b, wid=%0d, PC=%0h, tmask=%b, tag=%0h, rd=%0d, data=",
$time, CORE_ID, rsp_is_prefetch, rsp_wid, rsp_pc, dcache_rsp_if.tmask, mbuf_raddr, rsp_rd);
`TRACE_ARRAY1D(dcache_rsp_if.data, `NUM_THREADS);
dpi_trace(", is_dup=%b (#%0d)\n", rsp_is_dup, rsp_uuid);
end
end
`endif
endmodule
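
The per-thread D-cache request logic in the LSU above derives a 4-bit byte enable from the access width and the low two address bits, and copies the store data into the addressed byte lanes. The sketch below restates only the store path as two pure functions with a self-checking testbench. The names `store_byteen` and `store_align` are hypothetical, and the width encoding (0 = byte, 1 = halfword, anything else = word) is an assumption read off the `case` arms in the source rather than the definition of `INST_LSU_WSIZE`.

```systemverilog
// Illustrative only: helper names and the width encoding are assumptions.
module lsu_store_align_tb;
    integer errors;

    // Byte enable for a store: start from all-zero and set the touched lanes.
    function automatic [3:0] store_byteen(input [1:0] wsize, input [1:0] offset);
        reg [3:0] be;
        begin
            be = 4'b0000;
            case (wsize)
                2'd0: be[offset] = 1'b1;                 // byte
                2'd1: begin                              // halfword
                    be[offset]               = 1'b1;
                    be[{offset[1], 1'b1}]    = 1'b1;
                end
                default: be = 4'b1111;                   // word
            endcase
            store_byteen = be;
        end
    endfunction

    // Copy the low bytes of the register value up into the addressed lanes.
    // Lanes below the offset keep their original contents; the byte enable
    // masks them off.
    function automatic [31:0] store_align(input [31:0] data, input [1:0] offset);
        reg [31:0] d;
        begin
            d = data;
            case (offset)
                2'd1: d[31:8]  = data[23:0];
                2'd2: d[31:16] = data[15:0];
                2'd3: d[31:24] = data[7:0];
                default: ;
            endcase
            store_align = d;
        end
    endfunction

    task check(input [31:0] got, input [31:0] exp);
        begin
            if (got !== exp) begin
                errors = errors + 1;
                $display("MISMATCH: got %h, expected %h", got, exp);
            end
        end
    endtask

    initial begin
        errors = 0;
        check(store_byteen(2'd0, 2'd2), 4'b0100);        // byte store to lane 2
        check(store_byteen(2'd1, 2'd0), 4'b0011);        // halfword, low half
        check(store_byteen(2'd1, 2'd2), 4'b1100);        // halfword, high half
        check(store_byteen(2'd2, 2'd0), 4'b1111);        // word
        check(store_align(32'h000000AB, 2'd3), 32'hAB0000AB);
        check(store_align(32'h0000BEEF, 2'd2), 32'hBEEFBEEF);
        check(store_align(32'h00C0FFEE, 2'd1), 32'hC0FFEEEE);
        if (errors == 0) $display("lsu_store_align: all checks passed");
        $finish;
    end
endmodule
```

In the source the byte enable is initialised to `{4{req_wb}}`, so loads request all four lanes regardless of width; the sketch covers only the store case where that initial value is zero.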


@@ -1,146 +0,0 @@
`include "VX_define.vh"
module VX_mem_arb #(
parameter NUM_REQS = 1,
parameter DATA_WIDTH = 1,
parameter ADDR_WIDTH = 1,
parameter TAG_IN_WIDTH = 1,
parameter TAG_SEL_IDX = 0,
parameter BUFFERED_REQ = 0,
parameter BUFFERED_RSP = 0,
parameter TYPE = "P",
parameter DATA_SIZE = (DATA_WIDTH / 8),
parameter LOG_NUM_REQS = `CLOG2(NUM_REQS),
parameter TAG_OUT_WIDTH = TAG_IN_WIDTH + LOG_NUM_REQS
) (
input wire clk,
input wire reset,
// input requests
input wire [NUM_REQS-1:0] req_valid_in,
input wire [NUM_REQS-1:0][TAG_IN_WIDTH-1:0] req_tag_in,
input wire [NUM_REQS-1:0][ADDR_WIDTH-1:0] req_addr_in,
input wire [NUM_REQS-1:0] req_rw_in,
input wire [NUM_REQS-1:0][DATA_SIZE-1:0] req_byteen_in,
input wire [NUM_REQS-1:0][DATA_WIDTH-1:0] req_data_in,
output wire [NUM_REQS-1:0] req_ready_in,
// output request
output wire req_valid_out,
output wire [TAG_OUT_WIDTH-1:0] req_tag_out,
output wire [ADDR_WIDTH-1:0] req_addr_out,
output wire req_rw_out,
output wire [DATA_SIZE-1:0] req_byteen_out,
output wire [DATA_WIDTH-1:0] req_data_out,
input wire req_ready_out,
// input response
input wire rsp_valid_in,
input wire [TAG_OUT_WIDTH-1:0] rsp_tag_in,
input wire [DATA_WIDTH-1:0] rsp_data_in,
output wire rsp_ready_in,
// output responses
output wire [NUM_REQS-1:0] rsp_valid_out,
output wire [NUM_REQS-1:0][TAG_IN_WIDTH-1:0] rsp_tag_out,
output wire [NUM_REQS-1:0][DATA_WIDTH-1:0] rsp_data_out,
input wire [NUM_REQS-1:0] rsp_ready_out
);
localparam REQ_DATAW = TAG_OUT_WIDTH + ADDR_WIDTH + 1 + DATA_SIZE + DATA_WIDTH;
localparam RSP_DATAW = TAG_IN_WIDTH + DATA_WIDTH;
if (NUM_REQS > 1) begin
wire [NUM_REQS-1:0][REQ_DATAW-1:0] req_data_in_merged;
for (genvar i = 0; i < NUM_REQS; i++) begin
wire [TAG_OUT_WIDTH-1:0] req_tag_in_w;
VX_bits_insert #(
.N (TAG_IN_WIDTH),
.S (LOG_NUM_REQS),
.POS (TAG_SEL_IDX)
) bits_insert (
.data_in (req_tag_in[i]),
.sel_in (LOG_NUM_REQS'(i)),
.data_out (req_tag_in_w)
);
assign req_data_in_merged[i] = {req_tag_in_w, req_addr_in[i], req_rw_in[i], req_byteen_in[i], req_data_in[i]};
end
VX_stream_arbiter #(
.NUM_REQS (NUM_REQS),
.DATAW (REQ_DATAW),
.BUFFERED (BUFFERED_REQ),
.TYPE (TYPE)
) req_arb (
.clk (clk),
.reset (reset),
.valid_in (req_valid_in),
.data_in (req_data_in_merged),
.ready_in (req_ready_in),
.valid_out (req_valid_out),
.data_out ({req_tag_out, req_addr_out, req_rw_out, req_byteen_out, req_data_out}),
.ready_out (req_ready_out)
);
///////////////////////////////////////////////////////////////////////
wire [NUM_REQS-1:0][RSP_DATAW-1:0] rsp_data_out_merged;
wire [LOG_NUM_REQS-1:0] rsp_sel = rsp_tag_in[TAG_SEL_IDX +: LOG_NUM_REQS];
wire [TAG_IN_WIDTH-1:0] rsp_tag_in_w;
VX_bits_remove #(
.N (TAG_OUT_WIDTH),
.S (LOG_NUM_REQS),
.POS (TAG_SEL_IDX)
) bits_remove (
.data_in (rsp_tag_in),
.data_out (rsp_tag_in_w)
);
VX_stream_demux #(
.NUM_REQS (NUM_REQS),
.DATAW (RSP_DATAW),
.BUFFERED (BUFFERED_RSP)
) rsp_demux (
.clk (clk),
.reset (reset),
.sel_in (rsp_sel),
.valid_in (rsp_valid_in),
.data_in ({rsp_tag_in_w, rsp_data_in}),
.ready_in (rsp_ready_in),
.valid_out (rsp_valid_out),
.data_out (rsp_data_out_merged),
.ready_out (rsp_ready_out)
);
for (genvar i = 0; i < NUM_REQS; i++) begin
assign {rsp_tag_out[i], rsp_data_out[i]} = rsp_data_out_merged[i];
end
end else begin
`UNUSED_VAR (clk)
`UNUSED_VAR (reset)
assign req_valid_out = req_valid_in;
assign req_tag_out = req_tag_in;
assign req_addr_out = req_addr_in;
assign req_rw_out = req_rw_in;
assign req_byteen_out = req_byteen_in;
assign req_data_out = req_data_in;
assign req_ready_in = req_ready_out;
assign rsp_valid_out = rsp_valid_in;
assign rsp_tag_out = rsp_tag_in;
assign rsp_data_out = rsp_data_in;
assign rsp_ready_in = rsp_ready_out;
end
endmodule
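
The arbiter above widens each request tag by `LOG_NUM_REQS` bits at position `TAG_SEL_IDX` so that the response path can recover the requester index from `rsp_tag_in[TAG_SEL_IDX +: LOG_NUM_REQS]` and hand back the original tag. The sketch below models that round trip with two functions. `tag_insert` and `tag_remove` are hypothetical helpers; they reflect the assumed behaviour of `VX_bits_insert` and `VX_bits_remove`, whose source is not part of this file.

```systemverilog
// Illustrative only: models the assumed insert/remove behaviour for one
// fixed configuration (4-bit tag, 2 requesters, selector at bit 1).
module tag_route_tb;
    localparam TAG_IN_WIDTH  = 4;
    localparam SEL_BITS      = 1;   // log2 of the number of requesters
    localparam POS           = 1;   // plays the role of TAG_SEL_IDX
    localparam TAG_OUT_WIDTH = TAG_IN_WIDTH + SEL_BITS;

    // Keep bits below POS, place sel at POS, shift the rest up by SEL_BITS.
    function automatic [TAG_OUT_WIDTH-1:0] tag_insert(
        input [TAG_IN_WIDTH-1:0] tag,
        input [SEL_BITS-1:0]     sel
    );
        reg [TAG_OUT_WIDTH-1:0] t, s, lo, hi;
        begin
            t  = tag;                               // zero-extend
            s  = sel;
            lo = t & ((1 << POS) - 1);
            hi = (t >> POS) << (POS + SEL_BITS);
            tag_insert = hi | (s << POS) | lo;
        end
    endfunction

    // Inverse operation: drop SEL_BITS bits at POS and close the gap.
    function automatic [TAG_IN_WIDTH-1:0] tag_remove(
        input [TAG_OUT_WIDTH-1:0] tag
    );
        reg [TAG_OUT_WIDTH-1:0] lo, hi, merged;
        begin
            lo     = tag & ((1 << POS) - 1);
            hi     = (tag >> (POS + SEL_BITS)) << POS;
            merged = hi | lo;
            tag_remove = merged[TAG_IN_WIDTH-1:0];
        end
    endfunction

    integer ti, si, errors;
    reg [TAG_OUT_WIDTH-1:0] wide;
    reg [TAG_IN_WIDTH-1:0]  back;

    initial begin
        errors = 0;
        for (ti = 0; ti < (1 << TAG_IN_WIDTH); ti = ti + 1) begin
            for (si = 0; si < (1 << SEL_BITS); si = si + 1) begin
                wide = tag_insert(ti[TAG_IN_WIDTH-1:0], si[SEL_BITS-1:0]);
                back = tag_remove(wide);
                // The selector must be readable at the agreed position ...
                if (wide[POS +: SEL_BITS] !== si[SEL_BITS-1:0]) errors = errors + 1;
                // ... and removing it must return the original tag.
                if (back !== ti[TAG_IN_WIDTH-1:0]) errors = errors + 1;
            end
        end
        if (errors == 0) $display("tag_route: round trip holds for all tags");
        else             $display("tag_route: %0d mismatches", errors);
        $finish;
    end
endmodule
```

Placing the selector at a configurable position rather than at the top of the tag lets low-order flag bits keep their positions, which appears to be why `VX_mem_unit` passes `TAG_SEL_IDX (1)` with the comment "Skip 0 for NC flag".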


@@ -1,420 +0,0 @@
`include "VX_define.vh"
module VX_mem_unit # (
parameter CORE_ID = 0
) (
`SCOPE_IO_VX_mem_unit
input wire clk,
input wire reset,
`ifdef PERF_ENABLE
VX_perf_memsys_if.master perf_memsys_if,
`endif
// Core <-> Dcache
VX_dcache_req_if.slave dcache_req_if,
VX_dcache_rsp_if.master dcache_rsp_if,
// Core <-> Icache
VX_icache_req_if.slave icache_req_if,
VX_icache_rsp_if.master icache_rsp_if,
// Memory
VX_mem_req_if.master mem_req_if,
VX_mem_rsp_if.slave mem_rsp_if
);
`ifdef PERF_ENABLE
VX_perf_cache_if perf_icache_if(), perf_dcache_if(), perf_smem_if();
`endif
VX_mem_req_if #(
.DATA_WIDTH (`ICACHE_MEM_DATA_WIDTH),
.ADDR_WIDTH (`ICACHE_MEM_ADDR_WIDTH),
.TAG_WIDTH (`ICACHE_MEM_TAG_WIDTH)
) icache_mem_req_if();
VX_mem_rsp_if #(
.DATA_WIDTH (`ICACHE_MEM_DATA_WIDTH),
.TAG_WIDTH (`ICACHE_MEM_TAG_WIDTH)
) icache_mem_rsp_if();
VX_mem_req_if #(
.DATA_WIDTH (`DCACHE_MEM_DATA_WIDTH),
.ADDR_WIDTH (`DCACHE_MEM_ADDR_WIDTH),
.TAG_WIDTH (`DCACHE_MEM_TAG_WIDTH)
) dcache_mem_req_if();
VX_mem_rsp_if #(
.DATA_WIDTH (`DCACHE_MEM_DATA_WIDTH),
.TAG_WIDTH (`DCACHE_MEM_TAG_WIDTH)
) dcache_mem_rsp_if();
VX_dcache_req_if #(
.NUM_REQS (`DCACHE_NUM_REQS),
.WORD_SIZE (`DCACHE_WORD_SIZE),
.TAG_WIDTH (`DCACHE_CORE_TAG_WIDTH-`SM_ENABLE)
) dcache_req_tmp_if();
VX_dcache_rsp_if #(
.NUM_REQS (`DCACHE_NUM_REQS),
.WORD_SIZE (`DCACHE_WORD_SIZE),
.TAG_WIDTH (`DCACHE_CORE_TAG_WIDTH-`SM_ENABLE)
) dcache_rsp_tmp_if();
`RESET_RELAY (icache_reset);
`RESET_RELAY (dcache_reset);
`RESET_RELAY (mem_arb_reset);
VX_cache #(
.CACHE_ID (`ICACHE_ID),
.CACHE_SIZE (`ICACHE_SIZE),
.CACHE_LINE_SIZE (`ICACHE_LINE_SIZE),
.NUM_BANKS (1),
.WORD_SIZE (`ICACHE_WORD_SIZE),
.NUM_REQS (1),
.CREQ_SIZE (`ICACHE_CREQ_SIZE),
.CRSQ_SIZE (`ICACHE_CRSQ_SIZE),
.MSHR_SIZE (`ICACHE_MSHR_SIZE),
.MRSQ_SIZE (`ICACHE_MRSQ_SIZE),
.MREQ_SIZE (`ICACHE_MREQ_SIZE),
.WRITE_ENABLE (0),
.CORE_TAG_WIDTH (`ICACHE_CORE_TAG_WIDTH),
.CORE_TAG_ID_BITS (`ICACHE_CORE_TAG_ID_BITS),
.MEM_TAG_WIDTH (`ICACHE_MEM_TAG_WIDTH)
) icache (
`SCOPE_BIND_VX_mem_unit_icache
.clk (clk),
.reset (icache_reset),
// Core request
.core_req_valid (icache_req_if.valid),
.core_req_rw (1'b0),
.core_req_byteen ('b0),
.core_req_addr (icache_req_if.addr),
.core_req_data ('x),
.core_req_tag (icache_req_if.tag),
.core_req_ready (icache_req_if.ready),
// Core response
.core_rsp_valid (icache_rsp_if.valid),
.core_rsp_data (icache_rsp_if.data),
.core_rsp_tag (icache_rsp_if.tag),
.core_rsp_ready (icache_rsp_if.ready),
`UNUSED_PIN (core_rsp_tmask),
`ifdef PERF_ENABLE
.perf_cache_if (perf_icache_if),
`endif
// Memory Request
.mem_req_valid (icache_mem_req_if.valid),
.mem_req_rw (icache_mem_req_if.rw),
.mem_req_byteen (icache_mem_req_if.byteen),
.mem_req_addr (icache_mem_req_if.addr),
.mem_req_data (icache_mem_req_if.data),
.mem_req_tag (icache_mem_req_if.tag),
.mem_req_ready (icache_mem_req_if.ready),
// Memory response
.mem_rsp_valid (icache_mem_rsp_if.valid),
.mem_rsp_data (icache_mem_rsp_if.data),
.mem_rsp_tag (icache_mem_rsp_if.tag),
.mem_rsp_ready (icache_mem_rsp_if.ready)
);
VX_cache #(
.CACHE_ID (`DCACHE_ID),
.CACHE_SIZE (`DCACHE_SIZE),
.CACHE_LINE_SIZE (`DCACHE_LINE_SIZE),
.NUM_BANKS (`DCACHE_NUM_BANKS),
.NUM_PORTS (`DCACHE_NUM_PORTS),
.WORD_SIZE (`DCACHE_WORD_SIZE),
.NUM_REQS (`DCACHE_NUM_REQS),
.CREQ_SIZE (`DCACHE_CREQ_SIZE),
.CRSQ_SIZE (`DCACHE_CRSQ_SIZE),
.MSHR_SIZE (`DCACHE_MSHR_SIZE),
.MRSQ_SIZE (`DCACHE_MRSQ_SIZE),
.MREQ_SIZE (`DCACHE_MREQ_SIZE),
.WRITE_ENABLE (1),
.CORE_TAG_WIDTH (`DCACHE_CORE_TAG_WIDTH-`SM_ENABLE),
.CORE_TAG_ID_BITS (`DCACHE_CORE_TAG_ID_BITS-`SM_ENABLE),
.MEM_TAG_WIDTH (`DCACHE_MEM_TAG_WIDTH),
.NC_ENABLE (1)
) dcache (
`SCOPE_BIND_VX_mem_unit_dcache
.clk (clk),
.reset (dcache_reset),
// Core req
.core_req_valid (dcache_req_tmp_if.valid),
.core_req_rw (dcache_req_tmp_if.rw),
.core_req_byteen (dcache_req_tmp_if.byteen),
.core_req_addr (dcache_req_tmp_if.addr),
.core_req_data (dcache_req_tmp_if.data),
.core_req_tag (dcache_req_tmp_if.tag),
.core_req_ready (dcache_req_tmp_if.ready),
// Core response
.core_rsp_valid (dcache_rsp_tmp_if.valid),
.core_rsp_tmask (dcache_rsp_tmp_if.tmask),
.core_rsp_data (dcache_rsp_tmp_if.data),
.core_rsp_tag (dcache_rsp_tmp_if.tag),
.core_rsp_ready (dcache_rsp_tmp_if.ready),
`ifdef PERF_ENABLE
.perf_cache_if (perf_dcache_if),
`endif
// Memory request
.mem_req_valid (dcache_mem_req_if.valid),
.mem_req_rw (dcache_mem_req_if.rw),
.mem_req_byteen (dcache_mem_req_if.byteen),
.mem_req_addr (dcache_mem_req_if.addr),
.mem_req_data (dcache_mem_req_if.data),
.mem_req_tag (dcache_mem_req_if.tag),
.mem_req_ready (dcache_mem_req_if.ready),
// Memory response
.mem_rsp_valid (dcache_mem_rsp_if.valid),
.mem_rsp_data (dcache_mem_rsp_if.data),
.mem_rsp_tag (dcache_mem_rsp_if.tag),
.mem_rsp_ready (dcache_mem_rsp_if.ready)
);
if (`SM_ENABLE) begin
VX_dcache_req_if #(
.NUM_REQS (`DCACHE_NUM_REQS),
.WORD_SIZE (`DCACHE_WORD_SIZE),
.TAG_WIDTH (`DCACHE_CORE_TAG_WIDTH-`SM_ENABLE)
) smem_req_if();
VX_dcache_rsp_if #(
.NUM_REQS (`DCACHE_NUM_REQS),
.WORD_SIZE (`DCACHE_WORD_SIZE),
.TAG_WIDTH (`DCACHE_CORE_TAG_WIDTH-`SM_ENABLE)
) smem_rsp_if();
`RESET_RELAY (smem_arb_reset);
`RESET_RELAY (smem_reset);
VX_smem_arb #(
.NUM_REQS (2),
.LANES (`NUM_THREADS),
.DATA_SIZE (4),
.TAG_IN_WIDTH (`DCACHE_CORE_TAG_WIDTH),
.TAG_SEL_IDX (0), // SM flag
.TYPE ("P"),
.BUFFERED_REQ (2),
.BUFFERED_RSP (1)
) smem_arb (
.clk (clk),
.reset (smem_arb_reset),
// input request
.req_valid_in (dcache_req_if.valid),
.req_rw_in (dcache_req_if.rw),
.req_byteen_in (dcache_req_if.byteen),
.req_addr_in (dcache_req_if.addr),
.req_data_in (dcache_req_if.data),
.req_tag_in (dcache_req_if.tag),
.req_ready_in (dcache_req_if.ready),
// output requests
.req_valid_out ({smem_req_if.valid, dcache_req_tmp_if.valid}),
.req_rw_out ({smem_req_if.rw, dcache_req_tmp_if.rw}),
.req_byteen_out ({smem_req_if.byteen, dcache_req_tmp_if.byteen}),
.req_addr_out ({smem_req_if.addr, dcache_req_tmp_if.addr}),
.req_data_out ({smem_req_if.data, dcache_req_tmp_if.data}),
.req_tag_out ({smem_req_if.tag, dcache_req_tmp_if.tag}),
.req_ready_out ({smem_req_if.ready, dcache_req_tmp_if.ready}),
// input responses
.rsp_valid_in ({smem_rsp_if.valid, dcache_rsp_tmp_if.valid}),
.rsp_tmask_in ({smem_rsp_if.tmask, dcache_rsp_tmp_if.tmask}),
.rsp_data_in ({smem_rsp_if.data, dcache_rsp_tmp_if.data}),
.rsp_tag_in ({smem_rsp_if.tag, dcache_rsp_tmp_if.tag}),
.rsp_ready_in ({smem_rsp_if.ready, dcache_rsp_tmp_if.ready}),
// output response
.rsp_valid_out (dcache_rsp_if.valid),
.rsp_tmask_out (dcache_rsp_if.tmask),
.rsp_tag_out (dcache_rsp_if.tag),
.rsp_data_out (dcache_rsp_if.data),
.rsp_ready_out (dcache_rsp_if.ready)
);
VX_shared_mem #(
.CACHE_ID (`SMEM_ID),
.CACHE_SIZE (`SMEM_SIZE),
.NUM_BANKS (`SMEM_NUM_BANKS),
.WORD_SIZE (`SMEM_WORD_SIZE),
.NUM_REQS (`SMEM_NUM_REQS),
.CREQ_SIZE (`SMEM_CREQ_SIZE),
.CRSQ_SIZE (`SMEM_CRSQ_SIZE),
.CORE_TAG_WIDTH (`DCACHE_CORE_TAG_WIDTH-`SM_ENABLE),
.CORE_TAG_ID_BITS (`DCACHE_CORE_TAG_ID_BITS-`SM_ENABLE),
.BANK_ADDR_OFFSET (`SMEM_BANK_ADDR_OFFSET)
) smem (
.clk (clk),
.reset (smem_reset),
`ifdef PERF_ENABLE
.perf_cache_if (perf_smem_if),
`endif
// Core request
.core_req_valid (smem_req_if.valid),
.core_req_rw (smem_req_if.rw),
.core_req_byteen (smem_req_if.byteen),
.core_req_addr (smem_req_if.addr),
.core_req_data (smem_req_if.data),
.core_req_tag (smem_req_if.tag),
.core_req_ready (smem_req_if.ready),
// Core response
.core_rsp_valid (smem_rsp_if.valid),
.core_rsp_tmask (smem_rsp_if.tmask),
.core_rsp_data (smem_rsp_if.data),
.core_rsp_tag (smem_rsp_if.tag),
.core_rsp_ready (smem_rsp_if.ready)
);
end else begin
// core to D-cache request
for (genvar i = 0; i < `DCACHE_NUM_REQS; ++i) begin
VX_skid_buffer #(
.DATAW ((32-`CLOG2(`DCACHE_WORD_SIZE)) + 1 + `DCACHE_WORD_SIZE + (8*`DCACHE_WORD_SIZE) + `DCACHE_CORE_TAG_WIDTH)
) req_buf (
.clk (clk),
.reset (reset),
.valid_in (dcache_req_if.valid[i]),
.data_in ({dcache_req_if.addr[i], dcache_req_if.rw[i], dcache_req_if.byteen[i], dcache_req_if.data[i], dcache_req_if.tag[i]}),
.ready_in (dcache_req_if.ready[i]),
.valid_out (dcache_req_tmp_if.valid[i]),
.data_out ({dcache_req_tmp_if.addr[i], dcache_req_tmp_if.rw[i], dcache_req_tmp_if.byteen[i], dcache_req_tmp_if.data[i], dcache_req_tmp_if.tag[i]}),
.ready_out (dcache_req_tmp_if.ready[i])
);
end
// D-cache to core response
assign dcache_rsp_if.valid = dcache_rsp_tmp_if.valid;
assign dcache_rsp_if.tmask = dcache_rsp_tmp_if.tmask;
assign dcache_rsp_if.tag = dcache_rsp_tmp_if.tag;
assign dcache_rsp_if.data = dcache_rsp_tmp_if.data;
assign dcache_rsp_tmp_if.ready = dcache_rsp_if.ready;
end
wire [`DCACHE_MEM_TAG_WIDTH-1:0] icache_mem_req_tag = `DCACHE_MEM_TAG_WIDTH'(icache_mem_req_if.tag);
wire [`DCACHE_MEM_TAG_WIDTH-1:0] icache_mem_rsp_tag;
assign icache_mem_rsp_if.tag = icache_mem_rsp_tag[`ICACHE_MEM_TAG_WIDTH-1:0];
`UNUSED_VAR (icache_mem_rsp_tag)
VX_mem_arb #(
.NUM_REQS (2),
.DATA_WIDTH (`DCACHE_MEM_DATA_WIDTH),
.ADDR_WIDTH (`DCACHE_MEM_ADDR_WIDTH),
.TAG_IN_WIDTH (`DCACHE_MEM_TAG_WIDTH),
.TYPE ("R"),
.TAG_SEL_IDX (1), // Skip 0 for NC flag
.BUFFERED_REQ (1),
.BUFFERED_RSP (2)
) mem_arb (
.clk (clk),
.reset (mem_arb_reset),
// Source request
.req_valid_in ({dcache_mem_req_if.valid, icache_mem_req_if.valid}),
.req_rw_in ({dcache_mem_req_if.rw, icache_mem_req_if.rw}),
.req_byteen_in ({dcache_mem_req_if.byteen, icache_mem_req_if.byteen}),
.req_addr_in ({dcache_mem_req_if.addr, icache_mem_req_if.addr}),
.req_data_in ({dcache_mem_req_if.data, icache_mem_req_if.data}),
.req_tag_in ({dcache_mem_req_if.tag, icache_mem_req_tag}),
.req_ready_in ({dcache_mem_req_if.ready, icache_mem_req_if.ready}),
// Memory request
.req_valid_out (mem_req_if.valid),
.req_rw_out (mem_req_if.rw),
.req_byteen_out (mem_req_if.byteen),
.req_addr_out (mem_req_if.addr),
.req_data_out (mem_req_if.data),
.req_tag_out (mem_req_if.tag),
.req_ready_out (mem_req_if.ready),
// Source response
.rsp_valid_out ({dcache_mem_rsp_if.valid, icache_mem_rsp_if.valid}),
.rsp_data_out ({dcache_mem_rsp_if.data, icache_mem_rsp_if.data}),
.rsp_tag_out ({dcache_mem_rsp_if.tag, icache_mem_rsp_tag}),
.rsp_ready_out ({dcache_mem_rsp_if.ready, icache_mem_rsp_if.ready}),
// Memory response
.rsp_valid_in (mem_rsp_if.valid),
.rsp_tag_in (mem_rsp_if.tag),
.rsp_data_in (mem_rsp_if.data),
.rsp_ready_in (mem_rsp_if.ready)
);
`ifdef PERF_ENABLE
`UNUSED_VAR (perf_dcache_if.mem_stalls)
`UNUSED_VAR (perf_dcache_if.crsp_stalls)
assign perf_memsys_if.icache_reads = perf_icache_if.reads;
assign perf_memsys_if.icache_read_misses = perf_icache_if.read_misses;
assign perf_memsys_if.dcache_reads = perf_dcache_if.reads;
assign perf_memsys_if.dcache_writes = perf_dcache_if.writes;
assign perf_memsys_if.dcache_read_misses = perf_dcache_if.read_misses;
assign perf_memsys_if.dcache_write_misses= perf_dcache_if.write_misses;
assign perf_memsys_if.dcache_bank_stalls = perf_dcache_if.bank_stalls;
assign perf_memsys_if.dcache_mshr_stalls = perf_dcache_if.mshr_stalls;
if (`SM_ENABLE) begin
assign perf_memsys_if.smem_reads = perf_smem_if.reads;
assign perf_memsys_if.smem_writes = perf_smem_if.writes;
assign perf_memsys_if.smem_bank_stalls = perf_smem_if.bank_stalls;
end else begin
assign perf_memsys_if.smem_reads = 0;
assign perf_memsys_if.smem_writes = 0;
assign perf_memsys_if.smem_bank_stalls = 0;
end
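// Number of memory reads in flight: +1 when a read request fires alone,
// -1 when a response fires alone, unchanged when both or neither fire.
reg [`PERF_CTR_BITS-1:0] perf_mem_pending_reads;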
always @(posedge clk) begin
if (reset) begin
perf_mem_pending_reads <= 0;
end else begin
perf_mem_pending_reads <= perf_mem_pending_reads +
`PERF_CTR_BITS'($signed(2'((mem_req_if.valid && mem_req_if.ready && !mem_req_if.rw) && !(mem_rsp_if.valid && mem_rsp_if.ready)) -
2'((mem_rsp_if.valid && mem_rsp_if.ready) && !(mem_req_if.valid && mem_req_if.ready && !mem_req_if.rw))));
end
end
reg [`PERF_CTR_BITS-1:0] perf_mem_reads;
reg [`PERF_CTR_BITS-1:0] perf_mem_writes;
reg [`PERF_CTR_BITS-1:0] perf_mem_lat;
always @(posedge clk) begin
if (reset) begin
perf_mem_reads <= 0;
perf_mem_writes <= 0;
perf_mem_lat <= 0;
end else begin
if (mem_req_if.valid && mem_req_if.ready && !mem_req_if.rw) begin
perf_mem_reads <= perf_mem_reads + `PERF_CTR_BITS'd1;
end
if (mem_req_if.valid && mem_req_if.ready && mem_req_if.rw) begin
perf_mem_writes <= perf_mem_writes + `PERF_CTR_BITS'd1;
end
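// Accumulate the in-flight read count every cycle; dividing the total by
// perf_mem_reads approximates the average read latency in cycles.
perf_mem_lat <= perf_mem_lat + perf_mem_pending_reads;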
end
end
assign perf_memsys_if.mem_reads = perf_mem_reads;
assign perf_memsys_if.mem_writes = perf_mem_writes;
assign perf_memsys_if.mem_latency = perf_mem_lat;
`endif
endmodule

@@ -1,226 +0,0 @@
`include "VX_define.vh"
module VX_muldiv (
input wire clk,
input wire reset,
// Inputs
input wire [`INST_MUL_BITS-1:0] alu_op,
input wire [`UUID_BITS-1:0] uuid_in,
input wire [`NW_BITS-1:0] wid_in,
input wire [`NUM_THREADS-1:0] tmask_in,
input wire [31:0] PC_in,
input wire [`NR_BITS-1:0] rd_in,
input wire wb_in,
input wire [`NUM_THREADS-1:0][31:0] alu_in1,
input wire [`NUM_THREADS-1:0][31:0] alu_in2,
// Outputs
output wire [`UUID_BITS-1:0] uuid_out,
output wire [`NW_BITS-1:0] wid_out,
output wire [`NUM_THREADS-1:0] tmask_out,
output wire [31:0] PC_out,
output wire [`NR_BITS-1:0] rd_out,
output wire wb_out,
output wire [`NUM_THREADS-1:0][31:0] data_out,
// handshake
input wire valid_in,
output wire ready_in,
output wire valid_out,
input wire ready_out
);
wire is_div_op = `INST_MUL_IS_DIV(alu_op);
wire [`NUM_THREADS-1:0][31:0] mul_result;
wire [`UUID_BITS-1:0] mul_uuid_out;
wire [`NW_BITS-1:0] mul_wid_out;
wire [`NUM_THREADS-1:0] mul_tmask_out;
wire [31:0] mul_PC_out;
wire [`NR_BITS-1:0] mul_rd_out;
wire mul_wb_out;
wire stall_out;
wire mul_valid_out;
wire mul_valid_in = valid_in && !is_div_op;
wire mul_ready_in = ~stall_out || ~mul_valid_out;
wire is_mulh_in = (alu_op != `INST_MUL_MUL);
wire is_signed_mul_a = (alu_op != `INST_MUL_MULHU);
wire is_signed_mul_b = (alu_op != `INST_MUL_MULHU && alu_op != `INST_MUL_MULHSU);
`ifdef IMUL_DPI
wire [`NUM_THREADS-1:0][31:0] mul_result_tmp;
wire mul_fire_in = mul_valid_in && mul_ready_in;
for (genvar i = 0; i < `NUM_THREADS; i++) begin
wire [31:0] mul_resultl, mul_resulth;
always @(*) begin
dpi_imul (mul_fire_in, alu_in1[i], alu_in2[i], is_signed_mul_a, is_signed_mul_b, mul_resultl, mul_resulth);
end
assign mul_result_tmp[i] = is_mulh_in ? mul_resulth : mul_resultl;
end
VX_shift_register #(
.DATAW (1 + `UUID_BITS + `NW_BITS + `NUM_THREADS + 32 + `NR_BITS + 1 + (`NUM_THREADS * 32)),
.DEPTH (`LATENCY_IMUL),
.RESETW (1)
) mul_shift_reg (
.clk(clk),
.reset (reset),
.enable (mul_ready_in),
.data_in ({mul_valid_in, uuid_in, wid_in, tmask_in, PC_in, rd_in, wb_in, mul_result_tmp}),
.data_out ({mul_valid_out, mul_uuid_out, mul_wid_out, mul_tmask_out, mul_PC_out, mul_rd_out, mul_wb_out, mul_result})
);
`else
wire is_mulh_out;
for (genvar i = 0; i < `NUM_THREADS; i++) begin
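// Extend each operand to 33 bits (sign bit for signed operands, zero otherwise)
// so one signed multiplier covers MUL, MULH, MULHSU and MULHU.
wire [32:0] mul_in1 = {is_signed_mul_a & alu_in1[i][31], alu_in1[i]};
wire [32:0] mul_in2 = {is_signed_mul_b & alu_in2[i][31], alu_in2[i]};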
`IGNORE_UNUSED_BEGIN
wire [65:0] mul_result_tmp;
`IGNORE_UNUSED_END
VX_multiplier #(
.WIDTHA (33),
.WIDTHB (33),
.WIDTHP (66),
.SIGNED (1),
.LATENCY (`LATENCY_IMUL)
) multiplier (
.clk (clk),
.enable (mul_ready_in),
.dataa (mul_in1),
.datab (mul_in2),
.result (mul_result_tmp)
);
assign mul_result[i] = is_mulh_out ? mul_result_tmp[63:32] : mul_result_tmp[31:0];
end
VX_shift_register #(
.DATAW (1 + `UUID_BITS + `NW_BITS + `NUM_THREADS + 32 + `NR_BITS + 1 + 1),
.DEPTH (`LATENCY_IMUL),
.RESETW (1)
) mul_shift_reg (
.clk(clk),
.reset (reset),
.enable (mul_ready_in),
.data_in ({mul_valid_in, uuid_in, wid_in, tmask_in, PC_in, rd_in, wb_in, is_mulh_in}),
.data_out ({mul_valid_out, mul_uuid_out, mul_wid_out, mul_tmask_out, mul_PC_out, mul_rd_out, mul_wb_out, is_mulh_out})
);
`endif
///////////////////////////////////////////////////////////////////////////
wire [`NUM_THREADS-1:0][31:0] div_result;
wire [`UUID_BITS-1:0] div_uuid_out;
wire [`NW_BITS-1:0] div_wid_out;
wire [`NUM_THREADS-1:0] div_tmask_out;
wire [31:0] div_PC_out;
wire [`NR_BITS-1:0] div_rd_out;
wire div_wb_out;
wire is_rem_op_in = (alu_op == `INST_MUL_REM) || (alu_op == `INST_MUL_REMU);
wire is_signed_div = (alu_op == `INST_MUL_DIV) || (alu_op == `INST_MUL_REM);
wire div_valid_in = valid_in && is_div_op;
wire div_ready_out = ~stall_out && ~mul_valid_out; // arbitration prioritizes MUL
wire div_ready_in;
wire div_valid_out;
`ifdef IDIV_DPI
wire [`NUM_THREADS-1:0][31:0] div_result_tmp;
wire div_fire_in = div_valid_in && div_ready_in;
for (genvar i = 0; i < `NUM_THREADS; i++) begin
wire [31:0] div_quotient, div_remainder;
always @(*) begin
dpi_idiv (div_fire_in, alu_in1[i], alu_in2[i], is_signed_div, div_quotient, div_remainder);
end
assign div_result_tmp[i] = is_rem_op_in ? div_remainder : div_quotient;
end
VX_shift_register #(
.DATAW (1 + `UUID_BITS + `NW_BITS + `NUM_THREADS + 32 + `NR_BITS + 1 + (`NUM_THREADS * 32)),
.DEPTH (`LATENCY_IMUL),
.RESETW (1)
) div_shift_reg (
.clk(clk),
.reset (reset),
.enable (div_ready_in),
.data_in ({div_valid_in, uuid_in, wid_in, tmask_in, PC_in, rd_in, wb_in, div_result_tmp}),
.data_out ({div_valid_out, div_uuid_out, div_wid_out, div_tmask_out, div_PC_out, div_rd_out, div_wb_out, div_result})
);
assign div_ready_in = div_ready_out || ~div_valid_out;
`else
wire [`NUM_THREADS-1:0][31:0] div_result_tmp, rem_result_tmp;
wire is_rem_op_out;
VX_serial_div #(
.WIDTHN (32),
.WIDTHD (32),
.WIDTHQ (32),
.WIDTHR (32),
.LANES (`NUM_THREADS),
.TAGW (`UUID_BITS + `NW_BITS + `NUM_THREADS + 32 + `NR_BITS + 1 + 1)
) divide (
.clk (clk),
.reset (reset),
.valid_in (div_valid_in),
.ready_in (div_ready_in),
.signed_mode(is_signed_div),
.tag_in ({uuid_in, wid_in, tmask_in, PC_in, rd_in, wb_in, is_rem_op_in}),
.numer (alu_in1),
.denom (alu_in2),
.quotient (div_result_tmp),
.remainder (rem_result_tmp),
.ready_out (div_ready_out),
.valid_out (div_valid_out),
.tag_out ({div_uuid_out, div_wid_out, div_tmask_out, div_PC_out, div_rd_out, div_wb_out, is_rem_op_out})
);
assign div_result = is_rem_op_out ? rem_result_tmp : div_result_tmp;
`endif
///////////////////////////////////////////////////////////////////////////
wire rsp_valid = mul_valid_out || div_valid_out;
wire [`UUID_BITS-1:0] rsp_uuid = mul_valid_out ? mul_uuid_out : div_uuid_out;
wire [`NW_BITS-1:0] rsp_wid = mul_valid_out ? mul_wid_out : div_wid_out;
wire [`NUM_THREADS-1:0] rsp_tmask = mul_valid_out ? mul_tmask_out : div_tmask_out;
wire [31:0] rsp_PC = mul_valid_out ? mul_PC_out : div_PC_out;
wire [`NR_BITS-1:0] rsp_rd = mul_valid_out ? mul_rd_out : div_rd_out;
wire rsp_wb = mul_valid_out ? mul_wb_out : div_wb_out;
wire [`NUM_THREADS-1:0][31:0] rsp_data = mul_valid_out ? mul_result : div_result;
assign stall_out = ~ready_out && valid_out;
VX_pipe_register #(
.DATAW (1 + `UUID_BITS + `NW_BITS + `NUM_THREADS + 32 + `NR_BITS + 1 + (`NUM_THREADS * 32)),
.RESETW (1)
) pipe_reg (
.clk (clk),
.reset (reset),
.enable (~stall_out),
.data_in ({rsp_valid, rsp_uuid, rsp_wid, rsp_tmask, rsp_PC, rsp_rd, rsp_wb, rsp_data}),
.data_out ({valid_out, uuid_out, wid_out, tmask_out, PC_out, rd_out, wb_out, data_out})
);
// can accept new request?
assign ready_in = is_div_op ? div_ready_in : mul_ready_in;
endmodule

@@ -1,261 +0,0 @@
`include "VX_define.vh"
module VX_pipeline #(
parameter CORE_ID = 0
) (
`SCOPE_IO_VX_pipeline
// Clock
input wire clk,
input wire reset,
// Dcache core request
output wire [`NUM_THREADS-1:0] dcache_req_valid,
output wire [`NUM_THREADS-1:0] dcache_req_rw,
output wire [`NUM_THREADS-1:0][3:0] dcache_req_byteen,
output wire [`NUM_THREADS-1:0][29:0] dcache_req_addr,
output wire [`NUM_THREADS-1:0][31:0] dcache_req_data,
output wire [`NUM_THREADS-1:0][`DCACHE_CORE_TAG_WIDTH-1:0] dcache_req_tag,
input wire [`NUM_THREADS-1:0] dcache_req_ready,
// Dcache core response
input wire dcache_rsp_valid,
input wire [`NUM_THREADS-1:0] dcache_rsp_tmask,
input wire [`NUM_THREADS-1:0][31:0] dcache_rsp_data,
input wire [`DCACHE_CORE_TAG_WIDTH-1:0] dcache_rsp_tag,
output wire dcache_rsp_ready,
// Icache core request
output wire icache_req_valid,
output wire [29:0] icache_req_addr,
output wire [`ICACHE_CORE_TAG_WIDTH-1:0] icache_req_tag,
input wire icache_req_ready,
// Icache core response
input wire icache_rsp_valid,
input wire [31:0] icache_rsp_data,
input wire [`ICACHE_CORE_TAG_WIDTH-1:0] icache_rsp_tag,
output wire icache_rsp_ready,
`ifdef PERF_ENABLE
VX_perf_memsys_if.slave perf_memsys_if,
`endif
// Status
output wire busy
);
//
// Dcache request
//
VX_dcache_req_if #(
.NUM_REQS (`NUM_THREADS),
.WORD_SIZE (4),
.TAG_WIDTH (`DCACHE_CORE_TAG_WIDTH)
) dcache_req_if();
assign dcache_req_valid = dcache_req_if.valid;
assign dcache_req_rw = dcache_req_if.rw;
assign dcache_req_byteen = dcache_req_if.byteen;
assign dcache_req_addr = dcache_req_if.addr;
assign dcache_req_data = dcache_req_if.data;
assign dcache_req_tag = dcache_req_if.tag;
assign dcache_req_if.ready = dcache_req_ready;
//
// Dcache response
//
VX_dcache_rsp_if #(
.NUM_REQS (`NUM_THREADS),
.WORD_SIZE (4),
.TAG_WIDTH (`DCACHE_CORE_TAG_WIDTH)
) dcache_rsp_if();
assign dcache_rsp_if.valid = dcache_rsp_valid;
assign dcache_rsp_if.tmask = dcache_rsp_tmask;
assign dcache_rsp_if.data = dcache_rsp_data;
assign dcache_rsp_if.tag = dcache_rsp_tag;
assign dcache_rsp_ready = dcache_rsp_if.ready;
//
// Icache request
//
VX_icache_req_if #(
.WORD_SIZE (4),
.TAG_WIDTH (`ICACHE_CORE_TAG_WIDTH)
) icache_req_if();
assign icache_req_valid = icache_req_if.valid;
assign icache_req_addr = icache_req_if.addr;
assign icache_req_tag = icache_req_if.tag;
assign icache_req_if.ready = icache_req_ready;
//
// Icache response
//
VX_icache_rsp_if #(
.WORD_SIZE (4),
.TAG_WIDTH (`ICACHE_CORE_TAG_WIDTH)
) icache_rsp_if();
assign icache_rsp_if.valid = icache_rsp_valid;
assign icache_rsp_if.data = icache_rsp_data;
assign icache_rsp_if.tag = icache_rsp_tag;
assign icache_rsp_ready = icache_rsp_if.ready;
///////////////////////////////////////////////////////////////////////////
VX_fetch_to_csr_if fetch_to_csr_if();
VX_cmt_to_csr_if cmt_to_csr_if();
VX_decode_if decode_if();
VX_branch_ctl_if branch_ctl_if();
VX_warp_ctl_if warp_ctl_if();
VX_ifetch_rsp_if ifetch_rsp_if();
VX_alu_req_if alu_req_if();
VX_lsu_req_if lsu_req_if();
VX_csr_req_if csr_req_if();
`ifdef EXT_F_ENABLE
VX_fpu_req_if fpu_req_if();
`endif
VX_gpu_req_if gpu_req_if();
VX_writeback_if writeback_if();
VX_wstall_if wstall_if();
VX_join_if join_if();
VX_commit_if alu_commit_if();
VX_commit_if ld_commit_if();
VX_commit_if st_commit_if();
VX_commit_if csr_commit_if();
`ifdef EXT_F_ENABLE
VX_commit_if fpu_commit_if();
`endif
VX_commit_if gpu_commit_if();
`ifdef PERF_ENABLE
VX_perf_pipeline_if perf_pipeline_if();
`endif
`RESET_RELAY (fetch_reset);
`RESET_RELAY (decode_reset);
`RESET_RELAY (issue_reset);
`RESET_RELAY (execute_reset);
`RESET_RELAY (commit_reset);
VX_fetch #(
.CORE_ID(CORE_ID)
) fetch (
`SCOPE_BIND_VX_pipeline_fetch
.clk (clk),
.reset (fetch_reset),
.icache_req_if (icache_req_if),
.icache_rsp_if (icache_rsp_if),
.wstall_if (wstall_if),
.join_if (join_if),
.warp_ctl_if (warp_ctl_if),
.branch_ctl_if (branch_ctl_if),
.ifetch_rsp_if (ifetch_rsp_if),
.fetch_to_csr_if(fetch_to_csr_if),
.busy (busy)
);
VX_decode #(
.CORE_ID(CORE_ID)
) decode (
.clk (clk),
.reset (decode_reset),
`ifdef PERF_ENABLE
.perf_decode_if (perf_pipeline_if.decode),
`endif
.ifetch_rsp_if (ifetch_rsp_if),
.decode_if (decode_if),
.wstall_if (wstall_if),
.join_if (join_if)
);
VX_issue #(
.CORE_ID(CORE_ID)
) issue (
`SCOPE_BIND_VX_pipeline_issue
.clk (clk),
.reset (issue_reset),
`ifdef PERF_ENABLE
.perf_issue_if (perf_pipeline_if.issue),
`endif
.decode_if (decode_if),
.writeback_if (writeback_if),
.alu_req_if (alu_req_if),
.lsu_req_if (lsu_req_if),
.csr_req_if (csr_req_if),
`ifdef EXT_F_ENABLE
.fpu_req_if (fpu_req_if),
`endif
.gpu_req_if (gpu_req_if)
);
VX_execute #(
.CORE_ID(CORE_ID)
) execute (
`SCOPE_BIND_VX_pipeline_execute
.clk (clk),
.reset (execute_reset),
`ifdef PERF_ENABLE
.perf_memsys_if (perf_memsys_if),
.perf_pipeline_if (perf_pipeline_if),
`endif
.dcache_req_if (dcache_req_if),
.dcache_rsp_if (dcache_rsp_if),
.cmt_to_csr_if (cmt_to_csr_if),
.fetch_to_csr_if(fetch_to_csr_if),
.alu_req_if (alu_req_if),
.lsu_req_if (lsu_req_if),
.csr_req_if (csr_req_if),
`ifdef EXT_F_ENABLE
.fpu_req_if (fpu_req_if),
`endif
.gpu_req_if (gpu_req_if),
.warp_ctl_if (warp_ctl_if),
.branch_ctl_if (branch_ctl_if),
.alu_commit_if (alu_commit_if),
.ld_commit_if (ld_commit_if),
.st_commit_if (st_commit_if),
.csr_commit_if (csr_commit_if),
`ifdef EXT_F_ENABLE
.fpu_commit_if (fpu_commit_if),
`endif
.gpu_commit_if (gpu_commit_if),
.busy (busy)
);
VX_commit #(
.CORE_ID(CORE_ID)
) commit (
.clk (clk),
.reset (commit_reset),
.alu_commit_if (alu_commit_if),
.ld_commit_if (ld_commit_if),
.st_commit_if (st_commit_if),
.csr_commit_if (csr_commit_if),
`ifdef EXT_F_ENABLE
.fpu_commit_if (fpu_commit_if),
`endif
.gpu_commit_if (gpu_commit_if),
.writeback_if (writeback_if),
.cmt_to_csr_if (cmt_to_csr_if)
);
endmodule

@@ -1,7 +1,47 @@
`ifndef VX_PLATFORM
`define VX_PLATFORM
// Copyright © 2019-2023
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
`ifndef SYNTHESIS
`ifndef VX_PLATFORM_VH
`define VX_PLATFORM_VH
// enable synthesizable build by default if not SIMULATION
`ifndef SIMULATION
`define SYNTHESIS
`define NDEBUG
`define DPI_DISABLE
`else // !SYNTHESIS
`define SV_DPI
`endif
// chipyard-specific configs
`define GPR_RESET
`define GPR_DUPLICATED
`define GBAR_ENABLE
`define GBAR_CLUSTER_ENABLE
`define ICACHE_DISABLE
`define DCACHE_DISABLE
`ifdef SYNTHESIS
`define NUM_BARRIERS 8
`define NUM_CORES 4
`define NUM_THREADS 8
`define NUM_WARPS 8
`define FPU_FPNEW
// `define FIRESIM
`endif // SYNTHESIS
`ifdef SV_DPI
`include "util_dpi.vh"
`endif
@@ -9,8 +49,42 @@
///////////////////////////////////////////////////////////////////////////////
`ifndef SYNTHESIS
`ifdef VIVADO
`define STRING
`else
`ifdef SYNTHESIS
`define STRING
`else
`define STRING string
`endif
`endif
`ifdef SYNTHESIS
`define TRACE(level, args) $write args
`define TRACE_STARTTIME 32'd10
`define TRACING_ON
`define TRACING_OFF
`ifndef NDEBUG
`define DEBUG_BLOCK(x) x
`else
`define DEBUG_BLOCK(x)
`endif
`define IGNORE_UNOPTFLAT_BEGIN
`define IGNORE_UNOPTFLAT_END
`define IGNORE_UNUSED_BEGIN
`define IGNORE_UNUSED_END
`define IGNORE_WARNINGS_BEGIN
`define IGNORE_WARNINGS_END
`define UNUSED_PARAM(x)
`define UNUSED_SPARAM(x)
`define UNUSED_VAR(x)
`define UNUSED_PIN(x) . x ()
`define UNUSED_ARG(x) x
`else // !SYNTHESIS
`ifdef VERILATOR
`define SIMULATION
`define TRACING_ON /* verilator tracing_on */
`define TRACING_OFF /* verilator tracing_off */
`ifndef NDEBUG
`define DEBUG_BLOCK(x) /* verilator lint_off UNUSED */ \
x \
@@ -19,6 +93,10 @@
`define DEBUG_BLOCK(x)
`endif
`define IGNORE_UNOPTFLAT_BEGIN /* verilator lint_off UNOPTFLAT */
`define IGNORE_UNOPTFLAT_END /* verilator lint_on UNOPTFLAT */
`define IGNORE_UNUSED_BEGIN /* verilator lint_off UNUSED */
`define IGNORE_UNUSED_END /* verilator lint_on UNUSED */
@@ -30,7 +108,9 @@
/* verilator lint_off UNDRIVEN */ \
/* verilator lint_off DECLFILENAME */ \
/* verilator lint_off IMPLICIT */ \
/* verilator lint_off IMPORTSTAR */
/* verilator lint_off PINMISSING */ \
/* verilator lint_off IMPORTSTAR */ \
/* verilator lint_off UNSIGNED */
`define IGNORE_WARNINGS_END /* verilator lint_on UNUSED */ \
/* verilator lint_on PINCONNECTEMPTY */ \
@@ -39,68 +119,151 @@
/* verilator lint_on UNDRIVEN */ \
/* verilator lint_on DECLFILENAME */ \
/* verilator lint_on IMPLICIT */ \
/* verilator lint_on IMPORTSTAR */
/* verilator lint_on PINMISSING */ \
/* verilator lint_on IMPORTSTAR */ \
/* verilator lint_on UNSIGNED */
`define UNUSED_PARAM(x) /* verilator lint_off UNUSED */ \
localparam __``x = x; \
/* verilator lint_on UNUSED */
`define UNUSED_VAR(x) always @(x) begin end
`define UNUSED_SPARAM(x) /* verilator lint_off UNUSED */ \
localparam `STRING __``x = x; \
/* verilator lint_on UNUSED */
`define UNUSED_PIN(x) /* verilator lint_off PINCONNECTEMPTY */ \
. x () \
/* verilator lint_on PINCONNECTEMPTY */
`define UNUSED_VAR(x) if (1) begin \
/* verilator lint_off UNUSED */ \
wire [$bits(x)-1:0] __x = x; \
/* verilator lint_on UNUSED */ \
end
`define ERROR(msg) \
$error msg
`define UNUSED_PIN(x) /* verilator lint_off PINCONNECTEMPTY */ \
. x () \
/* verilator lint_on PINCONNECTEMPTY */
`define UNUSED_ARG(x) /* verilator lint_off UNUSED */ \
x \
/* verilator lint_on UNUSED */
`define TRACE(level, args) dpi_trace(level, $sformatf args)
// squelch spurious traces in the first few cycles caused by reset
// delay
`define TRACE_STARTTIME 32'd10
`endif
// NOTE(hansung): define these macros to be the same as VERILATOR under VCS;
// they will mostly be ignored
`ifdef VCS
`define TRACING_ON /* verilator tracing_on */
`define TRACING_OFF /* verilator tracing_off */
`ifndef NDEBUG
`define DEBUG_BLOCK(x) /* verilator lint_off UNUSED */ \
x \
/* verilator lint_on UNUSED */
`else
`define DEBUG_BLOCK(x)
`endif
`define ASSERT(cond, msg) \
assert(cond) else $error msg
`define IGNORE_UNOPTFLAT_BEGIN /* verilator lint_off UNOPTFLAT */
`define STATIC_ASSERT(cond, msg) \
`define IGNORE_UNOPTFLAT_END /* verilator lint_on UNOPTFLAT */
`define IGNORE_UNUSED_BEGIN /* verilator lint_off UNUSED */
`define IGNORE_UNUSED_END /* verilator lint_on UNUSED */
`define IGNORE_WARNINGS_BEGIN /* verilator lint_off UNUSED */ \
/* verilator lint_off PINCONNECTEMPTY */ \
/* verilator lint_off WIDTH */ \
/* verilator lint_off UNOPTFLAT */ \
/* verilator lint_off UNDRIVEN */ \
/* verilator lint_off DECLFILENAME */ \
/* verilator lint_off IMPLICIT */ \
/* verilator lint_off PINMISSING */ \
/* verilator lint_off IMPORTSTAR */ \
/* verilator lint_off UNSIGNED */
`define IGNORE_WARNINGS_END /* verilator lint_on UNUSED */ \
/* verilator lint_on PINCONNECTEMPTY */ \
/* verilator lint_on WIDTH */ \
/* verilator lint_on UNOPTFLAT */ \
/* verilator lint_on UNDRIVEN */ \
/* verilator lint_on DECLFILENAME */ \
/* verilator lint_on IMPLICIT */ \
/* verilator lint_on PINMISSING */ \
/* verilator lint_on IMPORTSTAR */ \
/* verilator lint_on UNSIGNED */
`define UNUSED_PARAM(x) /* verilator lint_off UNUSED */ \
localparam __``x = x; \
/* verilator lint_on UNUSED */
`define UNUSED_SPARAM(x) /* verilator lint_off UNUSED */ \
localparam `STRING __``x = x; \
/* verilator lint_on UNUSED */
`define UNUSED_VAR(x) if (1) begin \
/* verilator lint_off UNUSED */ \
wire [$bits(x)-1:0] __x = x; \
/* verilator lint_on UNUSED */ \
end
`define UNUSED_PIN(x) /* verilator lint_off PINCONNECTEMPTY */ \
. x () \
/* verilator lint_on PINCONNECTEMPTY */
`define UNUSED_ARG(x) /* verilator lint_off UNUSED */ \
x \
/* verilator lint_on UNUSED */
`define TRACE(level, args) $write args
// squelch spurious traces in the first few cycles caused by reset
// delay
`define TRACE_STARTTIME 32'd10
`endif
`endif
`ifdef SIMULATION
`define STATIC_ASSERT(cond, msg) \
generate \
if (!(cond)) $error msg; \
endgenerate
`define RUNTIME_ASSERT(cond, msg) \
always @(posedge clk) begin \
assert(cond) else $error msg; \
end
`define ERROR(msg) \
$error msg
`define TRACING_ON /* verilator tracing_on */
`define TRACING_OFF /* verilator tracing_off */
`define ASSERT(cond, msg) \
assert(cond) else $error msg
`else // SYNTHESIS
`define DEBUG_BLOCK(x)
`define IGNORE_UNUSED_BEGIN
`define IGNORE_UNUSED_END
`define IGNORE_WARNINGS_BEGIN
`define IGNORE_WARNINGS_END
`define UNUSED_PARAM(x)
`define UNUSED_VAR(x)
`define UNUSED_PIN(x) . x ()
`define ERROR(msg)
`define ASSERT(cond, msg) if (cond);
`define STATIC_ASSERT(cond, msg)
`define RUNTIME_ASSERT(cond, msg)
`define TRACING_ON
`define TRACING_OFF
`endif // SYNTHESIS
`define RUNTIME_ASSERT(cond, msg) \
always @(posedge clk) begin \
assert(cond) else $error msg; \
end
`else
`define STATIC_ASSERT(cond, msg)
`define ERROR(msg) //
`define ASSERT(cond, msg) //
`define RUNTIME_ASSERT(cond, msg)
`endif
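// Usage sketch for the assertion macros above (illustrative only; the names
// N, valid and overflow are hypothetical). The msg argument is pasted directly
// after $error, so it has to carry its own parentheses:
//   `STATIC_ASSERT(`ISPOW2(N), ("invalid parameter: N=%0d must be a power of 2", N))
//   `RUNTIME_ASSERT(!(valid && overflow), ("%t: unexpected overflow", $time))
// RUNTIME_ASSERT expands to an always block on posedge clk, so the enclosing
// scope needs a signal named clk. Outside SIMULATION both expand to nothing.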
///////////////////////////////////////////////////////////////////////////////
`ifdef QUARTUS
`define MAX_FANOUT 4
`define IF_DATA_SIZE(x) $bits(x.data)
`define USE_FAST_BRAM (* ramstyle = "MLAB, no_rw_check" *)
`define NO_RW_RAM_CHECK (* altera_attribute = "-name add_pass_through_logic_to_inferred_rams off" *)
`define DISABLE_BRAM (* ramstyle = "logic" *)
`define PRESERVE_REG (* preserve *)
`define PRESERVE_NET (* preserve *)
`elsif VIVADO
`define MAX_FANOUT 4
`define IF_DATA_SIZE(x) $bits(x.data)
`define USE_FAST_BRAM (* ram_style = "distributed" *)
`define NO_RW_RAM_CHECK (* rw_addr_collision = "no" *)
`define DISABLE_BRAM (* ram_style = "registers" *)
`define PRESERVE_NET (* keep = "true" *)
`else
`define MAX_FANOUT 4
`define IF_DATA_SIZE(x) x.DATA_WIDTH
`define USE_FAST_BRAM
`define NO_RW_RAM_CHECK
`define DISABLE_BRAM
`define PRESERVE_REG
`define PRESERVE_NET
`endif
///////////////////////////////////////////////////////////////////////////////
@@ -112,52 +275,105 @@
`define LOG2UP(x) (((x) > 1) ? $clog2(x) : 1)
`define ISPOW2(x) (((x) != 0) && (0 == ((x) & ((x) - 1))))
`define ABS(x) (($signed(x) < 0) ? (-$signed(x)) : (x));
`define ABS(x) (((x) < 0) ? (-(x)) : (x));
`ifndef MIN
`define MIN(x, y) (((x) < (y)) ? (x) : (y))
`define MAX(x, y) (((x) > (y)) ? (x) : (y))
`endif
`define UP(x) (((x) > 0) ? (x) : 1)
`ifndef MAX
`define MAX(x, y) (((x) > (y)) ? (x) : (y))
`endif
`ifndef CLAMP
`define CLAMP(x, lo, hi) (((x) > (hi)) ? (hi) : (((x) < (lo)) ? (lo) : (x)))
`endif
`ifndef UP
`define UP(x) (((x) != 0) ? (x) : 1)
`endif
`define RTRIM(x, s) x[$bits(x)-1:($bits(x)-s)]
`define LTRIM(x, s) x[s-1:0]
`define TRACE_ARRAY1D(a, m) \
dpi_trace("{"); \
for (integer i = (m-1); i >= 0; --i) begin \
if (i != (m-1)) dpi_trace(", "); \
dpi_trace("0x%0h", a[i]); \
`define TRACE_ARRAY1D(lvl, arr, m) \
`TRACE(lvl, ("{")); \
for (integer __i = (m-1); __i >= 0; --__i) begin \
if (__i != (m-1)) `TRACE(lvl, (", ")); \
`TRACE(lvl, ("0x%0h", arr[__i])); \
end \
dpi_trace("}"); \
`TRACE(lvl, ("}"));
`define TRACE_ARRAY2D(a, m, n) \
dpi_trace("{"); \
for (integer i = n-1; i >= 0; --i) begin \
if (i != (n-1)) dpi_trace(", "); \
dpi_trace("{"); \
for (integer j = (m-1); j >= 0; --j) begin \
if (j != (m-1)) dpi_trace(", "); \
dpi_trace("0x%0h", a[i][j]); \
`define TRACE_ARRAY2D(lvl, arr, m, n) \
`TRACE(lvl, ("{")); \
for (integer __i = n-1; __i >= 0; --__i) begin \
if (__i != (n-1)) `TRACE(lvl, (", ")); \
`TRACE(lvl, ("{")); \
for (integer __j = (m-1); __j >= 0; --__j) begin \
if (__j != (m-1)) `TRACE(lvl, (", "));\
`TRACE(lvl, ("0x%0h", arr[__i][__j])); \
end \
dpi_trace("}"); \
`TRACE(lvl, ("}")); \
end \
dpi_trace("}")
`TRACE(lvl, ("}"))
`define RESET_RELAY(signal) \
wire signal; \
VX_reset_relay __``signal ( \
.clk (clk), \
.reset (reset), \
.reset_o (signal) \
`define RESET_RELAY_EX(dst, src, size, fanout) \
wire [size-1:0] dst; \
VX_reset_relay #(.N(size), .MAX_FANOUT(fanout)) __``dst ( \
.clk (clk), \
.reset (src), \
.reset_o (dst) \
)
`define POP_COUNT(out, in) \
VX_popcount #( \
.N ($bits(in)) \
) __``out ( \
.in_i (in), \
.cnt_o (out) \
)
`define RESET_RELAY_EN(dst, src, enable) \
`RESET_RELAY_EX (dst, src, 1, ((enable) ? 0 : -1))
`endif
`define RESET_RELAY(dst, src) \
`RESET_RELAY_EX (dst, src, 1, 0)
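// Usage sketch for the two-argument form (illustrative only):
//   `RESET_RELAY (fetch_reset, reset);
// This expands through RESET_RELAY_EX to a 1-bit wire fetch_reset driven by a
// VX_reset_relay instance named __fetch_reset with N=1 and MAX_FANOUT=0.
// The instance is clocked by clk, which must be visible in the enclosing scope.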
// size(x): 0 -> 0, 1 -> 1, 2 -> 2, 3 -> 2, 4-> 2
`define OUT_REG_TO_EB_SIZE(out_reg) `MIN(out_reg, 2)
// reg(x): 0 -> 0, 1 -> 1, 2 -> 0, 3 -> 1, 4 -> 2
`define OUT_REG_TO_EB_REG(out_reg) ((out_reg & 1) + ((out_reg >> 2) << 1))
`define REPEAT(n,f,s) `_REPEAT_``n(f,s)
`define _REPEAT_0(f,s)
`define _REPEAT_1(f,s) `f(0)
`define _REPEAT_2(f,s) `f(1) `s `_REPEAT_1(f,s)
`define _REPEAT_3(f,s) `f(2) `s `_REPEAT_2(f,s)
`define _REPEAT_4(f,s) `f(3) `s `_REPEAT_3(f,s)
`define _REPEAT_5(f,s) `f(4) `s `_REPEAT_4(f,s)
`define _REPEAT_6(f,s) `f(5) `s `_REPEAT_5(f,s)
`define _REPEAT_7(f,s) `f(6) `s `_REPEAT_6(f,s)
`define _REPEAT_8(f,s) `f(7) `s `_REPEAT_7(f,s)
`define _REPEAT_9(f,s) `f(8) `s `_REPEAT_8(f,s)
`define _REPEAT_10(f,s) `f(9) `s `_REPEAT_9(f,s)
`define _REPEAT_11(f,s) `f(10) `s `_REPEAT_10(f,s)
`define _REPEAT_12(f,s) `f(11) `s `_REPEAT_11(f,s)
`define _REPEAT_13(f,s) `f(12) `s `_REPEAT_12(f,s)
`define _REPEAT_14(f,s) `f(13) `s `_REPEAT_13(f,s)
`define _REPEAT_15(f,s) `f(14) `s `_REPEAT_14(f,s)
`define _REPEAT_16(f,s) `f(15) `s `_REPEAT_15(f,s)
`define _REPEAT_17(f,s) `f(16) `s `_REPEAT_16(f,s)
`define _REPEAT_18(f,s) `f(17) `s `_REPEAT_17(f,s)
`define _REPEAT_19(f,s) `f(18) `s `_REPEAT_18(f,s)
`define _REPEAT_20(f,s) `f(19) `s `_REPEAT_19(f,s)
`define _REPEAT_21(f,s) `f(20) `s `_REPEAT_20(f,s)
`define _REPEAT_22(f,s) `f(21) `s `_REPEAT_21(f,s)
`define _REPEAT_23(f,s) `f(22) `s `_REPEAT_22(f,s)
`define _REPEAT_24(f,s) `f(23) `s `_REPEAT_23(f,s)
`define _REPEAT_25(f,s) `f(24) `s `_REPEAT_24(f,s)
`define _REPEAT_26(f,s) `f(25) `s `_REPEAT_25(f,s)
`define _REPEAT_27(f,s) `f(26) `s `_REPEAT_26(f,s)
`define _REPEAT_28(f,s) `f(27) `s `_REPEAT_27(f,s)
`define _REPEAT_29(f,s) `f(28) `s `_REPEAT_28(f,s)
`define _REPEAT_30(f,s) `f(29) `s `_REPEAT_29(f,s)
`define _REPEAT_31(f,s) `f(30) `s `_REPEAT_30(f,s)
`define _REPEAT_32(f,s) `f(31) `s `_REPEAT_31(f,s)
`define REPEAT_COMMA ,
`define REPEAT_SEMICOLON ;
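// Usage sketch for REPEAT (illustrative only; EXAMPLE_BIT and data are
// hypothetical names). REPEAT takes macro names without the leading backtick
// and emits the items from index n-1 down to 0; definitions exist for n = 0..32:
//   `define EXAMPLE_BIT(i) data[i]
//   wire [2:0] picked = {`REPEAT(3, EXAMPLE_BIT, REPEAT_COMMA)};
// would expand to {data[2] , data[1] , data[0]}, assuming the tool substitutes
// formal arguments that follow a backtick.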
`endif // VX_PLATFORM_VH

@@ -1,89 +1,68 @@
`ifndef VX_SCOPE
`define VX_SCOPE
// Copyright © 2019-2023
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
`ifndef VX_SCOPE_VH
`define VX_SCOPE_VH
`ifdef SCOPE
`include "scope-defs.vh"
`define SCOPE_IO_DECL \
input wire scope_reset, \
input wire scope_bus_in, \
output wire scope_bus_out,
`define SCOPE_ASSIGN(d,s) assign scope_``d = s
`define SCOPE_IO_SWITCH(__count) \
wire scope_bus_in_w [__count]; \
wire scope_bus_out_w [__count]; \
`RESET_RELAY_EX(scope_reset_w, scope_reset, __count, 4); \
VX_scope_switch #( \
.N (__count) \
) scope_switch ( \
.clk (clk), \
.reset (scope_reset), \
.req_in (scope_bus_in), \
.rsp_out (scope_bus_out), \
.req_out (scope_bus_in_w), \
.rsp_in (scope_bus_out_w) \
);
`define SCOPE_SIZE 1024
`define SCOPE_IO_BIND(__i) \
.scope_reset (scope_reset_w[__i]), \
.scope_bus_in (scope_bus_in_w[__i]), \
.scope_bus_out (scope_bus_out_w[__i]),
`define SCOPE_IO_UNUSED() \
`UNUSED_VAR (scope_reset); \
`UNUSED_VAR (scope_bus_in); \
assign scope_bus_out = 0;
`define SCOPE_IO_UNUSED_W(__i) \
`UNUSED_VAR (scope_reset_w[__i]); \
`UNUSED_VAR (scope_bus_in_w[__i]); \
assign scope_bus_out_w[__i] = 0;
`else
`define SCOPE_IO_VX_icache_stage
`define SCOPE_IO_DECL
`define SCOPE_IO_VX_fetch
`define SCOPE_IO_SWITCH(__count)
`define SCOPE_BIND_VX_fetch_icache_stage
`define SCOPE_IO_BIND(__i)
`define SCOPE_BIND_VX_fetch_warp_sched
`define SCOPE_IO_UNUSED_W(__i)
`define SCOPE_IO_VX_warp_sched
`define SCOPE_IO_VX_pipeline
`define SCOPE_BIND_VX_pipeline_fetch
`define SCOPE_IO_VX_core
`define SCOPE_BIND_VX_core_pipeline
`define SCOPE_IO_VX_cluster
`define SCOPE_BIND_VX_cluster_core(__i__)
`define SCOPE_IO_Vortex
`define SCOPE_BIND_Vortex_cluster(__i__)
`define SCOPE_BIND_afu_vortex
`define SCOPE_IO_VX_lsu_unit
`define SCOPE_IO_VX_gpu_unit
`define SCOPE_IO_VX_execute
`define SCOPE_BIND_VX_execute_lsu_unit
`define SCOPE_BIND_VX_execute_gpu_unit
`define SCOPE_BIND_VX_pipeline_execute
`define SCOPE_IO_VX_issue
`define SCOPE_BIND_VX_pipeline_issue
`define SCOPE_IO_VX_bank
`define SCOPE_IO_VX_cache
`define SCOPE_BIND_VX_cache_bank(__i__)
`define SCOPE_BIND_Vortex_l3cache
`define SCOPE_BIND_VX_cluster_l2cache
`define SCOPE_IO_VX_mem_unit
`define SCOPE_BIND_VX_mem_unit_dcache
`define SCOPE_BIND_VX_core_mem_unit
`define SCOPE_BIND_VX_mem_unit_icache
`define SCOPE_BIND_VX_mem_unit_smem
`define SCOPE_DECL_SIGNALS
`define SCOPE_DATA_LIST
`define SCOPE_UPDATE_LIST
`define SCOPE_TRIGGER
`define SCOPE_ASSIGN(d,s)
`define SCOPE_IO_UNUSED(__i)
`endif
`endif
`endif // VX_SCOPE_VH

@@ -1,85 +0,0 @@
`include "VX_define.vh"
module VX_scoreboard #(
parameter CORE_ID = 0
) (
input wire clk,
input wire reset,
VX_ibuffer_if.slave ibuffer_if,
VX_writeback_if.slave writeback_if
);
reg [`NUM_WARPS-1:0][`NUM_REGS-1:0] inuse_regs, inuse_regs_n;
wire reserve_reg = ibuffer_if.valid && ibuffer_if.ready && ibuffer_if.wb;
wire release_reg = writeback_if.valid && writeback_if.ready && writeback_if.eop;
always @(*) begin
inuse_regs_n = inuse_regs;
if (reserve_reg) begin
inuse_regs_n[ibuffer_if.wid][ibuffer_if.rd] = 1;
end
if (release_reg) begin
inuse_regs_n[writeback_if.wid][writeback_if.rd] = 0;
end
end
always @(posedge clk) begin
if (reset) begin
inuse_regs <= '0;
end else begin
inuse_regs <= inuse_regs_n;
end
end
reg deq_inuse_rd, deq_inuse_rs1, deq_inuse_rs2, deq_inuse_rs3;
always @(posedge clk) begin
deq_inuse_rd <= inuse_regs_n[ibuffer_if.wid_n][ibuffer_if.rd_n];
deq_inuse_rs1 <= inuse_regs_n[ibuffer_if.wid_n][ibuffer_if.rs1_n];
deq_inuse_rs2 <= inuse_regs_n[ibuffer_if.wid_n][ibuffer_if.rs2_n];
deq_inuse_rs3 <= inuse_regs_n[ibuffer_if.wid_n][ibuffer_if.rs3_n];
end
assign writeback_if.ready = 1'b1;
assign ibuffer_if.ready = ~(deq_inuse_rd
| deq_inuse_rs1
| deq_inuse_rs2
| deq_inuse_rs3);
`UNUSED_VAR (writeback_if.PC)
reg [31:0] deadlock_ctr;
// note: 1 ** n evaluates to 1 for any n, so this timeout is a constant 10000 cycles
wire [31:0] deadlock_timeout = 10000 * (1 ** (`L2_ENABLE + `L3_ENABLE));
always @(posedge clk) begin
if (reset) begin
deadlock_ctr <= 0;
end else begin
`ifdef DBG_TRACE_CORE_PIPELINE
if (ibuffer_if.valid && ~ibuffer_if.ready) begin
dpi_trace("%d: *** core%0d-stall: wid=%0d, PC=%0h, rd=%0d, wb=%0d, inuse=%b%b%b%b (#%0d)\n",
$time, CORE_ID, ibuffer_if.wid, ibuffer_if.PC, ibuffer_if.rd, ibuffer_if.wb,
deq_inuse_rd, deq_inuse_rs1, deq_inuse_rs2, deq_inuse_rs3, ibuffer_if.uuid);
end
`endif
if (release_reg) begin
`ASSERT(inuse_regs[writeback_if.wid][writeback_if.rd] != 0,
("%t: *** core%0d: invalid writeback register: wid=%0d, PC=%0h, rd=%0d (#%0d)",
$time, CORE_ID, writeback_if.wid, writeback_if.PC, writeback_if.rd,writeback_if.uuid));
end
if (ibuffer_if.valid && ~ibuffer_if.ready) begin
deadlock_ctr <= deadlock_ctr + 1;
`ASSERT(deadlock_ctr < deadlock_timeout,
("%t: *** core%0d-deadlock: wid=%0d, PC=%0h, rd=%0d, wb=%0d, inuse=%b%b%b%b (#%0d)",
$time, CORE_ID, ibuffer_if.wid, ibuffer_if.PC, ibuffer_if.rd, ibuffer_if.wb,
deq_inuse_rd, deq_inuse_rs1, deq_inuse_rs2, deq_inuse_rs3, ibuffer_if.uuid));
end else if (ibuffer_if.valid && ibuffer_if.ready) begin
deadlock_ctr <= 0;
end
end
end
endmodule


@@ -1,160 +0,0 @@
`include "VX_define.vh"
module VX_smem_arb #(
parameter NUM_REQS = 1,
parameter LANES = 1,
parameter DATA_SIZE = 1,
parameter TAG_IN_WIDTH = 1,
parameter TAG_SEL_IDX = 0,
parameter BUFFERED_REQ = 0,
parameter BUFFERED_RSP = 0,
parameter TYPE = "P",
parameter ADDR_WIDTH = (32-`CLOG2(DATA_SIZE)),
parameter DATA_WIDTH = (8 * DATA_SIZE),
parameter LOG_NUM_REQS = `CLOG2(NUM_REQS),
parameter TAG_OUT_WIDTH = TAG_IN_WIDTH - LOG_NUM_REQS
) (
input wire clk,
input wire reset,
// input request
input wire [LANES-1:0] req_valid_in,
input wire [LANES-1:0] req_rw_in,
input wire [LANES-1:0][DATA_SIZE-1:0] req_byteen_in,
input wire [LANES-1:0][ADDR_WIDTH-1:0] req_addr_in,
input wire [LANES-1:0][DATA_WIDTH-1:0] req_data_in,
input wire [LANES-1:0][TAG_IN_WIDTH-1:0] req_tag_in,
output wire [LANES-1:0] req_ready_in,
// output requests
output wire [NUM_REQS-1:0][LANES-1:0] req_valid_out,
output wire [NUM_REQS-1:0][LANES-1:0] req_rw_out,
output wire [NUM_REQS-1:0][LANES-1:0][DATA_SIZE-1:0] req_byteen_out,
output wire [NUM_REQS-1:0][LANES-1:0][ADDR_WIDTH-1:0] req_addr_out,
output wire [NUM_REQS-1:0][LANES-1:0][DATA_WIDTH-1:0] req_data_out,
output wire [NUM_REQS-1:0][LANES-1:0][TAG_OUT_WIDTH-1:0] req_tag_out,
input wire [NUM_REQS-1:0][LANES-1:0] req_ready_out,
// input responses
input wire [NUM_REQS-1:0] rsp_valid_in,
input wire [NUM_REQS-1:0][LANES-1:0] rsp_tmask_in,
input wire [NUM_REQS-1:0][LANES-1:0][DATA_WIDTH-1:0] rsp_data_in,
input wire [NUM_REQS-1:0][TAG_OUT_WIDTH-1:0] rsp_tag_in,
output wire [NUM_REQS-1:0] rsp_ready_in,
// output response
output wire rsp_valid_out,
output wire [LANES-1:0] rsp_tmask_out,
output wire [LANES-1:0][DATA_WIDTH-1:0] rsp_data_out,
output wire [TAG_IN_WIDTH-1:0] rsp_tag_out,
input wire rsp_ready_out
);
localparam REQ_DATAW = TAG_OUT_WIDTH + ADDR_WIDTH + 1 + DATA_SIZE + DATA_WIDTH;
localparam RSP_DATAW = LANES * (1 + DATA_WIDTH) + TAG_IN_WIDTH;
if (NUM_REQS > 1) begin
wire [LANES-1:0][REQ_DATAW-1:0] req_data_in_merged;
wire [NUM_REQS-1:0][LANES-1:0][REQ_DATAW-1:0] req_data_out_merged;
wire [LANES-1:0][LOG_NUM_REQS-1:0] req_sel;
wire [LANES-1:0][TAG_OUT_WIDTH-1:0] req_tag_in_w;
for (genvar i = 0; i < LANES; ++i) begin
assign req_sel[i] = req_tag_in[i][TAG_SEL_IDX +: LOG_NUM_REQS];
VX_bits_remove #(
.N (TAG_IN_WIDTH),
.S (LOG_NUM_REQS),
.POS (TAG_SEL_IDX)
) bits_remove (
.data_in (req_tag_in[i]),
.data_out (req_tag_in_w[i])
);
assign req_data_in_merged[i] = {req_tag_in_w[i], req_addr_in[i], req_rw_in[i], req_byteen_in[i], req_data_in[i]};
end
VX_stream_demux #(
.NUM_REQS (NUM_REQS),
.LANES (LANES),
.DATAW (REQ_DATAW),
.BUFFERED (BUFFERED_REQ)
) req_demux (
.clk (clk),
.reset (reset),
.sel_in (req_sel),
.valid_in (req_valid_in),
.data_in (req_data_in_merged),
.ready_in (req_ready_in),
.valid_out (req_valid_out),
.data_out (req_data_out_merged),
.ready_out (req_ready_out)
);
for (genvar i = 0; i < NUM_REQS; i++) begin
for (genvar j = 0; j < LANES; ++j) begin
assign {req_tag_out[i][j], req_addr_out[i][j], req_rw_out[i][j], req_byteen_out[i][j], req_data_out[i][j]} = req_data_out_merged[i][j];
end
end
///////////////////////////////////////////////////////////////////////
wire [NUM_REQS-1:0][RSP_DATAW-1:0] rsp_data_in_merged;
for (genvar i = 0; i < NUM_REQS; i++) begin
wire [TAG_IN_WIDTH-1:0] rsp_tag_in_w;
VX_bits_insert #(
.N (TAG_OUT_WIDTH),
.S (LOG_NUM_REQS),
.POS (TAG_SEL_IDX)
) bits_insert (
.data_in (rsp_tag_in[i]),
.sel_in (LOG_NUM_REQS'(i)),
.data_out (rsp_tag_in_w)
);
assign rsp_data_in_merged[i] = {rsp_tag_in_w, rsp_tmask_in[i], rsp_data_in[i]};
end
VX_stream_arbiter #(
.NUM_REQS (NUM_REQS),
.LANES (1),
.DATAW (RSP_DATAW),
.BUFFERED (BUFFERED_RSP),
.TYPE (TYPE)
) rsp_arb (
.clk (clk),
.reset (reset),
.valid_in (rsp_valid_in),
.data_in (rsp_data_in_merged),
.ready_in (rsp_ready_in),
.valid_out (rsp_valid_out),
.data_out ({rsp_tag_out, rsp_tmask_out, rsp_data_out}),
.ready_out (rsp_ready_out)
);
end else begin
`UNUSED_VAR (clk)
`UNUSED_VAR (reset)
assign req_valid_out = req_valid_in;
assign req_tag_out = req_tag_in;
assign req_addr_out = req_addr_in;
assign req_rw_out = req_rw_in;
assign req_byteen_out = req_byteen_in;
assign req_data_out = req_data_in;
assign req_ready_in = req_ready_out;
assign rsp_valid_out = rsp_valid_in;
assign rsp_tmask_out = rsp_tmask_in;
assign rsp_tag_out = rsp_tag_in;
assign rsp_data_out = rsp_data_in;
assign rsp_ready_in = rsp_ready_out;
end
endmodule
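
The tag handling in `VX_smem_arb` can be illustrated with plain integers. This is a sketch under an assumption: `VX_bits_remove` is taken to drop `S` bits at position `POS`, and `VX_bits_insert` to put them back, which is what the parameter names suggest but is not shown in this diff. The Python helper names are hypothetical.

```python
def bits_extract(value, s, pos):
    """Read the `s`-bit field starting at bit `pos` (the request selector)."""
    return (value >> pos) & ((1 << s) - 1)


def bits_remove(value, s, pos):
    """Drop `s` bits starting at bit `pos`; the higher bits shift down."""
    low = value & ((1 << pos) - 1)
    high = value >> (pos + s)
    return (high << pos) | low


def bits_insert(value, sel, s, pos):
    """Re-insert the `s`-bit field `sel` at bit `pos`; higher bits shift up."""
    low = value & ((1 << pos) - 1)
    high = value >> pos
    return (high << (pos + s)) | ((sel & ((1 << s) - 1)) << pos) | low
```

On the request path the selector is extracted and removed from the tag; on the response path the index of the responding output is inserted at the same position, so the original tag is recovered.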

247
hw/rtl/VX_socket.sv Normal file

@@ -0,0 +1,247 @@
// Copyright © 2019-2023
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
`include "VX_define.vh"
module VX_socket import VX_gpu_pkg::*; #(
parameter SOCKET_ID = 0
) (
`SCOPE_IO_DECL
// Clock
input wire clk,
input wire reset,
`ifdef PERF_ENABLE
VX_mem_perf_if.slave mem_perf_if,
`endif
// DCRs
VX_dcr_bus_if.slave dcr_bus_if,
// Memory
VX_mem_bus_if.master mem_bus_if,
`ifdef GBAR_ENABLE
// Barrier
VX_gbar_bus_if.master gbar_bus_if,
`endif
// simulation helper signals
output wire sim_ebreak,
output wire [`NUM_REGS-1:0][`XLEN-1:0] sim_wb_value,
// Status
output wire busy
);
`ifdef GBAR_ENABLE
VX_gbar_bus_if per_core_gbar_bus_if[`SOCKET_SIZE]();
`RESET_RELAY (gbar_arb_reset, reset);
VX_gbar_arb #(
.NUM_REQS (`SOCKET_SIZE),
.OUT_REG ((`SOCKET_SIZE > 1) ? 2 : 0)
) gbar_arb (
.clk (clk),
.reset (gbar_arb_reset),
.bus_in_if (per_core_gbar_bus_if),
.bus_out_if (gbar_bus_if)
);
`endif
///////////////////////////////////////////////////////////////////////////
`ifdef PERF_ENABLE
VX_mem_perf_if mem_perf_tmp_if();
assign mem_perf_tmp_if.l2cache = mem_perf_if.l2cache;
assign mem_perf_tmp_if.l3cache = mem_perf_if.l3cache;
assign mem_perf_tmp_if.smem = 'x;
assign mem_perf_tmp_if.mem = mem_perf_if.mem;
`endif
///////////////////////////////////////////////////////////////////////////
VX_mem_bus_if #(
.DATA_SIZE (ICACHE_WORD_SIZE),
.TAG_WIDTH (ICACHE_TAG_WIDTH)
) per_core_icache_bus_if[`SOCKET_SIZE]();
VX_mem_bus_if #(
.DATA_SIZE (ICACHE_LINE_SIZE),
.TAG_WIDTH (ICACHE_MEM_TAG_WIDTH)
) icache_mem_bus_if();
`RESET_RELAY (icache_reset, reset);
VX_cache_cluster #(
.INSTANCE_ID ($sformatf("socket%0d-icache", SOCKET_ID)),
.NUM_UNITS (`NUM_ICACHES),
.NUM_INPUTS (`SOCKET_SIZE),
.TAG_SEL_IDX (0),
.CACHE_SIZE (`ICACHE_SIZE),
.LINE_SIZE (ICACHE_LINE_SIZE),
.NUM_BANKS (1),
.NUM_WAYS (`ICACHE_NUM_WAYS),
.WORD_SIZE (ICACHE_WORD_SIZE),
.NUM_REQS (1),
.CRSQ_SIZE (`ICACHE_CRSQ_SIZE),
.MSHR_SIZE (`ICACHE_MSHR_SIZE),
.MRSQ_SIZE (`ICACHE_MRSQ_SIZE),
.MREQ_SIZE (`ICACHE_MREQ_SIZE),
.TAG_WIDTH (ICACHE_TAG_WIDTH),
.UUID_WIDTH (`UUID_WIDTH),
.WRITE_ENABLE (0),
.CORE_OUT_REG (2),
.MEM_OUT_REG (2)
) icache (
`ifdef PERF_ENABLE
.cache_perf (mem_perf_tmp_if.icache),
`endif
.clk (clk),
.reset (icache_reset),
.core_bus_if (per_core_icache_bus_if),
.mem_bus_if (icache_mem_bus_if)
);
///////////////////////////////////////////////////////////////////////////
VX_mem_bus_if #(
.DATA_SIZE (DCACHE_WORD_SIZE),
.TAG_WIDTH (DCACHE_NOSM_TAG_WIDTH)
) per_core_dcache_bus_if[`SOCKET_SIZE * DCACHE_NUM_REQS]();
VX_mem_bus_if #(
.DATA_SIZE (DCACHE_LINE_SIZE),
.TAG_WIDTH (DCACHE_MEM_TAG_WIDTH)
) dcache_mem_bus_if();
`RESET_RELAY (dcache_reset, reset);
VX_cache_cluster #(
.INSTANCE_ID ($sformatf("socket%0d-dcache", SOCKET_ID)),
.NUM_UNITS (`NUM_DCACHES),
.NUM_INPUTS (`SOCKET_SIZE),
.TAG_SEL_IDX (1),
.CACHE_SIZE (`DCACHE_SIZE),
.LINE_SIZE (DCACHE_LINE_SIZE),
.NUM_BANKS (`DCACHE_NUM_BANKS),
.NUM_WAYS (`DCACHE_NUM_WAYS),
.WORD_SIZE (DCACHE_WORD_SIZE),
.NUM_REQS (DCACHE_NUM_REQS),
.CRSQ_SIZE (`DCACHE_CRSQ_SIZE),
.MSHR_SIZE (`DCACHE_MSHR_SIZE),
.MRSQ_SIZE (`DCACHE_MRSQ_SIZE),
.MREQ_SIZE (`DCACHE_MREQ_SIZE),
.TAG_WIDTH (DCACHE_NOSM_TAG_WIDTH),
.UUID_WIDTH (`UUID_WIDTH),
.WRITE_ENABLE (1),
.NC_ENABLE (1),
.CORE_OUT_REG (`SM_ENABLED ? 2 : 1),
.MEM_OUT_REG (2)
) dcache (
`ifdef PERF_ENABLE
.cache_perf (mem_perf_tmp_if.dcache),
`endif
.clk (clk),
.reset (dcache_reset),
.core_bus_if (per_core_dcache_bus_if),
.mem_bus_if (dcache_mem_bus_if)
);
///////////////////////////////////////////////////////////////////////////
VX_mem_bus_if #(
.DATA_SIZE (`L1_LINE_SIZE),
.TAG_WIDTH (L1_MEM_TAG_WIDTH)
) l1_mem_bus_if[2]();
VX_mem_bus_if #(
.DATA_SIZE (`L1_LINE_SIZE),
.TAG_WIDTH (L1_MEM_ARB_TAG_WIDTH)
) l1_mem_arb_bus_if[1]();
`ASSIGN_VX_MEM_BUS_IF_X (l1_mem_bus_if[0], icache_mem_bus_if, L1_MEM_TAG_WIDTH, ICACHE_MEM_TAG_WIDTH);
`ASSIGN_VX_MEM_BUS_IF_X (l1_mem_bus_if[1], dcache_mem_bus_if, L1_MEM_TAG_WIDTH, DCACHE_MEM_TAG_WIDTH);
`RESET_RELAY (mem_arb_reset, reset);
VX_mem_arb #(
.NUM_INPUTS (2),
.DATA_SIZE (`L1_LINE_SIZE),
.TAG_WIDTH (L1_MEM_TAG_WIDTH),
.TAG_SEL_IDX (1), // Skip 0 for NC flag
.ARBITER ("R"),
.OUT_REG_REQ (2),
.OUT_REG_RSP (2)
) mem_arb (
.clk (clk),
.reset (mem_arb_reset),
.bus_in_if (l1_mem_bus_if),
.bus_out_if (l1_mem_arb_bus_if)
);
`ASSIGN_VX_MEM_BUS_IF (mem_bus_if, l1_mem_arb_bus_if[0]);
///////////////////////////////////////////////////////////////////////////
wire [`SOCKET_SIZE-1:0] per_core_sim_ebreak;
wire [`SOCKET_SIZE-1:0][`NUM_REGS-1:0][`XLEN-1:0] per_core_sim_wb_value;
assign sim_ebreak = per_core_sim_ebreak[0];
assign sim_wb_value = per_core_sim_wb_value[0];
`UNUSED_VAR (per_core_sim_ebreak)
`UNUSED_VAR (per_core_sim_wb_value)
wire [`SOCKET_SIZE-1:0] per_core_busy;
`BUFFER_DCR_BUS_IF (core_dcr_bus_if, dcr_bus_if, (`SOCKET_SIZE > 1));
`SCOPE_IO_SWITCH (`SOCKET_SIZE)
// Generate all cores
for (genvar i = 0; i < `SOCKET_SIZE; ++i) begin
`RESET_RELAY (core_reset, reset);
VX_core #(
.CORE_ID ((SOCKET_ID * `SOCKET_SIZE) + i)
) core (
`SCOPE_IO_BIND (i)
.clk (clk),
.reset (core_reset),
`ifdef PERF_ENABLE
.mem_perf_if (mem_perf_tmp_if),
`endif
.dcr_bus_if (core_dcr_bus_if),
.dcache_bus_if (per_core_dcache_bus_if[i * DCACHE_NUM_REQS +: DCACHE_NUM_REQS]),
.icache_bus_if (per_core_icache_bus_if[i]),
`ifdef GBAR_ENABLE
.gbar_bus_if (per_core_gbar_bus_if[i]),
`endif
.sim_ebreak (per_core_sim_ebreak[i]),
.sim_wb_value (per_core_sim_wb_value[i]),
.busy (per_core_busy[i])
);
end
`BUFFER_EX(busy, (| per_core_busy), 1'b1, (`SOCKET_SIZE > 1));
endmodule
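
Two index calculations in the generate loop of `VX_socket` are easy to check by hand. The Python sketch below restates them; the function names are hypothetical, and the arithmetic is taken directly from the `CORE_ID` parameter and the `+:` slice on `per_core_dcache_bus_if`.

```python
def core_id(socket_id, socket_size, i):
    """Global core ID of core `i` within socket `socket_id`."""
    return socket_id * socket_size + i


def dcache_bus_slice(i, dcache_num_reqs):
    """Bus indices handed to core `i`, matching [i*N +: N] in the RTL."""
    start = i * dcache_num_reqs
    return range(start, start + dcache_num_reqs)
```

With a socket size of 4, core 1 of socket 2 receives global ID 9; with 4 dcache requests per core it is wired to bus entries 4 through 7.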


@@ -1,148 +0,0 @@
`ifndef VX_TRACE_INSTR
`define VX_TRACE_INSTR
`include "VX_define.vh"
task trace_ex_type (
input [`EX_BITS-1:0] ex_type
);
case (ex_type)
`EX_ALU: dpi_trace("ALU");
`EX_LSU: dpi_trace("LSU");
`EX_CSR: dpi_trace("CSR");
`EX_FPU: dpi_trace("FPU");
`EX_GPU: dpi_trace("GPU");
default: dpi_trace("NOP");
endcase
endtask
task trace_ex_op (
input [`EX_BITS-1:0] ex_type,
input [`INST_OP_BITS-1:0] op_type,
input [`INST_MOD_BITS-1:0] op_mod
);
case (ex_type)
`EX_ALU: begin
if (`INST_ALU_IS_BR(op_mod)) begin
case (`INST_BR_BITS'(op_type))
`INST_BR_EQ: dpi_trace("BEQ");
`INST_BR_NE: dpi_trace("BNE");
`INST_BR_LT: dpi_trace("BLT");
`INST_BR_GE: dpi_trace("BGE");
`INST_BR_LTU: dpi_trace("BLTU");
`INST_BR_GEU: dpi_trace("BGEU");
`INST_BR_JAL: dpi_trace("JAL");
`INST_BR_JALR: dpi_trace("JALR");
`INST_BR_ECALL: dpi_trace("ECALL");
`INST_BR_EBREAK:dpi_trace("EBREAK");
`INST_BR_URET: dpi_trace("URET");
`INST_BR_SRET: dpi_trace("SRET");
`INST_BR_MRET: dpi_trace("MRET");
default: dpi_trace("?");
endcase
end else if (`INST_ALU_IS_MUL(op_mod)) begin
case (`INST_MUL_BITS'(op_type))
`INST_MUL_MUL: dpi_trace("MUL");
`INST_MUL_MULH: dpi_trace("MULH");
`INST_MUL_MULHSU:dpi_trace("MULHSU");
`INST_MUL_MULHU: dpi_trace("MULHU");
`INST_MUL_DIV: dpi_trace("DIV");
`INST_MUL_DIVU: dpi_trace("DIVU");
`INST_MUL_REM: dpi_trace("REM");
`INST_MUL_REMU: dpi_trace("REMU");
default: dpi_trace("?");
endcase
end else begin
case (`INST_ALU_BITS'(op_type))
`INST_ALU_ADD: dpi_trace("ADD");
`INST_ALU_SUB: dpi_trace("SUB");
`INST_ALU_SLL: dpi_trace("SLL");
`INST_ALU_SRL: dpi_trace("SRL");
`INST_ALU_SRA: dpi_trace("SRA");
`INST_ALU_SLT: dpi_trace("SLT");
`INST_ALU_SLTU: dpi_trace("SLTU");
`INST_ALU_XOR: dpi_trace("XOR");
`INST_ALU_OR: dpi_trace("OR");
`INST_ALU_AND: dpi_trace("AND");
`INST_ALU_LUI: dpi_trace("LUI");
`INST_ALU_AUIPC: dpi_trace("AUIPC");
default: dpi_trace("?");
endcase
end
end
`EX_LSU: begin
if (op_mod == 0) begin
case (`INST_LSU_BITS'(op_type))
`INST_LSU_LB: dpi_trace("LB");
`INST_LSU_LH: dpi_trace("LH");
`INST_LSU_LW: dpi_trace("LW");
`INST_LSU_LBU:dpi_trace("LBU");
`INST_LSU_LHU:dpi_trace("LHU");
`INST_LSU_SB: dpi_trace("SB");
`INST_LSU_SH: dpi_trace("SH");
`INST_LSU_SW: dpi_trace("SW");
default: dpi_trace("?");
endcase
end else if (op_mod == 1) begin
case (`INST_FENCE_BITS'(op_type))
`INST_FENCE_D: dpi_trace("DFENCE");
`INST_FENCE_I: dpi_trace("IFENCE");
default: dpi_trace("?");
endcase
end
end
`EX_CSR: begin
case (`INST_CSR_BITS'(op_type))
`INST_CSR_RW: dpi_trace("CSRW");
`INST_CSR_RS: dpi_trace("CSRS");
`INST_CSR_RC: dpi_trace("CSRC");
default: dpi_trace("?");
endcase
end
`EX_FPU: begin
case (`INST_FPU_BITS'(op_type))
`INST_FPU_ADD: dpi_trace("ADD");
`INST_FPU_SUB: dpi_trace("SUB");
`INST_FPU_MUL: dpi_trace("MUL");
`INST_FPU_DIV: dpi_trace("DIV");
`INST_FPU_SQRT: dpi_trace("SQRT");
`INST_FPU_MADD: dpi_trace("MADD");
`INST_FPU_NMSUB: dpi_trace("NMSUB");
`INST_FPU_NMADD: dpi_trace("NMADD");
`INST_FPU_CVTWS: dpi_trace("CVTWS");
`INST_FPU_CVTWUS:dpi_trace("CVTWUS");
`INST_FPU_CVTSW: dpi_trace("CVTSW");
`INST_FPU_CVTSWU:dpi_trace("CVTSWU");
`INST_FPU_CLASS: dpi_trace("CLASS");
`INST_FPU_CMP: dpi_trace("CMP");
`INST_FPU_MISC: begin
case (op_mod)
0: dpi_trace("SGNJ");
1: dpi_trace("SGNJN");
2: dpi_trace("SGNJX");
3: dpi_trace("MIN");
4: dpi_trace("MAX");
5: dpi_trace("MVXW");
6: dpi_trace("MVWX");
endcase
end
default: dpi_trace("?");
endcase
end
`EX_GPU: begin
case (`INST_GPU_BITS'(op_type))
`INST_GPU_TMC: dpi_trace("TMC");
`INST_GPU_WSPAWN:dpi_trace("WSPAWN");
`INST_GPU_SPLIT: dpi_trace("SPLIT");
`INST_GPU_JOIN: dpi_trace("JOIN");
`INST_GPU_BAR: dpi_trace("BAR");
`INST_GPU_PRED: dpi_trace("PRED");
`INST_GPU_TEX: dpi_trace("TEX");
default: dpi_trace("?");
endcase
end
default: dpi_trace("?");
endcase
endtask
`endif

191
hw/rtl/VX_types.vh Normal file

@@ -0,0 +1,191 @@
// Copyright © 2019-2023
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
`ifndef VX_TYPES_VH
`define VX_TYPES_VH
// Device configuration registers
`define VX_CSR_ADDR_BITS 12
`define VX_DCR_ADDR_BITS 12
`define VX_DCR_BASE_STATE_BEGIN 12'h001
`define VX_DCR_BASE_STARTUP_ADDR0 12'h001
`define VX_DCR_BASE_STARTUP_ADDR1 12'h002
`define VX_DCR_BASE_MPM_CLASS 12'h003
`define VX_DCR_BASE_STATE_END 12'h004
`define VX_DCR_BASE_STATE(addr) ((addr) - `VX_DCR_BASE_STATE_BEGIN)
`define VX_DCR_BASE_STATE_COUNT (`VX_DCR_BASE_STATE_END-`VX_DCR_BASE_STATE_BEGIN)
// Machine Performance-monitoring counters classes
`define VX_DCR_MPM_CLASS_NONE 0
`define VX_DCR_MPM_CLASS_CORE 1
`define VX_DCR_MPM_CLASS_MEM 2
// User Floating-Point CSRs
`define VX_CSR_FFLAGS 12'h001
`define VX_CSR_FRM 12'h002
`define VX_CSR_FCSR 12'h003
`define VX_CSR_SATP 12'h180
`define VX_CSR_PMPCFG0 12'h3A0
`define VX_CSR_PMPADDR0 12'h3B0
`define VX_CSR_MSTATUS 12'h300
`define VX_CSR_MISA 12'h301
`define VX_CSR_MEDELEG 12'h302
`define VX_CSR_MIDELEG 12'h303
`define VX_CSR_MIE 12'h304
`define VX_CSR_MTVEC 12'h305
`define VX_CSR_MEPC 12'h341
`define VX_CSR_MNSTATUS 12'h744
`define VX_CSR_MPM_BASE 12'hB00
`define VX_CSR_MPM_BASE_H 12'hB80
`define VX_CSR_MPM_USER 12'hB03
`define VX_CSR_MPM_USER_H 12'hB83
// Machine Performance-monitoring core counters
// PERF: Standard
`define VX_CSR_MCYCLE 12'hB00
`define VX_CSR_MCYCLE_H 12'hB80
`define VX_CSR_MPM_RESERVED 12'hB01
`define VX_CSR_MPM_RESERVED_H 12'hB81
`define VX_CSR_MINSTRET 12'hB02
`define VX_CSR_MINSTRET_H 12'hB82
// PERF: pipeline
`define VX_CSR_MPM_SCHED_ID 12'hB03
`define VX_CSR_MPM_SCHED_ID_H 12'hB83
`define VX_CSR_MPM_SCHED_ST 12'hB04
`define VX_CSR_MPM_SCHED_ST_H 12'hB84
`define VX_CSR_MPM_IBUF_ST 12'hB05
`define VX_CSR_MPM_IBUF_ST_H 12'hB85
`define VX_CSR_MPM_SCRB_ST 12'hB06
`define VX_CSR_MPM_SCRB_ST_H 12'hB86
`define VX_CSR_MPM_SCRB_ALU 12'hB07
`define VX_CSR_MPM_SCRB_ALU_H 12'hB87
`define VX_CSR_MPM_SCRB_FPU 12'hB08
`define VX_CSR_MPM_SCRB_FPU_H 12'hB88
`define VX_CSR_MPM_SCRB_LSU 12'hB09
`define VX_CSR_MPM_SCRB_LSU_H 12'hB89
`define VX_CSR_MPM_SCRB_SFU 12'hB0A
`define VX_CSR_MPM_SCRB_SFU_H 12'hB8A
// PERF: memory
`define VX_CSR_MPM_IFETCHES 12'hB0B
`define VX_CSR_MPM_IFETCHES_H 12'hB8B
`define VX_CSR_MPM_LOADS 12'hB0C
`define VX_CSR_MPM_LOADS_H 12'hB8C
`define VX_CSR_MPM_STORES 12'hB0D
`define VX_CSR_MPM_STORES_H 12'hB8D
`define VX_CSR_MPM_IFETCH_LT 12'hB0E
`define VX_CSR_MPM_IFETCH_LT_H 12'hB8E
`define VX_CSR_MPM_LOAD_LT 12'hB0F
`define VX_CSR_MPM_LOAD_LT_H 12'hB8F
// SFU: scoreboard
`define VX_CSR_MPM_SCRB_WCTL 12'hB10
`define VX_CSR_MPM_SCRB_WCTL_H 12'hB90
`define VX_CSR_MPM_SCRB_CSRS 12'hB11
`define VX_CSR_MPM_SCRB_CSRS_H 12'hB91
// Machine Performance-monitoring memory counters
// PERF: icache
`define VX_CSR_MPM_ICACHE_READS 12'hB03 // total reads
`define VX_CSR_MPM_ICACHE_READS_H 12'hB83
`define VX_CSR_MPM_ICACHE_MISS_R 12'hB04 // read misses
`define VX_CSR_MPM_ICACHE_MISS_R_H 12'hB84
`define VX_CSR_MPM_ICACHE_MSHR_ST 12'hB05 // MSHR stalls
`define VX_CSR_MPM_ICACHE_MSHR_ST_H 12'hB85
// PERF: dcache
`define VX_CSR_MPM_DCACHE_READS 12'hB06 // total reads
`define VX_CSR_MPM_DCACHE_READS_H 12'hB86
`define VX_CSR_MPM_DCACHE_WRITES 12'hB07 // total writes
`define VX_CSR_MPM_DCACHE_WRITES_H 12'hB87
`define VX_CSR_MPM_DCACHE_MISS_R 12'hB08 // read misses
`define VX_CSR_MPM_DCACHE_MISS_R_H 12'hB88
`define VX_CSR_MPM_DCACHE_MISS_W 12'hB09 // write misses
`define VX_CSR_MPM_DCACHE_MISS_W_H 12'hB89
`define VX_CSR_MPM_DCACHE_BANK_ST 12'hB0A // bank conflicts
`define VX_CSR_MPM_DCACHE_BANK_ST_H 12'hB8A
`define VX_CSR_MPM_DCACHE_MSHR_ST 12'hB0B // MSHR stalls
`define VX_CSR_MPM_DCACHE_MSHR_ST_H 12'hB8B
// PERF: l2cache
`define VX_CSR_MPM_L2CACHE_READS 12'hB0C // total reads
`define VX_CSR_MPM_L2CACHE_READS_H 12'hB8C
`define VX_CSR_MPM_L2CACHE_WRITES 12'hB0D // total writes
`define VX_CSR_MPM_L2CACHE_WRITES_H 12'hB8D
`define VX_CSR_MPM_L2CACHE_MISS_R 12'hB0E // read misses
`define VX_CSR_MPM_L2CACHE_MISS_R_H 12'hB8E
`define VX_CSR_MPM_L2CACHE_MISS_W 12'hB0F // write misses
`define VX_CSR_MPM_L2CACHE_MISS_W_H 12'hB8F
`define VX_CSR_MPM_L2CACHE_BANK_ST 12'hB10 // bank conflicts
`define VX_CSR_MPM_L2CACHE_BANK_ST_H 12'hB90
`define VX_CSR_MPM_L2CACHE_MSHR_ST 12'hB11 // MSHR stalls
`define VX_CSR_MPM_L2CACHE_MSHR_ST_H 12'hB91
// PERF: l3cache
`define VX_CSR_MPM_L3CACHE_READS 12'hB12 // total reads
`define VX_CSR_MPM_L3CACHE_READS_H 12'hB92
`define VX_CSR_MPM_L3CACHE_WRITES 12'hB13 // total writes
`define VX_CSR_MPM_L3CACHE_WRITES_H 12'hB93
`define VX_CSR_MPM_L3CACHE_MISS_R 12'hB14 // read misses
`define VX_CSR_MPM_L3CACHE_MISS_R_H 12'hB94
`define VX_CSR_MPM_L3CACHE_MISS_W 12'hB15 // write misses
`define VX_CSR_MPM_L3CACHE_MISS_W_H 12'hB95
`define VX_CSR_MPM_L3CACHE_BANK_ST 12'hB16 // bank conflicts
`define VX_CSR_MPM_L3CACHE_BANK_ST_H 12'hB96
`define VX_CSR_MPM_L3CACHE_MSHR_ST 12'hB17 // MSHR stalls
`define VX_CSR_MPM_L3CACHE_MSHR_ST_H 12'hB97
// PERF: memory
`define VX_CSR_MPM_MEM_READS 12'hB18 // total reads
`define VX_CSR_MPM_MEM_READS_H 12'hB98
`define VX_CSR_MPM_MEM_WRITES 12'hB19 // total writes
`define VX_CSR_MPM_MEM_WRITES_H 12'hB99
`define VX_CSR_MPM_MEM_LT 12'hB1A // memory latency
`define VX_CSR_MPM_MEM_LT_H 12'hB9A
// PERF: smem
`define VX_CSR_MPM_SMEM_READS 12'hB1B // memory reads
`define VX_CSR_MPM_SMEM_READS_H 12'hB9B
`define VX_CSR_MPM_SMEM_WRITES 12'hB1C // memory writes
`define VX_CSR_MPM_SMEM_WRITES_H 12'hB9C
`define VX_CSR_MPM_SMEM_BANK_ST 12'hB1D // bank conflicts
`define VX_CSR_MPM_SMEM_BANK_ST_H 12'hB9D
// Machine Information Registers
`define VX_CSR_MVENDORID 12'hF11
`define VX_CSR_MARCHID 12'hF12
`define VX_CSR_MIMPID 12'hF13
`define VX_CSR_MHARTID 12'hF14
// GPGPU CSRs
`define VX_CSR_THREAD_ID 12'hCC0
`define VX_CSR_WARP_ID 12'hCC1
`define VX_CSR_CORE_ID 12'hCC2
`define VX_CSR_WARP_MASK 12'hCC3
`define VX_CSR_THREAD_MASK 12'hCC4 // warning! this value is also used in LLVM
`define VX_CSR_GCID 12'hCC5 // legacy global core id alias used by Radiance bootrom
`define VX_CSR_NUM_THREADS 12'hFC0
`define VX_CSR_NUM_WARPS 12'hFC1
`define VX_CSR_NUM_CORES 12'hFC2
// CISC Accelerator Invocation
`define VX_CSR_ACCEL_CISC 12'hACC
`endif // VX_TYPES_VH
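
Every `_H` define in the listing above sits exactly `0x80` above its low counterpart (`12'hB00` and `12'hB80`, up to `12'hB1D` and `12'hB9D`). The Python sketch below captures that relationship. It assumes, following the usual RISC-V `mcycle`/`mcycleh` convention, that the low CSR holds bits 31:0 of a counter and the `_H` CSR holds bits 63:32; the helper names are hypothetical. The core and memory counter classes reuse the same addresses from `12'hB03` upward, so which counter an address refers to depends on the class written to `VX_DCR_BASE_MPM_CLASS`.

```python
VX_CSR_MPM_BASE = 0xB00
VX_CSR_MPM_BASE_H = 0xB80


def high_half_addr(low_addr):
    """Address of the `_H` CSR paired with a low performance-counter CSR."""
    return low_addr + (VX_CSR_MPM_BASE_H - VX_CSR_MPM_BASE)


def combine_counter(low, high):
    """Join two 32-bit CSR reads into one 64-bit counter value."""
    return ((high & 0xFFFFFFFF) << 32) | (low & 0xFFFFFFFF)
```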


@@ -1,254 +0,0 @@
`include "VX_define.vh"
module VX_warp_sched #(
parameter CORE_ID = 0
) (
`SCOPE_IO_VX_warp_sched
input wire clk,
input wire reset,
VX_warp_ctl_if.slave warp_ctl_if,
VX_wstall_if.slave wstall_if,
VX_join_if.slave join_if,
VX_branch_ctl_if.slave branch_ctl_if,
VX_ifetch_req_if.master ifetch_req_if,
VX_fetch_to_csr_if.master fetch_to_csr_if,
output wire busy
);
`UNUSED_PARAM (CORE_ID)
wire join_else;
wire [31:0] join_pc;
wire [`NUM_THREADS-1:0] join_tmask;
reg [`NUM_WARPS-1:0] active_warps, active_warps_n; // real active warps (updated when a warp is activated or disabled)
reg [`NUM_WARPS-1:0] stalled_warps; // asserted when a branch/GPGPU instruction is issued
reg [`NUM_WARPS-1:0][`NUM_THREADS-1:0] thread_masks;
reg [`NUM_WARPS-1:0][31:0] warp_pcs;
// barriers
reg [`NUM_BARRIERS-1:0][`NUM_WARPS-1:0] barrier_masks; // warps waiting on barrier
wire reached_barrier_limit; // the expected number of warps reached the barrier
// wspawn
reg [31:0] wspawn_pc;
reg [`NUM_WARPS-1:0] use_wspawn;
wire [`NW_BITS-1:0] schedule_wid;
wire [`NUM_THREADS-1:0] schedule_tmask;
wire [31:0] schedule_pc;
wire schedule_valid;
wire warp_scheduled;
reg [`UUID_BITS-1:0] issued_instrs;
wire ifetch_req_fire = ifetch_req_if.valid && ifetch_req_if.ready;
wire tmc_active = (warp_ctl_if.tmc.tmask != 0);
always @(*) begin
active_warps_n = active_warps;
if (warp_ctl_if.valid && warp_ctl_if.wspawn.valid) begin
active_warps_n = warp_ctl_if.wspawn.wmask;
end
if (warp_ctl_if.valid && warp_ctl_if.tmc.valid) begin
active_warps_n[warp_ctl_if.wid] = tmc_active;
end
end
always @(posedge clk) begin
if (reset) begin
barrier_masks <= '0;
use_wspawn <= '0;
stalled_warps <= '0;
warp_pcs <= '0;
active_warps <= '0;
thread_masks <= '0;
issued_instrs <= '0;
// activate first warp
warp_pcs[0] <= `STARTUP_ADDR;
active_warps[0] <= 1;
thread_masks[0] <= 1;
end else begin
if (warp_ctl_if.valid && warp_ctl_if.wspawn.valid) begin
use_wspawn <= warp_ctl_if.wspawn.wmask & (~`NUM_WARPS'(1));
wspawn_pc <= warp_ctl_if.wspawn.pc;
end
if (warp_ctl_if.valid && warp_ctl_if.barrier.valid) begin
stalled_warps[warp_ctl_if.wid] <= 0;
if (reached_barrier_limit) begin
barrier_masks[warp_ctl_if.barrier.id] <= 0;
end else begin
barrier_masks[warp_ctl_if.barrier.id][warp_ctl_if.wid] <= 1;
end
end
if (warp_ctl_if.valid && warp_ctl_if.tmc.valid) begin
thread_masks[warp_ctl_if.wid] <= warp_ctl_if.tmc.tmask;
stalled_warps[warp_ctl_if.wid] <= 0;
end
if (warp_ctl_if.valid && warp_ctl_if.split.valid) begin
stalled_warps[warp_ctl_if.wid] <= 0;
if (warp_ctl_if.split.diverged) begin
thread_masks[warp_ctl_if.wid] <= warp_ctl_if.split.then_tmask;
end
end
// Branch
if (branch_ctl_if.valid) begin
if (branch_ctl_if.taken) begin
warp_pcs[branch_ctl_if.wid] <= branch_ctl_if.dest;
end
stalled_warps[branch_ctl_if.wid] <= 0;
end
if (warp_scheduled) begin
// stall the warp until decode stage
stalled_warps[schedule_wid] <= 1;
// release wspawn
use_wspawn[schedule_wid] <= 0;
if (use_wspawn[schedule_wid]) begin
thread_masks[schedule_wid] <= 1;
end
issued_instrs <= issued_instrs + 1;
end
if (ifetch_req_fire) begin
warp_pcs[ifetch_req_if.wid] <= ifetch_req_if.PC + 4;
end
if (wstall_if.valid) begin
stalled_warps[wstall_if.wid] <= wstall_if.stalled;
end
// join handling
if (join_if.valid) begin
if (join_else) begin
warp_pcs[join_if.wid] <= join_pc;
end
thread_masks[join_if.wid] <= join_tmask;
end
active_warps <= active_warps_n;
end
end
// export thread mask register
assign fetch_to_csr_if.thread_masks = thread_masks;
// calculate active barrier status
`IGNORE_UNUSED_BEGIN
wire [`NW_BITS:0] active_barrier_count;
`IGNORE_UNUSED_END
wire [`NUM_WARPS-1:0] barrier_mask = barrier_masks[warp_ctl_if.barrier.id];
`POP_COUNT(active_barrier_count, barrier_mask);
assign reached_barrier_limit = (active_barrier_count[`NW_BITS-1:0] == warp_ctl_if.barrier.size_m1);
reg [`NUM_WARPS-1:0] barrier_stalls;
always @(*) begin
barrier_stalls = barrier_masks[0];
for (integer i = 1; i < `NUM_BARRIERS; ++i) begin
barrier_stalls |= barrier_masks[i];
end
end
// split/join stack management
wire [(32+`NUM_THREADS)-1:0] ipdom_data [`NUM_WARPS-1:0];
wire ipdom_index [`NUM_WARPS-1:0];
for (genvar i = 0; i < `NUM_WARPS; i++) begin
wire push = warp_ctl_if.valid
&& warp_ctl_if.split.valid
&& (i == warp_ctl_if.wid);
wire pop = join_if.valid && (i == join_if.wid);
wire [`NUM_THREADS-1:0] else_tmask = warp_ctl_if.split.else_tmask;
wire [`NUM_THREADS-1:0] orig_tmask = thread_masks[warp_ctl_if.wid];
wire [(32+`NUM_THREADS)-1:0] q_else = {warp_ctl_if.split.pc, else_tmask};
wire [(32+`NUM_THREADS)-1:0] q_end = {32'b0, orig_tmask};
VX_ipdom_stack #(
.WIDTH (32+`NUM_THREADS),
.DEPTH (2 ** (`NT_BITS+1))
) ipdom_stack (
.clk (clk),
.reset (reset),
.push (push),
.pop (pop),
.pair (warp_ctl_if.split.diverged),
.q1 (q_end),
.q2 (q_else),
.d (ipdom_data[i]),
.index (ipdom_index[i]),
`UNUSED_PIN (empty),
`UNUSED_PIN (full)
);
end
assign {join_pc, join_tmask} = ipdom_data[join_if.wid];
assign join_else = ~ipdom_index[join_if.wid];
// schedule the next ready warp
wire [`NUM_WARPS-1:0] ready_warps = active_warps & ~(stalled_warps | barrier_stalls);
VX_lzc #(
.N (`NUM_WARPS)
) wid_select (
.in_i (ready_warps),
.cnt_o (schedule_wid),
.valid_o (schedule_valid)
);
wire [`NUM_WARPS-1:0][(`NUM_THREADS + 32)-1:0] schedule_data;
for (genvar i = 0; i < `NUM_WARPS; ++i) begin
assign schedule_data[i] = {(use_wspawn[i] ? `NUM_THREADS'(1) : thread_masks[i]),
(use_wspawn[i] ? wspawn_pc : warp_pcs[i])};
end
assign {schedule_tmask, schedule_pc} = schedule_data[schedule_wid];
wire stall_out = ~ifetch_req_if.ready && ifetch_req_if.valid;
assign warp_scheduled = schedule_valid && ~stall_out;
wire [`UUID_BITS-1:0] instr_uuid = (issued_instrs * `NUM_CORES * `NUM_CLUSTERS) + `UUID_BITS'(CORE_ID);
VX_pipe_register #(
.DATAW (1 + `UUID_BITS + `NUM_THREADS + 32 + `NW_BITS),
.RESETW (1)
) pipe_reg (
.clk (clk),
.reset (reset),
.enable (!stall_out),
.data_in ({schedule_valid, instr_uuid, schedule_tmask, schedule_pc, schedule_wid}),
.data_out ({ifetch_req_if.valid, ifetch_req_if.uuid, ifetch_req_if.tmask, ifetch_req_if.PC, ifetch_req_if.wid})
);
assign busy = (active_warps != 0);
`SCOPE_ASSIGN (wsched_scheduled, warp_scheduled);
`SCOPE_ASSIGN (wsched_schedule_uuid, instr_uuid);
`SCOPE_ASSIGN (wsched_active_warps, active_warps);
`SCOPE_ASSIGN (wsched_stalled_warps, stalled_warps);
`SCOPE_ASSIGN (wsched_schedule_wid, schedule_wid);
`SCOPE_ASSIGN (wsched_schedule_tmask, schedule_tmask);
`SCOPE_ASSIGN (wsched_schedule_pc, schedule_pc);
endmodule
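
The warp selection and barrier bookkeeping in the removed `VX_warp_sched` can be modeled with bitmasks. This Python sketch is illustrative only and uses hypothetical function names. It assumes `VX_lzc` returns the index of the lowest set bit, which is not visible in this diff, and it ignores the truncation of the pop count to `NW_BITS` bits.

```python
def ready_warps(active, stalled, barrier_masks):
    """Warps that are active and neither stalled nor waiting on a barrier."""
    barrier_stalls = 0
    for mask in barrier_masks:
        barrier_stalls |= mask
    return active & ~(stalled | barrier_stalls)


def select_warp(ready):
    """Return (valid, wid); assumes the lowest-numbered ready warp is chosen."""
    if ready == 0:
        return (False, 0)
    return (True, (ready & -ready).bit_length() - 1)


def barrier_arrive(mask, wid, size_m1):
    """Update one barrier mask when warp `wid` arrives.

    If `size_m1` warps are already waiting, the arriving warp completes the
    barrier and the mask is cleared; otherwise the warp is recorded as waiting.
    """
    if bin(mask).count("1") == size_m1:
        return 0
    return mask | (1 << wid)
```

For a two-warp barrier (`size_m1` equal to 1), the first arrival sets its bit and stalls, and the second arrival clears the mask, which releases both warps.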


@@ -1,113 +0,0 @@
`include "VX_define.vh"
module VX_writeback #(
parameter CORE_ID = 0
) (
input wire clk,
input wire reset,
// inputs
VX_commit_if.slave alu_commit_if,
VX_commit_if.slave ld_commit_if,
VX_commit_if.slave csr_commit_if,
`ifdef EXT_F_ENABLE
VX_commit_if.slave fpu_commit_if,
`endif
VX_commit_if.slave gpu_commit_if,
// outputs
VX_writeback_if.master writeback_if
);
`UNUSED_PARAM (CORE_ID)
localparam DATAW = `NW_BITS + 32 + `NUM_THREADS + `NR_BITS + (`NUM_THREADS * 32) + 1;
`ifdef EXT_F_ENABLE
localparam NUM_RSPS = 5;
`else
localparam NUM_RSPS = 4;
`endif
wire wb_valid;
wire [`NW_BITS-1:0] wb_wid;
wire [31:0] wb_PC;
wire [`NUM_THREADS-1:0] wb_tmask;
wire [`NR_BITS-1:0] wb_rd;
wire [`NUM_THREADS-1:0][31:0] wb_data;
wire wb_eop;
wire [NUM_RSPS-1:0] rsp_valid;
wire [NUM_RSPS-1:0][DATAW-1:0] rsp_data;
wire [NUM_RSPS-1:0] rsp_ready;
wire stall;
assign rsp_valid = {
gpu_commit_if.valid && gpu_commit_if.wb,
csr_commit_if.valid && csr_commit_if.wb,
alu_commit_if.valid && alu_commit_if.wb,
`ifdef EXT_F_ENABLE
fpu_commit_if.valid && fpu_commit_if.wb,
`endif
ld_commit_if.valid && ld_commit_if.wb
};
assign rsp_data = {
{gpu_commit_if.wid, gpu_commit_if.PC, gpu_commit_if.tmask, gpu_commit_if.rd, gpu_commit_if.data, gpu_commit_if.eop},
{csr_commit_if.wid, csr_commit_if.PC, csr_commit_if.tmask, csr_commit_if.rd, csr_commit_if.data, csr_commit_if.eop},
{alu_commit_if.wid, alu_commit_if.PC, alu_commit_if.tmask, alu_commit_if.rd, alu_commit_if.data, alu_commit_if.eop},
`ifdef EXT_F_ENABLE
{fpu_commit_if.wid, fpu_commit_if.PC, fpu_commit_if.tmask, fpu_commit_if.rd, fpu_commit_if.data, fpu_commit_if.eop},
`endif
{ ld_commit_if.wid, ld_commit_if.PC, ld_commit_if.tmask, ld_commit_if.rd, ld_commit_if.data, ld_commit_if.eop}
};
VX_stream_arbiter #(
.NUM_REQS (NUM_RSPS),
.DATAW (DATAW),
.BUFFERED (1),
.TYPE ("R")
) rsp_arb (
.clk (clk),
.reset (reset),
.valid_in (rsp_valid),
.data_in (rsp_data),
.ready_in (rsp_ready),
.valid_out (wb_valid),
.data_out ({wb_wid, wb_PC, wb_tmask, wb_rd, wb_data, wb_eop}),
.ready_out (~stall)
);
assign ld_commit_if.ready = rsp_ready[0] || ~ld_commit_if.wb;
`ifdef EXT_F_ENABLE
assign fpu_commit_if.ready = rsp_ready[1] || ~fpu_commit_if.wb;
assign alu_commit_if.ready = rsp_ready[2] || ~alu_commit_if.wb;
assign csr_commit_if.ready = rsp_ready[3] || ~csr_commit_if.wb;
assign gpu_commit_if.ready = rsp_ready[4] || ~gpu_commit_if.wb;
`else
assign alu_commit_if.ready = rsp_ready[1] || ~alu_commit_if.wb;
assign csr_commit_if.ready = rsp_ready[2] || ~csr_commit_if.wb;
assign gpu_commit_if.ready = rsp_ready[3] || ~gpu_commit_if.wb;
`endif
assign stall = ~writeback_if.ready && writeback_if.valid;
VX_pipe_register #(
.DATAW (1 + DATAW),
.RESETW (1)
) pipe_reg (
.clk (clk),
.reset (reset),
.enable (~stall),
.data_in ({wb_valid, wb_wid, wb_PC, wb_tmask, wb_rd, wb_data, wb_eop}),
.data_out ({writeback_if.valid, writeback_if.wid, writeback_if.PC, writeback_if.tmask, writeback_if.rd, writeback_if.data, writeback_if.eop})
);
// special workaround to get RISC-V tests Pass/Fail status
reg [31:0] last_wb_value [`NUM_REGS-1:0] /* verilator public */;
always @(posedge clk) begin
if (writeback_if.valid && writeback_if.ready) begin
last_wb_value[writeback_if.rd] <= writeback_if.data[0];
end
end
endmodule
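
Because `rsp_valid` and `rsp_data` are built by concatenation, the last entry listed lands at index 0 of the arbiter inputs. The Python sketch below spells out the resulting slot assignment with and without `EXT_F_ENABLE`, matching the `rsp_ready[...]` indices used in the ready assignments. The function names are hypothetical.

```python
def arbiter_slots(ext_f_enabled):
    """Map each commit interface to its index in the arbiter input vector."""
    order = ["ld"]
    if ext_f_enabled:
        order.append("fpu")
    order += ["alu", "csr", "gpu"]
    return {name: idx for idx, name in enumerate(order)}


def commit_ready(rsp_ready_bit, wb):
    """A commit without writeback never waits on the arbiter."""
    return rsp_ready_bit or not wb
```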


@@ -1,7 +1,20 @@
// Copyright © 2019-2023
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
`include "VX_define.vh"
module Vortex (
`SCOPE_IO_Vortex
module Vortex import VX_gpu_pkg::*; (
`SCOPE_IO_DECL
// Clock
input wire clk,
@@ -9,217 +22,194 @@ module Vortex (
// Memory request
output wire mem_req_valid,
output wire mem_req_rw,
output wire [`VX_MEM_BYTEEN_WIDTH-1:0] mem_req_byteen,
output wire mem_req_rw,
output wire [`VX_MEM_BYTEEN_WIDTH-1:0] mem_req_byteen,
output wire [`VX_MEM_ADDR_WIDTH-1:0] mem_req_addr,
output wire [`VX_MEM_DATA_WIDTH-1:0] mem_req_data,
output wire [`VX_MEM_TAG_WIDTH-1:0] mem_req_tag,
input wire mem_req_ready,
// Memory response
input wire mem_rsp_valid,
input wire mem_rsp_valid,
input wire [`VX_MEM_DATA_WIDTH-1:0] mem_rsp_data,
input wire [`VX_MEM_TAG_WIDTH-1:0] mem_rsp_tag,
output wire mem_rsp_ready,
// DCR write request
input wire dcr_wr_valid,
input wire [`VX_DCR_ADDR_WIDTH-1:0] dcr_wr_addr,
input wire [`VX_DCR_DATA_WIDTH-1:0] dcr_wr_data,
// Status
output wire busy
);
`STATIC_ASSERT((`L3_ENABLE == 0 || `NUM_CLUSTERS > 1), ("invalid parameter"))
wire [`NUM_CLUSTERS-1:0] per_cluster_mem_req_valid;
wire [`NUM_CLUSTERS-1:0] per_cluster_mem_req_rw;
wire [`NUM_CLUSTERS-1:0][`L2_MEM_BYTEEN_WIDTH-1:0] per_cluster_mem_req_byteen;
wire [`NUM_CLUSTERS-1:0][`L2_MEM_ADDR_WIDTH-1:0] per_cluster_mem_req_addr;
wire [`NUM_CLUSTERS-1:0][`L2_MEM_DATA_WIDTH-1:0] per_cluster_mem_req_data;
wire [`NUM_CLUSTERS-1:0][`L2_MEM_TAG_WIDTH-1:0] per_cluster_mem_req_tag;
wire [`NUM_CLUSTERS-1:0] per_cluster_mem_req_ready;
`ifdef PERF_ENABLE
VX_mem_perf_if mem_perf_if();
assign mem_perf_if.icache = 'x;
assign mem_perf_if.dcache = 'x;
assign mem_perf_if.l2cache = 'x;
`endif
wire [`NUM_CLUSTERS-1:0] per_cluster_mem_rsp_valid;
wire [`NUM_CLUSTERS-1:0][`L2_MEM_DATA_WIDTH-1:0] per_cluster_mem_rsp_data;
wire [`NUM_CLUSTERS-1:0][`L2_MEM_TAG_WIDTH-1:0] per_cluster_mem_rsp_tag;
wire [`NUM_CLUSTERS-1:0] per_cluster_mem_rsp_ready;
VX_mem_bus_if #(
.DATA_SIZE (`L2_LINE_SIZE),
.TAG_WIDTH (L2_MEM_TAG_WIDTH)
) per_cluster_mem_bus_if[`NUM_CLUSTERS]();
wire [`NUM_CLUSTERS-1:0] per_cluster_busy;
VX_mem_bus_if #(
.DATA_SIZE (`L3_LINE_SIZE),
.TAG_WIDTH (L3_MEM_TAG_WIDTH)
) mem_bus_if();
for (genvar i = 0; i < `NUM_CLUSTERS; i++) begin
`RESET_RELAY (l3_reset, reset);
`RESET_RELAY (cluster_reset);
VX_cache_wrap #(
.INSTANCE_ID ("l3cache"),
.CACHE_SIZE (`L3_CACHE_SIZE),
.LINE_SIZE (`L3_LINE_SIZE),
.NUM_BANKS (`L3_NUM_BANKS),
.NUM_WAYS (`L3_NUM_WAYS),
.WORD_SIZE (L3_WORD_SIZE),
.NUM_REQS (L3_NUM_REQS),
.CRSQ_SIZE (`L3_CRSQ_SIZE),
.MSHR_SIZE (`L3_MSHR_SIZE),
.MRSQ_SIZE (`L3_MRSQ_SIZE),
.MREQ_SIZE (`L3_MREQ_SIZE),
.TAG_WIDTH (L2_MEM_TAG_WIDTH),
.WRITE_ENABLE (1),
.UUID_WIDTH (`UUID_WIDTH),
.CORE_OUT_REG (2),
.MEM_OUT_REG (2),
.NC_ENABLE (1),
.PASSTHRU (!`L3_ENABLED)
) l3cache (
.clk (clk),
.reset (l3_reset),
VX_cluster #(
.CLUSTER_ID(i)
) cluster (
`SCOPE_BIND_Vortex_cluster(i)
.clk (clk),
.reset (cluster_reset),
.mem_req_valid (per_cluster_mem_req_valid [i]),
.mem_req_rw (per_cluster_mem_req_rw [i]),
.mem_req_byteen (per_cluster_mem_req_byteen[i]),
.mem_req_addr (per_cluster_mem_req_addr [i]),
.mem_req_data (per_cluster_mem_req_data [i]),
.mem_req_tag (per_cluster_mem_req_tag [i]),
.mem_req_ready (per_cluster_mem_req_ready [i]),
.mem_rsp_valid (per_cluster_mem_rsp_valid [i]),
.mem_rsp_data (per_cluster_mem_rsp_data [i]),
.mem_rsp_tag (per_cluster_mem_rsp_tag [i]),
.mem_rsp_ready (per_cluster_mem_rsp_ready [i]),
.busy (per_cluster_busy [i])
);
end
assign busy = (| per_cluster_busy);
if (`L3_ENABLE) begin
`ifdef PERF_ENABLE
VX_perf_cache_if perf_l3cache_if();
.cache_perf (mem_perf_if.l3cache),
`endif
`RESET_RELAY (l3_reset);
.core_bus_if (per_cluster_mem_bus_if),
.mem_bus_if (mem_bus_if)
);
assign mem_req_valid = mem_bus_if.req_valid;
assign mem_req_rw = mem_bus_if.req_data.rw;
assign mem_req_byteen= mem_bus_if.req_data.byteen;
assign mem_req_addr = mem_bus_if.req_data.addr;
assign mem_req_data = mem_bus_if.req_data.data;
assign mem_req_tag = mem_bus_if.req_data.tag;
assign mem_bus_if.req_ready = mem_req_ready;
assign mem_bus_if.rsp_valid = mem_rsp_valid;
assign mem_bus_if.rsp_data.data = mem_rsp_data;
assign mem_bus_if.rsp_data.tag = mem_rsp_tag;
assign mem_rsp_ready = mem_bus_if.rsp_ready;
wire mem_req_fire = mem_req_valid && mem_req_ready;
wire mem_rsp_fire = mem_rsp_valid && mem_rsp_ready;
`UNUSED_VAR (mem_req_fire)
`UNUSED_VAR (mem_rsp_fire)
wire sim_ebreak /* verilator public */;
wire [`NUM_REGS-1:0][`XLEN-1:0] sim_wb_value /* verilator public */;
wire [`NUM_CLUSTERS-1:0] per_cluster_sim_ebreak;
wire [`NUM_CLUSTERS-1:0][`NUM_REGS-1:0][`XLEN-1:0] per_cluster_sim_wb_value;
assign sim_ebreak = per_cluster_sim_ebreak[0];
assign sim_wb_value = per_cluster_sim_wb_value[0];
`UNUSED_VAR (per_cluster_sim_ebreak)
`UNUSED_VAR (per_cluster_sim_wb_value)
VX_dcr_bus_if dcr_bus_if();
assign dcr_bus_if.write_valid = dcr_wr_valid;
assign dcr_bus_if.write_addr = dcr_wr_addr;
assign dcr_bus_if.write_data = dcr_wr_data;
wire [`NUM_CLUSTERS-1:0] per_cluster_busy;
`SCOPE_IO_SWITCH (`NUM_CLUSTERS)
// Generate all clusters
for (genvar i = 0; i < `NUM_CLUSTERS; ++i) begin
`RESET_RELAY (cluster_reset, reset);
`BUFFER_DCR_BUS_IF (cluster_dcr_bus_if, dcr_bus_if, (`NUM_CLUSTERS > 1));
VX_cluster #(
.CLUSTER_ID (i)
) cluster (
`SCOPE_IO_BIND (i)
VX_cache #(
.CACHE_ID (`L3_CACHE_ID),
.CACHE_SIZE (`L3_CACHE_SIZE),
.CACHE_LINE_SIZE (`L3_CACHE_LINE_SIZE),
.NUM_BANKS (`L3_NUM_BANKS),
.NUM_PORTS (`L3_NUM_PORTS),
.WORD_SIZE (`L3_WORD_SIZE),
.NUM_REQS (`L3_NUM_REQS),
.CREQ_SIZE (`L3_CREQ_SIZE),
.CRSQ_SIZE (`L3_CRSQ_SIZE),
.MSHR_SIZE (`L3_MSHR_SIZE),
.MRSQ_SIZE (`L3_MRSQ_SIZE),
.MREQ_SIZE (`L3_MREQ_SIZE),
.WRITE_ENABLE (1),
.CORE_TAG_WIDTH (`L2_MEM_TAG_WIDTH),
.CORE_TAG_ID_BITS (0),
.MEM_TAG_WIDTH (`L3_MEM_TAG_WIDTH),
.NC_ENABLE (1)
) l3cache (
`SCOPE_BIND_Vortex_l3cache
.clk (clk),
.reset (l3_reset),
.reset (cluster_reset),
`ifdef PERF_ENABLE
.perf_cache_if (perf_l3cache_if),
.mem_perf_if (mem_perf_if),
`endif
// Core request
.core_req_valid (per_cluster_mem_req_valid),
.core_req_rw (per_cluster_mem_req_rw),
.core_req_byteen (per_cluster_mem_req_byteen),
.core_req_addr (per_cluster_mem_req_addr),
.core_req_data (per_cluster_mem_req_data),
.core_req_tag (per_cluster_mem_req_tag),
.core_req_ready (per_cluster_mem_req_ready),
// Core response
.core_rsp_valid (per_cluster_mem_rsp_valid),
.core_rsp_data (per_cluster_mem_rsp_data),
.core_rsp_tag (per_cluster_mem_rsp_tag),
.core_rsp_ready (per_cluster_mem_rsp_ready),
`UNUSED_PIN (core_rsp_tmask),
// Memory request
.mem_req_valid (mem_req_valid),
.mem_req_rw (mem_req_rw),
.mem_req_byteen (mem_req_byteen),
.mem_req_addr (mem_req_addr),
.mem_req_data (mem_req_data),
.mem_req_tag (mem_req_tag),
.mem_req_ready (mem_req_ready),
// Memory response
.mem_rsp_valid (mem_rsp_valid),
.mem_rsp_data (mem_rsp_data),
.mem_rsp_tag (mem_rsp_tag),
.mem_rsp_ready (mem_rsp_ready)
);
end else begin
`RESET_RELAY (mem_arb_reset);
VX_mem_arb #(
.NUM_REQS (`NUM_CLUSTERS),
.DATA_WIDTH (`L3_MEM_DATA_WIDTH),
.ADDR_WIDTH (`L3_MEM_ADDR_WIDTH),
.TAG_IN_WIDTH (`L2_MEM_TAG_WIDTH),
.TYPE ("R"),
.BUFFERED_REQ (1),
.BUFFERED_RSP (1)
) mem_arb (
.clk (clk),
.reset (mem_arb_reset),
// Core request
.req_valid_in (per_cluster_mem_req_valid),
.req_rw_in (per_cluster_mem_req_rw),
.req_byteen_in (per_cluster_mem_req_byteen),
.req_addr_in (per_cluster_mem_req_addr),
.req_data_in (per_cluster_mem_req_data),
.req_tag_in (per_cluster_mem_req_tag),
.req_ready_in (per_cluster_mem_req_ready),
// Memory request
.req_valid_out (mem_req_valid),
.req_rw_out (mem_req_rw),
.req_byteen_out (mem_req_byteen),
.req_addr_out (mem_req_addr),
.req_data_out (mem_req_data),
.req_tag_out (mem_req_tag),
.req_ready_out (mem_req_ready),
// Core response
.rsp_valid_out (per_cluster_mem_rsp_valid),
.rsp_data_out (per_cluster_mem_rsp_data),
.rsp_tag_out (per_cluster_mem_rsp_tag),
.rsp_ready_out (per_cluster_mem_rsp_ready),
// Memory response
.rsp_valid_in (mem_rsp_valid),
.rsp_tag_in (mem_rsp_tag),
.rsp_data_in (mem_rsp_data),
.rsp_ready_in (mem_rsp_ready)
);
.dcr_bus_if (cluster_dcr_bus_if),
.mem_bus_if (per_cluster_mem_bus_if[i]),
.sim_ebreak (per_cluster_sim_ebreak[i]),
.sim_wb_value (per_cluster_sim_wb_value[i]),
.busy (per_cluster_busy[i])
);
end
`SCOPE_ASSIGN (reset, reset);
`SCOPE_ASSIGN (mem_req_fire, mem_req_valid && mem_req_ready);
`SCOPE_ASSIGN (mem_req_addr, `TO_FULL_ADDR(mem_req_addr));
`SCOPE_ASSIGN (mem_req_rw, mem_req_rw);
`SCOPE_ASSIGN (mem_req_byteen, mem_req_byteen);
`SCOPE_ASSIGN (mem_req_data, mem_req_data);
`SCOPE_ASSIGN (mem_req_tag, mem_req_tag);
`SCOPE_ASSIGN (mem_rsp_fire, mem_rsp_valid && mem_rsp_ready);
`SCOPE_ASSIGN (mem_rsp_data, mem_rsp_data);
`SCOPE_ASSIGN (mem_rsp_tag, mem_rsp_tag);
`SCOPE_ASSIGN (busy, busy);
`BUFFER_EX(busy, (| per_cluster_busy), 1'b1, (`NUM_CLUSTERS > 1));
`ifdef PERF_ENABLE
reg [`PERF_CTR_BITS-1:0] perf_mem_pending_reads;
mem_perf_t mem_perf;
always @(posedge clk) begin
if (reset) begin
perf_mem_pending_reads <= '0;
end else begin
perf_mem_pending_reads <= $signed(perf_mem_pending_reads) +
`PERF_CTR_BITS'($signed(2'(mem_req_fire && ~mem_bus_if.req_data.rw) - 2'(mem_rsp_fire)));
end
end
wire mem_rd_req_fire = mem_req_fire && ~mem_bus_if.req_data.rw;
wire mem_wr_req_fire = mem_req_fire && mem_bus_if.req_data.rw;
always @(posedge clk) begin
if (reset) begin
mem_perf <= '0;
end else begin
mem_perf.reads <= mem_perf.reads + `PERF_CTR_BITS'(mem_rd_req_fire);
mem_perf.writes <= mem_perf.writes + `PERF_CTR_BITS'(mem_wr_req_fire);
mem_perf.latency <= mem_perf.latency + perf_mem_pending_reads;
end
end
assign mem_perf_if.mem = mem_perf;
`endif
`ifdef DBG_TRACE_CORE_MEM
always @(posedge clk) begin
if (mem_req_valid && mem_req_ready) begin
if (mem_req_fire) begin
if (mem_req_rw)
dpi_trace("%d: MEM Wr Req: addr=%0h, tag=%0h, byteen=%0h data=%0h\n", $time, `TO_FULL_ADDR(mem_req_addr), mem_req_tag, mem_req_byteen, mem_req_data);
`TRACE(1, ("%d: MEM Wr Req: addr=0x%0h, tag=0x%0h, byteen=0x%0h data=0x%0h\n", $time, `TO_FULL_ADDR(mem_req_addr), mem_req_tag, mem_req_byteen, mem_req_data));
else
dpi_trace("%d: MEM Rd Req: addr=%0h, tag=%0h, byteen=%0h\n", $time, `TO_FULL_ADDR(mem_req_addr), mem_req_tag, mem_req_byteen);
`TRACE(1, ("%d: MEM Rd Req: addr=0x%0h, tag=0x%0h, byteen=0x%0h\n", $time, `TO_FULL_ADDR(mem_req_addr), mem_req_tag, mem_req_byteen));
end
if (mem_rsp_valid && mem_rsp_ready) begin
dpi_trace("%d: MEM Rsp: tag=%0h, data=%0h\n", $time, mem_rsp_tag, mem_rsp_data);
if (mem_rsp_fire) begin
`TRACE(1, ("%d: MEM Rsp: tag=0x%0h, data=0x%0h\n", $time, mem_rsp_tag, mem_rsp_data));
end
end
`endif
`ifndef NDEBUG
`ifdef SIMULATION
always @(posedge clk) begin
$fflush(); // flush stdout buffer
end
`endif
endmodule
endmodule

@@ -1,65 +1,91 @@
// Copyright © 2019-2023
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
`include "VX_define.vh"
module Vortex_axi #(
parameter AXI_DATA_WIDTH = `VX_MEM_DATA_WIDTH,
parameter AXI_ADDR_WIDTH = 32,
parameter AXI_TID_WIDTH = `VX_MEM_TAG_WIDTH,
parameter AXI_STROBE_WIDTH = (`VX_MEM_DATA_WIDTH / 8)
module Vortex_axi import VX_gpu_pkg::*; #(
parameter AXI_DATA_WIDTH = `VX_MEM_DATA_WIDTH,
parameter AXI_ADDR_WIDTH = `XLEN,
parameter AXI_TID_WIDTH = `VX_MEM_TAG_WIDTH,
parameter AXI_NUM_BANKS = 1
)(
`SCOPE_IO_DECL
// Clock
input wire clk,
input wire reset,
// AXI write request address channel
output wire [AXI_TID_WIDTH-1:0] m_axi_awid,
output wire [AXI_ADDR_WIDTH-1:0] m_axi_awaddr,
output wire [7:0] m_axi_awlen,
output wire [2:0] m_axi_awsize,
output wire [1:0] m_axi_awburst,
output wire m_axi_awlock,
output wire [3:0] m_axi_awcache,
output wire [2:0] m_axi_awprot,
output wire [3:0] m_axi_awqos,
output wire m_axi_awvalid,
input wire m_axi_awready,
output wire m_axi_awvalid [AXI_NUM_BANKS],
input wire m_axi_awready [AXI_NUM_BANKS],
output wire [AXI_ADDR_WIDTH-1:0] m_axi_awaddr [AXI_NUM_BANKS],
output wire [AXI_TID_WIDTH-1:0] m_axi_awid [AXI_NUM_BANKS],
output wire [7:0] m_axi_awlen [AXI_NUM_BANKS],
output wire [2:0] m_axi_awsize [AXI_NUM_BANKS],
output wire [1:0] m_axi_awburst [AXI_NUM_BANKS],
output wire [1:0] m_axi_awlock [AXI_NUM_BANKS],
output wire [3:0] m_axi_awcache [AXI_NUM_BANKS],
output wire [2:0] m_axi_awprot [AXI_NUM_BANKS],
output wire [3:0] m_axi_awqos [AXI_NUM_BANKS],
output wire [3:0] m_axi_awregion [AXI_NUM_BANKS],
// AXI write request data channel
output wire [AXI_DATA_WIDTH-1:0] m_axi_wdata,
output wire [AXI_STROBE_WIDTH-1:0] m_axi_wstrb,
output wire m_axi_wlast,
output wire m_axi_wvalid,
input wire m_axi_wready,
output wire m_axi_wvalid [AXI_NUM_BANKS],
input wire m_axi_wready [AXI_NUM_BANKS],
output wire [AXI_DATA_WIDTH-1:0] m_axi_wdata [AXI_NUM_BANKS],
output wire [AXI_DATA_WIDTH/8-1:0] m_axi_wstrb [AXI_NUM_BANKS],
output wire m_axi_wlast [AXI_NUM_BANKS],
// AXI write response channel
input wire [AXI_TID_WIDTH-1:0] m_axi_bid,
input wire [1:0] m_axi_bresp,
input wire m_axi_bvalid,
output wire m_axi_bready,
input wire m_axi_bvalid [AXI_NUM_BANKS],
output wire m_axi_bready [AXI_NUM_BANKS],
input wire [AXI_TID_WIDTH-1:0] m_axi_bid [AXI_NUM_BANKS],
input wire [1:0] m_axi_bresp [AXI_NUM_BANKS],
// AXI read request channel
output wire [AXI_TID_WIDTH-1:0] m_axi_arid,
output wire [AXI_ADDR_WIDTH-1:0] m_axi_araddr,
output wire [7:0] m_axi_arlen,
output wire [2:0] m_axi_arsize,
output wire [1:0] m_axi_arburst,
output wire m_axi_arlock,
output wire [3:0] m_axi_arcache,
output wire [2:0] m_axi_arprot,
output wire [3:0] m_axi_arqos,
output wire m_axi_arvalid,
input wire m_axi_arready,
output wire m_axi_arvalid [AXI_NUM_BANKS],
input wire m_axi_arready [AXI_NUM_BANKS],
output wire [AXI_ADDR_WIDTH-1:0] m_axi_araddr [AXI_NUM_BANKS],
output wire [AXI_TID_WIDTH-1:0] m_axi_arid [AXI_NUM_BANKS],
output wire [7:0] m_axi_arlen [AXI_NUM_BANKS],
output wire [2:0] m_axi_arsize [AXI_NUM_BANKS],
output wire [1:0] m_axi_arburst [AXI_NUM_BANKS],
output wire [1:0] m_axi_arlock [AXI_NUM_BANKS],
output wire [3:0] m_axi_arcache [AXI_NUM_BANKS],
output wire [2:0] m_axi_arprot [AXI_NUM_BANKS],
output wire [3:0] m_axi_arqos [AXI_NUM_BANKS],
output wire [3:0] m_axi_arregion [AXI_NUM_BANKS],
// AXI read response channel
input wire [AXI_TID_WIDTH-1:0] m_axi_rid,
input wire [AXI_DATA_WIDTH-1:0] m_axi_rdata,
input wire [1:0] m_axi_rresp,
input wire m_axi_rlast,
input wire m_axi_rvalid,
output wire m_axi_rready,
input wire m_axi_rvalid [AXI_NUM_BANKS],
output wire m_axi_rready [AXI_NUM_BANKS],
input wire [AXI_DATA_WIDTH-1:0] m_axi_rdata [AXI_NUM_BANKS],
input wire m_axi_rlast [AXI_NUM_BANKS],
input wire [AXI_TID_WIDTH-1:0] m_axi_rid [AXI_NUM_BANKS],
input wire [1:0] m_axi_rresp [AXI_NUM_BANKS],
// DCR write request
input wire dcr_wr_valid,
input wire [`VX_DCR_ADDR_WIDTH-1:0] dcr_wr_addr,
input wire [`VX_DCR_DATA_WIDTH-1:0] dcr_wr_data,
// Status
output wire busy
);
`STATIC_ASSERT((AXI_DATA_WIDTH == `VX_MEM_DATA_WIDTH), ("invalid memory data size: current=%0d, expected=%0d", AXI_DATA_WIDTH, `VX_MEM_DATA_WIDTH))
`STATIC_ASSERT((AXI_ADDR_WIDTH >= `XLEN), ("invalid memory address size: current=%0d, expected=%0d", AXI_ADDR_WIDTH, `XLEN))
//`STATIC_ASSERT((AXI_TID_WIDTH >= `VX_MEM_TAG_WIDTH), ("invalid memory tag size: current=%0d, expected=%0d", AXI_TID_WIDTH, `VX_MEM_TAG_WIDTH))
wire mem_req_valid;
wire mem_req_rw;
wire [`VX_MEM_BYTEEN_WIDTH-1:0] mem_req_byteen;
@@ -72,16 +98,33 @@ module Vortex_axi #(
wire [`VX_MEM_DATA_WIDTH-1:0] mem_rsp_data;
wire [`VX_MEM_TAG_WIDTH-1:0] mem_rsp_tag;
wire mem_rsp_ready;
wire [`XLEN-1:0] m_axi_awaddr_unqual [AXI_NUM_BANKS];
wire [`XLEN-1:0] m_axi_araddr_unqual [AXI_NUM_BANKS];
wire [`VX_MEM_TAG_WIDTH-1:0] m_axi_awid_unqual [AXI_NUM_BANKS];
wire [`VX_MEM_TAG_WIDTH-1:0] m_axi_arid_unqual [AXI_NUM_BANKS];
wire [`VX_MEM_TAG_WIDTH-1:0] m_axi_bid_unqual [AXI_NUM_BANKS];
wire [`VX_MEM_TAG_WIDTH-1:0] m_axi_rid_unqual [AXI_NUM_BANKS];
for (genvar i = 0; i < AXI_NUM_BANKS; ++i) begin
assign m_axi_awaddr[i] = `XLEN'(m_axi_awaddr_unqual[i]);
assign m_axi_araddr[i] = `XLEN'(m_axi_araddr_unqual[i]);
assign m_axi_awid[i] = AXI_TID_WIDTH'(m_axi_awid_unqual[i]);
assign m_axi_arid[i] = AXI_TID_WIDTH'(m_axi_arid_unqual[i]);
assign m_axi_rid_unqual[i] = `VX_MEM_TAG_WIDTH'(m_axi_rid[i]);
assign m_axi_bid_unqual[i] = `VX_MEM_TAG_WIDTH'(m_axi_bid[i]);
end
VX_axi_adapter #(
.VX_DATA_WIDTH (`VX_MEM_DATA_WIDTH),
.VX_ADDR_WIDTH (`VX_MEM_ADDR_WIDTH),
.VX_TAG_WIDTH (`VX_MEM_TAG_WIDTH),
.VX_BYTEEN_WIDTH (AXI_STROBE_WIDTH),
.AXI_DATA_WIDTH (AXI_DATA_WIDTH),
.AXI_ADDR_WIDTH (AXI_ADDR_WIDTH),
.AXI_TID_WIDTH (AXI_TID_WIDTH),
.AXI_STROBE_WIDTH (AXI_STROBE_WIDTH)
.DATA_WIDTH (`VX_MEM_DATA_WIDTH),
.ADDR_WIDTH (`XLEN),
.TAG_WIDTH (`VX_MEM_TAG_WIDTH),
.NUM_BANKS (AXI_NUM_BANKS),
.OUT_REG_RSP((AXI_NUM_BANKS > 1) ? 2 : 0)
) axi_adapter (
.clk (clk),
.reset (reset),
@@ -98,9 +141,11 @@ module Vortex_axi #(
.mem_rsp_data (mem_rsp_data),
.mem_rsp_tag (mem_rsp_tag),
.mem_rsp_ready (mem_rsp_ready),
.m_axi_awid (m_axi_awid),
.m_axi_awaddr (m_axi_awaddr),
.m_axi_awvalid (m_axi_awvalid),
.m_axi_awready (m_axi_awready),
.m_axi_awaddr (m_axi_awaddr_unqual),
.m_axi_awid (m_axi_awid_unqual),
.m_axi_awlen (m_axi_awlen),
.m_axi_awsize (m_axi_awsize),
.m_axi_awburst (m_axi_awburst),
@@ -108,22 +153,23 @@ module Vortex_axi #(
.m_axi_awcache (m_axi_awcache),
.m_axi_awprot (m_axi_awprot),
.m_axi_awqos (m_axi_awqos),
.m_axi_awvalid (m_axi_awvalid),
.m_axi_awready (m_axi_awready),
.m_axi_awregion (m_axi_awregion),
.m_axi_wvalid (m_axi_wvalid),
.m_axi_wready (m_axi_wready),
.m_axi_wdata (m_axi_wdata),
.m_axi_wstrb (m_axi_wstrb),
.m_axi_wlast (m_axi_wlast),
.m_axi_wvalid (m_axi_wvalid),
.m_axi_wready (m_axi_wready),
.m_axi_bid (m_axi_bid),
.m_axi_bresp (m_axi_bresp),
.m_axi_bvalid (m_axi_bvalid),
.m_axi_bready (m_axi_bready),
.m_axi_bid (m_axi_bid_unqual),
.m_axi_bresp (m_axi_bresp),
.m_axi_arid (m_axi_arid),
.m_axi_araddr (m_axi_araddr),
.m_axi_arvalid (m_axi_arvalid),
.m_axi_arready (m_axi_arready),
.m_axi_araddr (m_axi_araddr_unqual),
.m_axi_arid (m_axi_arid_unqual),
.m_axi_arlen (m_axi_arlen),
.m_axi_arsize (m_axi_arsize),
.m_axi_arburst (m_axi_arburst),
@@ -131,18 +177,21 @@ module Vortex_axi #(
.m_axi_arcache (m_axi_arcache),
.m_axi_arprot (m_axi_arprot),
.m_axi_arqos (m_axi_arqos),
.m_axi_arvalid (m_axi_arvalid),
.m_axi_arready (m_axi_arready),
.m_axi_arregion (m_axi_arregion),
.m_axi_rid (m_axi_rid),
.m_axi_rdata (m_axi_rdata),
.m_axi_rresp (m_axi_rresp),
.m_axi_rlast (m_axi_rlast),
.m_axi_rvalid (m_axi_rvalid),
.m_axi_rready (m_axi_rready)
.m_axi_rready (m_axi_rready),
.m_axi_rdata (m_axi_rdata),
.m_axi_rlast (m_axi_rlast),
.m_axi_rid (m_axi_rid_unqual),
.m_axi_rresp (m_axi_rresp)
);
`SCOPE_IO_SWITCH (1)
Vortex vortex (
`SCOPE_IO_BIND (0)
.clk (clk),
.reset (reset),
@@ -159,7 +208,11 @@ module Vortex_axi #(
.mem_rsp_tag (mem_rsp_tag),
.mem_rsp_ready (mem_rsp_ready),
.dcr_wr_valid (dcr_wr_valid),
.dcr_wr_addr (dcr_wr_addr),
.dcr_wr_data (dcr_wr_data),
.busy (busy)
);
endmodule
endmodule

Some files were not shown because too many files have changed in this diff.