docs update

2021-10-19 16:32:31 -04:00
parent fe862f64b1
commit 956f3d1880
14 changed files with 19 additions and 15 deletions
--- a/docs/microarchitecture.md
+++ b/docs/microarchitecture.md
@@ -0,0 +1,94 @@
+# Vortex Microarchitecture
+
+### Vortex GPGPU Execution Model
+
+Vortex uses the SIMT (Single Instruction, Multiple Threads) execution model with a single warp issued per cycle.
+
+- **Threads**
+  - Smallest unit of computation
+  - Each thread has its own register file (32 int + 32 fp registers)
+  - Threads execute in parallel
+- **Warps**
+  - A logical cluster of threads
+  - All threads in a warp execute the same instruction
+    - The PC is shared; a thread mask is maintained for Writeback
+  - Warp execution is time-multiplexed in lockstep
+    - Ex. warp 0 executes at cycle 0, warp 1 executes at cycle 1
+
+### Vortex RISC-V ISA Extension
+
+- **Thread Mask Control**
+  - Control the number of threads to activate during execution
+  - `TMC` *count*: activate count threads
+- **Warp Scheduling**
+  - Control the number of warps to activate during execution
+  - `WSPAWN` *count, addr*: activate count warps and jump to addr location
+- **Control-Flow Divergence**
+  - Control which threads to activate when a branch diverges
+    - `SPLIT` *predicate*: apply 'taken' predicate thread mask and save 'not-taken' into IPDOM stack
+    - `JOIN`: restore 'not-taken' thread mask
+- **Warp Synchronization**
+  - `BAR` *id, count*: stall warps entering barrier *id* until count is reached
+
+### Vortex Pipeline/Datapath
+
+![Image of Vortex Microarchitecture](./assets/img/vortex_microarchitecture_v2.png)
+
+Vortex has a 5-stage pipeline: FI | ID | Issue | EX | WB.
+
+- **Fetch**
+  - Warp Scheduler
+    - Track stalled & active warps, resolve branches and barriers, maintain split/join IPDOM stack
+  - Instruction Cache
+    - Retrieve instruction from cache, issue I-cache requests/responses
+- **Decode**
+  - Decode fetched instructions, notify warp scheduler when the following instructions are decoded:
+    - Branch, tmc, split/join, wspawn
+  - Precompute used_regs mask (needed for Issue stage)
+- **Issue**
+  - Scheduling
+    - In-order issue (operands/execute unit ready), out-of-order commit
+  - IBuffer
+    - Store fetched instructions in separate per-warp queues; select the next warp through round-robin scheduling
+  - Scoreboard
+    - Track in-use registers
+  - GPRs (General-Purpose Registers) stage
+    - Fetch issued instruction operands and send operands to execute unit
+- **Execute**
+  - ALU Unit
+    - Single-cycle operations (+,-,>>,<<,&,|,^), Branch instructions (Share ALU resources)
+  - MULDIV Unit
+    - Multiplier - done in 2 cycles
+    - Divider - division and remainder, done in 32 cycles
+      - Implements a serial algorithm (stalls the pipeline)
+  - FPU Unit
+    - Multi-cycle operations, uses `FPnew` Library on ASIC, uses hard DSPs on FPGA
+  - CSR Unit
+    - Store constant status registers - device caps, FPU status flags, performance counters
+    - Handle external CSR requests (requests from host CPU)
+  - LSU Unit
+    - Handle load/store operations, issue D-cache requests, handle D-cache responses
+    - Commit load responses - saves storage, Scoreboard tracks completion
+  - GPGPU Unit
+    - Handle GPGPU instructions
+      - TMC, WSPAWN, SPLIT, BAR
+    - JOIN is handled by Warp Scheduler (upon SPLIT response)
+- **Commit**
+  - Commit
+    - Update CSR flags, update performance counters
+  - Writeback
+    - Write result back to GPRs, notify Scoreboard (release in-use register), select candidate instruction (ALU unit has highest priority)
+- **Clustering**
+  - Group multiple cores into clusters (optionally share L2 cache)
+  - Group multiple clusters (optionally share L3 cache)
+  - Configurable at build time
+  - Default configuration:
+    - #Clusters = 1
+    - #Cores = 4
+    - #Warps = 4
+    - #Threads = 4
+- **FPGA AFU Interface**
+  - Manage CPU-GPU communication
+    - Query device caps, load kernel instructions and resource buffers, start kernel execution, read destination buffers
+  - Local Memory - GPU access to local DRAM
+  - Reserved I/O addresses - redirect to host CPU, console output
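
As a rough illustration of the `TMC`, `SPLIT`, and `JOIN` behavior described in the ISA extension section, the sketch below models one warp's thread mask and IPDOM stack in Python. It is a toy model, not the Vortex RTL: the `Warp` class and its method names are hypothetical, thread masks are plain integers with one bit per thread, and two details are assumptions not stated in the doc: `TMC` is assumed to activate the lowest-numbered threads, and `SPLIT` is assumed to also save the pre-split mask so that a second `JOIN` reconverges the warp.

```python
class Warp:
    """Toy model of one warp's thread mask and IPDOM stack (hypothetical names)."""

    def __init__(self, num_threads=4):
        self.num_threads = num_threads
        # One bit per thread; all threads start active.
        self.tmask = (1 << num_threads) - 1
        # IPDOM stack of saved thread masks.
        self.ipdom = []

    def tmc(self, count):
        """TMC count: activate `count` threads.

        Assumption: the lowest-numbered threads are the ones activated.
        """
        if not 0 <= count <= self.num_threads:
            raise ValueError("count out of range")
        self.tmask = (1 << count) - 1

    def split(self, predicate):
        """SPLIT predicate: apply the 'taken' mask, save 'not-taken' on the stack."""
        taken = self.tmask & predicate
        not_taken = self.tmask & ~predicate
        # Assumption: the pre-split mask is pushed first, so that a second
        # JOIN restores full convergence after both paths have run.
        self.ipdom.append(self.tmask)
        self.ipdom.append(not_taken)
        self.tmask = taken

    def join(self):
        """JOIN: restore the most recently saved thread mask."""
        self.tmask = self.ipdom.pop()
```

With the default four threads, `split(0b0101)` leaves threads 0 and 2 active for the taken path, the first `join()` switches to threads 1 and 3 for the not-taken path, and the second `join()` restores all four threads. Saved program counters are left out of this sketch to keep it short.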