vortex/docs/microarchitecture.md

# Vortex Microarchitecture

### Vortex GPGPU Execution Model

Vortex uses the SIMT (Single Instruction, Multiple Threads) execution model with a single warp issued per cycle.

- **Threads**
  - Smallest unit of computation
  - Each thread has its own register file (32 int + 32 fp registers)
  - Threads execute in parallel
- **Warps**
  - A logical clster of threads
  - Each thread in a warp execute the same instruction
    - The PC is shared; maintain thread mask for Writeback
  - Warp's execution is time-multiplexed at log steps
    - Ex. warp 0 executes at cycle 0, warp 1 executes at cycle 1

### Vortex RISC-V ISA Extension

- **Thread Mask Control**
  - Control the number of warps to activate during execution
  - `TMC` *count*: activate count threads
- **Warp Scheduling**
  - Control the number of warps to activate during execution
  - `WSPAWN` *count, addr*: activate count warps and jump to addr location
- **Control-Flow Divergence**
  - Control threads to activate when a branch diverges
    - `SPLIT` *predicate*: apply 'taken' predicate thread mask adn save 'not-taken' into IPDOM stack
    - `JOIN`: restore 'not-taken' thread mask
- **Warp Synchronization**
  - `BAR` *id, count*: stall warps entering barrier *id* until count is reached

### Vortex Pipeline/Datapath

![Image of Vortex Microarchitecture](./assets/img/vortex_microarchitecture_v2.png)

Vortex has a 5-stage pipeline: FI | ID | Issue | EX | WB.

- **Fetch**
  - Warp Scheduler
    - Track stalled & active warps, resolve branches and barriers, maintain split/join IPDOM stack
  - Instruction Cache
    - Retrieve instruction from cache, issue I-cache requests/responses
- **Decode**
  - Decode fetched instructions, notify warp scheduler when the following instructions are decoded:
    - Branch, tmc, split/join, wspawn
  - Precompute used_regs mask (needed for Issue stage)
- **Issue**
  - Scheduling
    - In-order issue (operands/execute unit ready), out-of-order commit
  - IBuffer
    - Store fetched instructions, separate queues per-warp, selects next warp through round-robin scheduling
  - Scoreboard
    - Track in-use registers
  - GPRs (General-Purpose Registers) stage
    - Fetch issued instruction operands and send operands to execute unit
- **Execute**
  - ALU Unit
    - Single-cycle operations (+,-,>>,<<,&,|,^), Branch instructions (Share ALU resources)
  - MULDIV Unit
    - Multiplier - done in 2 cycles
    - Divider - division and remainder, done in 32 cycles
      - Implements serial alogrithm (Stalls the pipeline)
  - FPU Unit
    - Multi-cycle operations, uses `FPnew` Library on ASIC, uses hard DSPs on FPGA
  - CSR Unit
    - Store constant status registers - device caps, FPU status flags, performance counters
    - Handle external CSR requests (requests from host CPU)
  - LSU Unit
    - Handle load/store operations, issue D-cache requests, handle D-cache responses
    - Commit load responses - saves storage, Scoreboard tracks completion
  - GPGPU Unit
    - Handle GPGPU instructions
      - TMC, WSPAWN, SPLIT, BAR
    - JOIN is handled by Warp Scheduler (upon SPLIT response)
- **Commit**
  - Commit
    - Update CSR flags, update performance counters
  - Writeback
    - Write result back to GPRs, notify Scoreboard (release in-use register), select candidate instruction (ALU unit has highest priority)
- **Clustering**
  - Group mulitple cores into clusters (optionally share L2 cache)
  - Group multiple clusters (optionally share L3 cache)
  - Configurable at build time
  - Default configuration:
    - #Clusters = 1
    - #Cores = 4
    - #Warps = 4
    - #Threads = 4
- **FPGA AFU Interface**
  - Manage CPU-GPU comunication
    - Query devices caps, load kernel instructions and resource buffers, start kernel execution, read destination buffers
  - Local Memory - GPU access to local DRAM
  - Reserved I/O addresses - redirect to host CPU, console output