updated documentation

This commit is contained in:
Blaise Tine
2024-01-03 10:23:38 -08:00
parent bd18b03cc3
commit f2e8317412
7 changed files with 53 additions and 110 deletions

View File

@@ -24,71 +24,57 @@ Vortex uses the SIMT (Single Instruction, Multiple Threads) execution model with
- Control the number of warps to activate during execution
- `WSPAWN` *count, addr*: activate count warps and jump to addr location
- **Control-Flow Divergence**
- Control threads to activate when a branch diverges
- `SPLIT` *predicate*: apply 'taken' predicate thread mask adn save 'not-taken' into IPDOM stack
- `JOIN`: restore 'not-taken' thread mask
- Control threads activation when a branch diverges
- `SPLIT` *taken, predicate*: apply predicate thread mask and save current state into IPDOM stack
- `JOIN`: pop IPDOM stack to restore thread mask
- `PRED` *predicate, restore_mask*: thread predicate instruction
- **Warp Synchronization**
- `BAR` *id, count*: stall warps entering barrier *id* until count is reached
### Vortex Pipeline/Datapath
![Image of Vortex Microarchitecture](./assets/img/vortex_microarchitecture_v2.png)
![Image of Vortex Microarchitecture](./assets/img/vortex_microarchitecture.png)
Vortex has a 5-stage pipeline: FI | ID | Issue | EX | WB.
Vortex has a 6-stage pipeline:
- **Schedule**
- Warp Scheduler
- Schedule the next PC into the pipeline
- Track stalled, active warps
- IPDOM Stack
- Save split/join states for divergent threads
- Inflight Tracker
- Track in-flight instructions
- **Fetch**
- Warp Scheduler
- Track stalled & active warps, resolve branches and barriers, maintain split/join IPDOM stack
- Instruction Cache
- Retrieve instruction from cache, issue I-cache requests/responses
- Retrieve instructions from memory
- Handle I-cache requests/responses
- **Decode**
- Decode fetched instructions, notify warp scheduler when the following instructions are decoded:
- Branch, tmc, split/join, wspawn
- Precompute used_regs mask (needed for Issue stage)
- Decode fetched instructions
- Notify warp scheduler on control instructions
- **Issue**
- Scheduling
- In-order issue (operands/execute unit ready), out-of-order commit
- IBuffer
- Store fetched instructions, separate queues per-warp, selects next warp through round-robin scheduling
- Store decoded instructions in separate per-warp queues
- Scoreboard
- Track in-use registers
- GPRs (General-Purpose Registers) stage
- Fetch issued instruction operands and send operands to execute unit
- Check register use for decoded instructions
- Operands Collector
- Fetch the operands for issued instructions from the register file
- **Execute**
- ALU Unit
- Single-cycle operations (+,-,>>,<<,&,|,^), Branch instructions (Share ALU resources)
- MULDIV Unit
- Multiplier - done in 2 cycles
- Divider - division and remainder, done in 32 cycles
- Implements serial alogrithm (Stalls the pipeline)
- Handle arithmetic and branch operations
- FPU Unit
- Multi-cycle operations, uses `FPnew` Library on ASIC, uses hard DSPs on FPGA
- CSR Unit
- Store constant status registers - device caps, FPU status flags, performance counters
- Handle external CSR requests (requests from host CPU)
- Handle floating-point operations
- LSU Unit
- Handle load/store operations, issue D-cache requests, handle D-cache responses
- Commit load responses - saves storage, Scoreboard tracks completion
- GPGPU Unit
- Handle GPGPU instructions
- TMC, WSPAWN, SPLIT, BAR
- JOIN is handled by Warp Scheduler (upon SPLIT response)
- Handle load/store operations
- SFU Unit
- Handle warp control operations
- Handle Control Status Registers (CSRs) operations
- **Commit**
- Commit
- Update CSR flags, update performance counters
- Writeback
- Write result back to GPRs, notify Scoreboard (release in-use register), select candidate instruction (ALU unit has highest priority)
- **Clustering**
- Group mulitple cores into clusters (optionally share L2 cache)
- Group multiple clusters (optionally share L3 cache)
- Configurable at build time
- Default configuration:
- #Clusters = 1
- #Cores = 4
- #Warps = 4
- #Threads = 4
- **FPGA AFU Interface**
- Manage CPU-GPU comunication
- Query devices caps, load kernel instructions and resource buffers, start kernel execution, read destination buffers
- Local Memory - GPU access to local DRAM
- Reserved I/O addresses - redirect to host CPU, console output
- Write result back to the register file and update the Scoreboard.
### Vortex clustering architecture
- Sockets
- Grouping multiple cores sharing L1 cache
- Clusters
- Grouping of sockets sharing L2 cache