diff --git a/README.md b/README.md
index 71ba041e..80beccea 100644
--- a/README.md
+++ b/README.md
@@ -18,7 +18,7 @@ Directory structure
 
 - benchmarks: OpenCL and RISC-V benchmarks
 
-- docs: documentation.
+- docs: [documentation](https://github.com/vortexgpgpu/vortex-dev/blob/master/doc/Vortex.md).
 
 - hw: hardware sources.
 
diff --git a/doc/Codebase.md b/doc/Codebase.md
new file mode 100644
index 00000000..3d018855
--- /dev/null
+++ b/doc/Codebase.md
@@ -0,0 +1,35 @@
+# Vortex Codebase
+
+The directory/file layout of the Vortex codebase is as follows:
+
+- `benchmark`: contains opencl, risc-v, and vector tests
+    - `opencl`: contains basic kernel operation tests (e.g., vector add, transpose, dot product)
+    - `riscv`: contains official riscv tests which are pre-compiled into binaries
+    - `vector`: tests for vector instructions (not yet implemented)
+- `ci`: contains tests to be run during continuous integration (Travis CI)
+    - driver, opencl, riscv_isa, and runtime tests
+- `driver`: contains driver software implementation (software that is run on the host to communicate with the vortex processor)
+    - `opae`: contains code for driver that runs on FPGA
+    - `rtlsim`: contains code for driver that runs on local machine (driver built using verilator which converts rtl to c++ binary)
+    - `simx`: contains code for driver that runs on local machine (vortex)
+    - `include`: contains vortex.h which has the vortex API that is used by the drivers
+- `runtime`: contains software used inside kernel programs to expose GPGPU capabilities
+    - `include`: contains vortex API needed for runtime
+    - `linker`: contains linker file for compiling kernels
+    - `src`: contains implementation of vortex API (from include folder)
+    - `tests`: contains runtime tests
+        - `simple`: contains test for GPGPU functionality allowed in vortex
+- `simx`: contains simX, the cycle approximate simulator for vortex
+- `miscs`: contains old code that is no longer used
+- `hw`:
+    - `unit_tests`: contains unit test for RTL of cache and queue
+    - `syn`: contains all synthesis scripts (quartus and yosys)
+        - `quartus`: contains code to synthesize cache, core, pipeline, top, and vortex stand-alone
+    - `simulate`: contains RTL simulator (verilator)
+        - `testbench.cpp`: runs either the riscv, runtime, or opencl tests
+    - `opae`: contains source code for the accelerator functional unit (AFU) and code which programs the fpga
+    - `rtl`: contains rtl source code
+        - `cache`: contains cache subsystem code
+        - `fp_cores`: contains floating point unit code
+        - `interfaces`: contains code that handles communication for each of the units of the microarchitecture
+        - `libs`: contains general-purpose modules (e.g., buffers, encoders, arbiters, pipe registers)
\ No newline at end of file
diff --git a/doc/Images/vortex_microarchitecture_v2.png b/doc/Images/vortex_microarchitecture_v2.png
new file mode 100644
index 00000000..c0e85891
Binary files /dev/null and b/doc/Images/vortex_microarchitecture_v2.png differ
diff --git a/doc/Microarchitecture.md b/doc/Microarchitecture.md
new file mode 100644
index 00000000..1b410066
--- /dev/null
+++ b/doc/Microarchitecture.md
@@ -0,0 +1,94 @@
+# Vortex Microarchitecture
+
+### Vortex GPGPU Execution Model
+
+Vortex uses the SIMT (Single Instruction, Multiple Threads) execution model with a single warp issued per cycle.
+
+- **Threads**
+    - Smallest unit of computation
+    - Each thread has its own register file (32 int + 32 fp registers)
+    - Threads execute in parallel
+- **Warps**
+    - A logical cluster of threads
+    - Each thread in a warp executes the same instruction
+    - The PC is shared; maintain thread mask for Writeback
+    - Warp's execution is time-multiplexed at log steps
+        - Ex. warp 0 executes at cycle 0, warp 1 executes at cycle 1
+
+### Vortex RISC-V ISA Extension
+
+- **Thread Mask Control**
+    - Control the number of threads to activate during execution
+    - `TMC` *count*: activate count threads
+- **Warp Scheduling**
+    - Control the number of warps to activate during execution
+    - `WSPAWN` *count, addr*: activate count warps and jump to addr location
+- **Control-Flow Divergence**
+    - Control threads to activate when a branch diverges
+    - `SPLIT` *predicate*: apply 'taken' predicate thread mask and save 'not-taken' into IPDOM stack
+    - `JOIN`: restore 'not-taken' thread mask
+- **Warp Synchronization**
+    - `BAR` *id, count*: stall warps entering barrier *id* until count is reached
+
+### Vortex Pipeline/Datapath
+
+![Image of Vortex Microarchitecture](./Images/vortex_microarchitecture_v2.png)
+
+Vortex has a 5-stage pipeline: FI | ID | Issue | EX | WB.
+
+- **Fetch**
+    - Warp Scheduler
+        - Track stalled & active warps, resolve branches and barriers, maintain split/join IPDOM stack
+    - Instruction Cache
+        - Retrieve instruction from cache, issue I-cache requests/responses
+- **Decode**
+    - Decode fetched instructions, notify warp scheduler when the following instructions are decoded:
+        - Branch, tmc, split/join, wspawn
+    - Precompute used_regs mask (needed for Issue stage)
+- **Issue**
+    - Scheduling
+        - In-order issue (operands/execute unit ready), out-of-order commit
+    - IBuffer
+        - Store fetched instructions, separate queues per-warp, selects next warp through round-robin scheduling
+    - Scoreboard
+        - Track in-use registers
+    - GPRs (General-Purpose Registers) stage
+        - Fetch issued instruction operands and send operands to execute unit
+- **Execute**
+    - ALU Unit
+        - Single-cycle operations (+,-,>>,<<,&,|,^), Branch instructions (Share ALU resources)
+    - MULDIV Unit
+        - Multiplier - done in 2 cycles
+        - Divider - division and remainder, done in 32 cycles
+            - Implements serial algorithm (Stalls the pipeline)
+    - FPU Unit
+        - Multi-cycle operations, uses `FPnew` Library on ASIC, uses hard DSPs on FPGA
+    - CSR Unit
+        - Store constant status registers - device caps, FPU status flags, performance counters
+        - Handle external CSR requests (requests from host CPU)
+    - LSU Unit
+        - Handle load/store operations, issue D-cache requests, handle D-cache responses
+        - Commit load responses - saves storage, Scoreboard tracks completion
+    - GPGPU Unit
+        - Handle GPGPU instructions
+            - TMC, WSPAWN, SPLIT, BAR
+            - JOIN is handled by Warp Scheduler (upon SPLIT response)
+- **Commit**
+    - Commit
+        - Update CSR flags, update performance counters
+    - Writeback
+        - Write result back to GPRs, notify Scoreboard (release in-use register), select candidate instruction (ALU unit has highest priority)
+- **Clustering**
+    - Group multiple cores into clusters (optionally share L2 cache)
+    - Group multiple clusters (optionally share L3 cache)
+    - Configurable at build time
+    - Default configuration:
+        - #Clusters = 1
+        - #Cores = 4
+        - #Warps = 4
+        - #Threads = 4
+- **FPGA AFU Interface**
+    - Manage CPU-GPU communication
+        - Query device caps, load kernel instructions and resource buffers, start kernel execution, read destination buffers
+    - Local Memory - GPU access to local DRAM
+    - Reserved I/O addresses - redirect to host CPU, console output
\ No newline at end of file
diff --git a/doc/Simulation.md b/doc/Simulation.md
index 3b66a14f..dfd042bb 100644
--- a/doc/Simulation.md
+++ b/doc/Simulation.md
@@ -24,12 +24,27 @@ Running tests under specific drivers (rtlsim,simx,fpga) is done using the script
 - *L3cache* - used to enable the shared l3cache among the Vortex clusters.
 - *Driver* - used to specify which driver to run the Vortex simulation (either rtlsim, vlsim, fpga, or simx).
 - *Debug* - used to enable debug mode for the Vortex simulation.
-- *Scope* - 
-- *Perf* - is used to enable the detailed performance counters within the Vortex simulation.
-- *App* - is used to specify which test/benchmark to run in the Vortex simulation. The main choices are vecadd, sgemm, basic, demo, and dogfood. Other tests/benchmarks are located in the `/benchmarks/opencl` folder though not all of them work wit the current version of Vortex.
-- *Args* - 
+- *Perf* - used to enable the detailed performance counters within the Vortex simulation.
+- *App* - used to specify which test/benchmark to run in the Vortex simulation. The main choices are vecadd, sgemm, basic, demo, and dogfood. Other tests/benchmarks are located in the `/benchmarks/opencl` folder, though not all of them work with the current version of Vortex.
+- *Args* - used to pass additional arguments to the application.
 
 Example use of command line arguments: Run the sgemm benchmark using the vlsim driver with a Vortex configuration of 1 cluster, 4 cores, 4 warps, and 4 threads.
 
     $ ./ci/blackbox.sh --clusters=1 --cores=4 --warps=4 --threads=4 --driver=vlsim --app=sgemm
 
+Output from terminal:
+```
+Create context
+Create program from kernel source
+Upload source buffers
+Execute the kernel
+Elapsed time: 2463 ms
+Download destination buffer
+Verify result
+PASSED!
+PERF: core0: instrs=90802, cycles=52776, IPC=1.720517
+PERF: core1: instrs=90693, cycles=53108, IPC=1.707709
+PERF: core2: instrs=90849, cycles=53107, IPC=1.710678
+PERF: core3: instrs=90836, cycles=50347, IPC=1.804199
+PERF: instrs=363180, cycles=53108, IPC=6.838518
+```
\ No newline at end of file
diff --git a/doc/Vortex.md b/doc/Vortex.md
new file mode 100644
index 00000000..f1d410a6
--- /dev/null
+++ b/doc/Vortex.md
@@ -0,0 +1,31 @@
+# Vortex Documentation
+
+### Table of Contents
+
+- [Vortex Codebase Layout](https://github.com/vortexgpgpu/vortex-dev/blob/master/doc/Codebase.md)
+- [Vortex Microarchitecture and Extended RISC-V ISA](https://github.com/vortexgpgpu/vortex-dev/blob/master/doc/Microarchitecture.md)
+- Vortex Software
+- [Vortex Simulation](https://github.com/vortexgpgpu/vortex-dev/blob/master/doc/Simulation.md)
+- [FPGA Configuration, Program and Test](https://github.com/vortexgpgpu/vortex-dev/blob/master/doc/Flubber_FPGA_Startup_Guide.md)
+- Debugging
+- Useful Links
+
+### Quick Start
+
+Setup Vortex environment:
+```
+$ export RISCV_TOOLCHAIN_PATH=/opt/riscv-gnu-toolchain
+$ export PATH=/opt/verilator/bin:$PATH
+$ export VERILATOR_ROOT=/opt/verilator
+```
+
+Test Vortex with different drivers and configurations:
+- Run basic driver test with rtlsim driver and Vortex config of 2 clusters, 2 cores, 2 warps, 4 threads
+
+    $ ./ci/blackbox.sh --clusters=2 --cores=2 --warps=2 --threads=4 --driver=rtlsim --app=basic
+- Run demo driver test with vlsim driver and Vortex config of 1 cluster, 4 cores, 4 warps, 2 threads
+
+    $ ./ci/blackbox.sh --clusters=1 --cores=4 --warps=4 --threads=2 --driver=vlsim --app=demo
+- Run dogfood driver test with simx driver and Vortex config of 4 clusters, 4 cores, 8 warps, 6 threads
+
+    $ ./ci/blackbox.sh --clusters=4 --cores=4 --warps=8 --threads=6 --driver=simx --app=dogfood
\ No newline at end of file
diff --git a/evaluation/scripts/README.txt b/evaluation/scripts/README.txt
index 908b0147..dbf20831 100644
--- a/evaluation/scripts/README.txt
+++ b/evaluation/scripts/README.txt
@@ -5,19 +5,16 @@ Description: Makes the build in the opae directory with the specified core
 exists, a make clean command is ran before the build. Script waits until the
 inteldev script or quartus program is finished running.
 
-Usage: ./build.sh -c [1|2|4|8|16] [-p perf] [-w wait]
+Usage: ./build.sh -c [1|2|4|8|16] [-p [y|n]]
 
 Options:
     -c
         Core count (1, 2, 4, 8, or 16).
 
     -p
-        Performance profiling enable. Changes the source file in the
+        Performance profiling enable (y or n). Changes the source file in the
         opae directory to include/exclude "+define+PERF_ENABLE".
 
-    -w
-        Wait for the build to complete
-
 
 _______________________________________________________________________________
 
@@ -27,6 +24,7 @@ Description: Runs build.sh with performance profiling enabled for all valid
 core configurations.
 
 _______________________________________________________________________________
+_______________________________________________________________________________
 
 
 -program_fpga.sh-
@@ -41,6 +39,7 @@ Options:
         Core count (1, 2, 4, 8, or 16).
 
 _______________________________________________________________________________
+_______________________________________________________________________________
 
 
 -gather_perf_results.sh-
@@ -65,3 +64,53 @@ _______________________________________________________________________________
 
 Description: Programs fpga and runs gather_perf_results.sh for all valid core
 configurations. All builds should already be made before running this.
+
+_______________________________________________________________________________
+_______________________________________________________________________________
+
+
+-export_csv.sh-
+
+Description: Creates specified .csv output file from an input directory, file,
+and parameter. The .csv file contains two columns: cores, and the input
+parameter. The output file is located within the directory specified with -d.
+
+Usage: ./export_csv.sh -c [cores] -d [directory] -i [input filename] -o
+       [output filename] -p '[parameter]'
+
+Example: ./export_csv.sh -c 16 -d perf_2021_03_07 -i sgemm.result -o output.csv
+         -p 'PERF: scoreboard stalls'
+
+Options:
+    -c
+        Upper limit of cores to be read in. Core directories should exist in
+        the directory specified by -d e.g. 1c, 2c, 4c for -c 4.
+
+    -d
+        The directory of the form perf_{date} located in the evaluation
+        directory.
+
+    -i
+        The input filename located in each core directory within the
+        directory specified by -d.
+
+    -o
+        The output filename to be created within the directory specified
+        by -d.
+
+    -p
+        The parameter corresponding to the core count in the .csv file. The
+        full name of the parameter from the start of the line should be
+        inputted to avoid the parameter name being matched multiple times.
+
+_______________________________________________________________________________
+
+
+-export_ipc_csv.sh-
+
+Description: Runs export_csv.sh for the parameter IPC.
+
+Usage: ./export_ipc_csv.sh -c [cores] -d [directory] -i [input filename] -o
+       [output filename]
+
+Example: ./export_ipc_csv.sh -c 16 -d perf_2021_03_07 -i sgemm.result -o output.csv
diff --git a/evaluation/scripts/export_csv.sh b/evaluation/scripts/export_csv.sh
new file mode 100755
index 00000000..8f95a71b
--- /dev/null
+++ b/evaluation/scripts/export_csv.sh
@@ -0,0 +1,33 @@
+#!/bin/bash
+
+while getopts c:d:i:o:p: flag
+do
+    case "${flag}" in
+        c) cores=${OPTARG};; #1, 2, 4, 8, 16
+        d) dir=${OPTARG};; #directory name (e.g. perf_2021_03_07)
+        i) ifile=${OPTARG};; #input filename
+        o) ofile=${OPTARG};; #output filename
+        p) param=${OPTARG};; #parameter to be made into csv
+    esac
+done
+
+if [[ ! "$cores" =~ ^(1|2|4|8|16)$ ]]; then
+    echo 'Invalid parameter for argument -c (1, 2, 4, 8, or 16 expected)'
+    exit 1
+fi
+
+if [ -z "$ifile" ]; then
+    echo 'No input filename given for argument -i'
+    exit 1
+fi
+
+if [ -z "$dir" ]; then
+    echo 'No directory given for argument -d'
+    exit 1
+fi
+
+printf "cores,${param}\n" > "../${dir}/${ofile}"
+for ((i=1; i<=$cores; i=i*2)); do
+    printf "${i}," >> "../${dir}/${ofile}"
+    (sed -n "s/${param}=\(.*\)/\1/p" < "../${dir}/${i}c/${ifile}") >> "../${dir}/${ofile}"
+done
diff --git a/evaluation/scripts/export_ipc_csv.sh b/evaluation/scripts/export_ipc_csv.sh
new file mode 100755
index 00000000..f698525b
--- /dev/null
+++ b/evaluation/scripts/export_ipc_csv.sh
@@ -0,0 +1,32 @@
+#!/bin/bash
+
+while getopts c:d:i:o: flag
+do
+    case "${flag}" in
+        c) cores=${OPTARG};; #1, 2, 4, 8, 16
+        d) dir=${OPTARG};; #directory name (e.g. perf_2021_03_07)
+        i) ifile=${OPTARG};; #input filename
+        o) ofile=${OPTARG};; #output filename
+    esac
+done
+
+if [[ ! "$cores" =~ ^(1|2|4|8|16)$ ]]; then
+    echo 'Invalid parameter for argument -c (1, 2, 4, 8, or 16 expected)'
+    exit 1
+fi
+
+if [ -z "$ifile" ]; then
+    echo 'No input filename given for argument -i'
+    exit 1
+fi
+
+if [ -z "$dir" ]; then
+    echo 'No directory given for argument -d'
+    exit 1
+fi
+
+printf "cores,IPC\n" > "../${dir}/${ofile}"
+for ((i=1; i<=$cores; i=i*2)); do
+    printf "${i}," >> "../${dir}/${ofile}"
+    (sed -n "s/IPC=\(.*\)/\1/p" < "../${dir}/${i}c/${ifile}" | awk 'END {print $NF}') >> "../${dir}/${ofile}"
+done
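
Note on the extraction step: the sed/awk pipeline in export_ipc_csv.sh can be tried in isolation. The sketch below is not part of the patch. The helper name `extract_last_value` and the temporary sample file are hypothetical, and the sample lines are modeled on the PERF output shown in doc/Simulation.md. It assumes an awk that keeps the last record's fields available in the `END` block, as gawk and mawk do.

```bash
#!/bin/bash
# Hypothetical helper (not in the diff): print the value of the last
# "<param>=<value>" match in a result file, using the same sed/awk
# pipeline as export_ipc_csv.sh.
extract_last_value() {
    local param="$1" file="$2"
    # sed strips "<param>=" from every matching line and prints it;
    # awk then prints the last field of the last printed line.
    sed -n "s/${param}=\(.*\)/\1/p" "$file" | awk 'END {print $NF}'
}

# Sample input modeled on the PERF lines in doc/Simulation.md.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
PERF: core0: instrs=90802, cycles=52776, IPC=1.720517
PERF: core1: instrs=90693, cycles=53108, IPC=1.707709
PERF: instrs=363180, cycles=53108, IPC=6.838518
EOF

extract_last_value IPC "$sample"   # prints 6.838518
```

Because every per-core line also matches `IPC=`, the `awk 'END {print $NF}'` stage is what selects the aggregate value on the last PERF line. export_csv.sh has no such stage, which is why its README entry asks for the full parameter name from the start of the line.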