Commit Graph

  • 1cfab40711 flash: Do Oi rescale with PV Hansung Kim 2024-08-30 20:09:10 -07:00
  • adf717eb14 common.mk: Embed operand.c to ELF, track header dependency Hansung Kim 2024-08-30 17:24:43 -07:00
  • 986d507223 flash: Fix single-tile GEMM for warp-specialized Hansung Kim 2024-08-30 17:12:46 -07:00
  • 72b6004e24 flash: Fix online softmax for warp-specialized Hansung Kim 2024-08-29 21:50:02 -07:00
  • ee0295cbef sgemm_impl: Accept threads_per_threadblock in load_tile_to_smem Hansung Kim 2024-08-29 21:43:57 -07:00
  • fd1ab358fa flash: Add DOUBLE_BUF compile-time param (wip) Hansung Kim 2024-08-29 14:18:32 -07:00
  • 5ba06dfd9d flash: Incomplete parallel stage-2 rowmax Hansung Kim 2024-08-29 13:29:00 -07:00
  • 4260bf7d6e Generate S matrix, pull out FA stuff from basic script Hansung Kim 2024-08-28 16:13:38 -07:00
  • 3f20dd59c0 flash: Supply correct tile dims to single_tile Hansung Kim 2024-08-20 19:50:45 -07:00
  • 091f40c365 sgemm_impl: Parameterize BM/BN/BK in single_tile Hansung Kim 2024-08-20 19:41:34 -07:00
  • dde0372769 flash: Enable skipping Q*K for larger dimensions Hansung Kim 2024-08-20 19:15:16 -07:00
  • 526c2bd334 sgemm_impl: load_tile: accept k_index for consistency + fix gmem addr gen Hansung Kim 2024-08-20 17:46:35 -07:00
  • 60aec1de8d flash.py: Fix row-wise scaling of O, col_to_save Hansung Kim 2024-08-20 14:49:25 -07:00
  • 615d36a5c2 flash: Reduce smem use for rowmax; verify result Hansung Kim 2024-08-20 14:34:45 -07:00
  • d8d5df64e6 flash: Fix load addr for V tile; test with seqlen=128 Hansung Kim 2024-08-20 14:34:09 -07:00
  • df3c41aa0d flash: data copy func for easy debugging Hansung Kim 2024-08-19 21:41:37 -07:00
  • 2f7fb372f1 Fix range for Hansung Kim 2024-08-19 21:19:16 -07:00
  • 09afd43904 More flash in generate_matrix Hansung Kim 2024-08-19 21:16:26 -07:00
  • 351e17c849 Separate golden script for flashattn Hansung Kim 2024-08-19 21:16:01 -07:00
  • 4080dec9d6 flash: Do exponential approx to rowsum and Oi as well Hansung Kim 2024-08-19 20:52:57 -07:00
  • f6cc61241b flash: 2nd-order taylor approx of exponential for P Hansung Kim 2024-08-19 20:12:24 -07:00
  • 68eb271916 Add operand.c to the link script Hansung Kim 2024-08-19 18:09:16 -07:00
  • 64e48de8af flash: Do accumulation of PV into O using the single_tile API Hansung Kim 2024-08-19 18:03:06 -07:00
  • 03c61d72ff sgemm_impl: Add param to load accumulation tile in single_tile Hansung Kim 2024-08-19 18:08:25 -07:00
  • 134ba825de sgemm_impl: Fix typo bug for BK_adjusted Hansung Kim 2024-08-19 18:02:00 -07:00
  • 3f4abc542c tensor: Fix dimensions and makefile Hansung Kim 2024-08-19 17:37:26 -07:00
  • a98da9e3ca flash: Add missing accum reg init and fix barrier count Hansung Kim 2024-08-19 16:15:46 -07:00
  • 7ac038fadf sgemm_impl: Rename initialize_C Hansung Kim 2024-08-19 16:12:35 -07:00
  • 4aba018733 sgemm_impl: Fix wrong barrier count; add barrier for write_to_smem Hansung Kim 2024-08-19 15:33:23 -07:00
  • e93e54cdec sgemm_impl: Drop volatile qualifier Hansung Kim 2024-08-19 15:19:35 -07:00
  • 1e042af571 flash: Write and verify O = O + PV step Hansung Kim 2024-08-19 13:18:27 -07:00
  • 42ddb9a48e sgemm_impl: Accept layout template param at gemm_single_tile and wmma_load Hansung Kim 2024-08-19 13:16:22 -07:00
  • 1b133e7b5c sgemm_impl: Rename dmem load function Hansung Kim 2024-08-18 22:25:01 -07:00
  • 46b5047775 sgemm_impl: Remove GMEM_COALESCED_A option Hansung Kim 2024-08-18 22:21:17 -07:00
  • 04643fa64d sgemm_impl: Refactor dmem_load into one unified logic Hansung Kim 2024-08-18 20:21:23 -07:00
  • b44b202a21 sgemm_impl: Rename to wmma Hansung Kim 2024-08-18 16:21:22 -07:00
  • b978bf8757 sgemm_impl: Split tile offset addr gen from wmma store Hansung Kim 2024-08-18 16:10:29 -07:00
  • 90f6effa97 flash: Pass smem_P arg to softmax func Hansung Kim 2024-08-18 15:21:05 -07:00
  • d0809d292a sgemm: Specify A/B tile SMEM address via template args Hansung Kim 2024-08-16 16:27:35 -07:00
  • 64b9717064 sgemm_tcore: Remove duplicate float_type decl Hansung Kim 2024-08-16 16:26:18 -07:00
  • d3de1b674a flash: Compute exponents using prev/next/this rowmax values Hansung Kim 2024-08-15 22:09:13 -07:00
  • be08204e65 flash: Do proper allocation and init of QK/V/O tile Hansung Kim 2024-08-15 21:26:14 -07:00
  • 0ea27dd15a flash: gitignore Hansung Kim 2024-08-15 21:04:59 -07:00
  • e0daf226ef flash: Change kernel arg to contain qkv; strip stimulus gen from host code Hansung Kim 2024-08-15 21:03:02 -07:00
  • a1858e0c80 sgemm_impl: Parameterize BK/TCK by FP_SIZE Hansung Kim 2024-08-15 20:33:33 -07:00
  • fd2ff6208d Generate golden data for flash in generate_matrix.py Hansung Kim 2024-08-15 17:41:04 -07:00
  • ac44633b39 flash: Compile time flag for skipping GEMM Hansung Kim 2024-08-15 17:40:32 -07:00
  • f844d96eea flash: Initialize rowmax/rowsum cache in sharedmem Hansung Kim 2024-08-15 17:28:36 -07:00
  • 745aa098ed flash: Optimize spad use, fix rowsum Hansung Kim 2024-08-15 16:54:15 -07:00
  • e809d25305 flash: Fix rowsum and write fake exp Hansung Kim 2024-08-15 16:32:21 -07:00
  • 53dfc690b9 flash: Allocate smem properly for rowsum and scratch Hansung Kim 2024-08-14 21:50:20 -07:00
  • 9cabe3413b Fix overlapping smem in rowmax Hansung Kim 2024-08-14 21:09:47 -07:00
  • 692d028afd Add flash attention kernel skeleton Hansung Kim 2024-08-14 20:46:09 -07:00
  • 014f7cd06f sgemm_tcore: Unpack arg params, remove threadblock_dim_y Hansung Kim 2024-08-14 20:34:49 -07:00
  • 70919c39c9 Encode dependency to sgemm header in makefile Hansung Kim 2024-08-14 20:03:07 -07:00
  • 1b1264207b sgemm_tcore: Add compile-time write_to_gmem param to thread_block_gemm Hansung Kim 2024-08-14 17:48:31 -07:00
  • ee6339a35f sgemm_tcore: Split all impl code into sgemm_impl.hpp Hansung Kim 2024-08-14 16:24:48 -07:00
  • 0534e5d1f6 sgemm_tcore: Fix addr gen for GMEM->SMEM for M-major A Hansung Kim 2024-08-14 15:28:52 -07:00
  • 409424b032 sgemm_tcore: Fix fp16 addr gen in vx_wmma_load Hansung Kim 2024-08-13 14:34:03 -07:00
  • e69fbea83a sgemm_tcore: Fix casting error Hansung Kim 2024-08-12 17:57:50 -07:00
  • 95e3e96c6c tensor: Change B in-memory layout to column-major Hansung Kim 2024-08-12 15:20:55 -07:00
  • 07dd9e35a0 tensor: Fix dimensions for fp16 in script Hansung Kim 2024-08-12 15:20:27 -07:00
  • c1906ebb4f tensor: Embed binary instead of hardcoding literals Hansung Kim 2024-07-31 16:07:55 -07:00
  • 1b5daccac9 tensor: Generate fp16-packed matrix in script Hansung Kim 2024-07-31 16:07:03 -07:00
  • 4fddca3d1a fp16 kernel Richard Yan 2024-08-06 02:43:44 -07:00
  • ea4819702e oopsie doopsie Richard Yan 2024-08-06 02:43:27 -07:00
  • a12f2c296c tensor: Update readme Hansung Kim 2024-07-31 11:55:28 -07:00
  • 446b1a4c2e tensor: Add readme Hansung Kim 2024-07-31 11:52:30 -07:00
  • 285776404f tensor: Fix tensor unittest kernel Hansung Kim 2024-07-31 11:49:41 -07:00
  • 29f7290948 tensor: Fix correctness script Hansung Kim 2024-07-31 11:39:50 -07:00
  • 88cddc2b66 sgemm_tcore: Support data move for fp16-packed elements Hansung Kim 2024-07-30 18:07:34 -07:00
  • 7f26548724 sgemm_tcore: Fix mem addr stride to 4 Hansung Kim 2024-07-30 14:06:46 -07:00
  • 5d5a6fbad2 sgemm_tcore: Template-ize kernel code Hansung Kim 2024-07-29 20:05:58 -07:00
  • 5f342914bd sgemm_tcore: Support fp16 input generation in host code Hansung Kim 2024-07-29 17:18:35 -07:00
  • bca53a9c76 sgemm_tcore: Skip load at last k-iter; do DMA by default Hansung Kim 2024-07-19 16:37:51 -07:00
  • 1f844fa9e9 Set BM==BN==64, update doc Hansung Kim 2024-07-19 16:37:15 -07:00
  • 02feb36b12 idle: Use barriers instead to hang the core Hansung Kim 2024-06-22 01:37:00 -07:00
  • 11e6d34e1c Add idle kernel Hansung Kim 2024-06-20 14:00:32 -07:00
  • 63418a7496 sgemm_gemmini_dma: Skip mvout to scratchpad Hansung Kim 2024-06-19 20:49:44 -07:00
  • 12a96d9c16 Merge branch 'kernels' of https://github.com/hansungk/vortex-private into kernels Richard Yan 2024-06-19 17:46:24 -07:00
  • a1e165724f skip move to spad Richard Yan 2024-06-19 17:45:58 -07:00
  • c06cc40e59 make non dma gemmini use 64x64 tile size Richard Yan 2024-06-19 17:45:01 -07:00
  • bebdd3353e Use SWISH in activate_block for tcore and gemmini Hansung Kim 2024-06-19 15:41:50 -07:00
  • ae9e707280 sgemm_{gemmini_dma,tcore}: Separate activate_block Hansung Kim 2024-06-18 17:59:46 -07:00
  • b586e0f881 sgemm_gemmini_dma: Update activation to match tcore Hansung Kim 2024-06-18 15:30:12 -07:00
  • 50b843d8c4 sgemm_tcore: Fix address overlap for DMA Hansung Kim 2024-06-18 15:06:07 -07:00
  • 36b02ad595 sgemm_tcore: Add warp-specialized kernel with activations Hansung Kim 2024-06-17 19:14:33 -07:00
  • 1a44063c5d sgemm_gemmini_dma: Initial activation kernel with gemmini+DMA Hansung Kim 2024-06-17 16:56:29 -07:00
  • 85cace9524 sgemm_tcore: Fix smem allocation for non-dma Hansung Kim 2024-06-15 01:28:27 -07:00
  • cfb6ae4a91 sgemm_tcore: Fix wrong double-buf addr for wmma_load Hansung Kim 2024-06-15 00:51:35 -07:00
  • 9d6ff196b3 sgemm_tcore: Use old opcodes to match frozen rtl Hansung Kim 2024-06-15 00:26:57 -07:00
  • 095ccfd79a sgemm_gemmini_duo: Check in serialized kernel as separate file Hansung Kim 2024-06-12 22:43:36 -07:00
  • 1f26b4ef10 Remove checked in binary Hansung Kim 2024-06-12 22:03:29 -07:00
  • 95e9adb2d0 sgemm_gemmini_duo: Fix device addr in main.cpp Hansung Kim 2024-06-12 21:57:24 -07:00
  • f5d82f85e5 sgemm_gemmini_duo: Split per-gemmini code to function Hansung Kim 2024-06-12 21:17:03 -07:00
  • ce4f3a24e3 sgemm_tcore: Replace hardcoded NUM_LANES with NUM_THREADS Hansung Kim 2024-06-12 21:01:37 -07:00
  • 91efc0fc14 Check in VX_config.h with 4core/8warp/8threads default Hansung Kim 2024-06-12 20:52:08 -07:00
  • 21452661f2 sgemm_tcore: Fix double-buffered addr for GEMMINI_DMA Hansung Kim 2024-06-12 00:48:56 -07:00
  • 635da96154 sgemm_tcore: Constify smem pointer for wmma_load Hansung Kim 2024-06-11 22:49:59 -07:00
  • f73029889b oopsie Richard Yan 2024-06-12 13:34:19 -07:00