
Releases: DefTruth/CUDA-Learn-Notes

v3.0.2

24 Feb 01:30
a9e2d17

What's Changed

New Contributors

  • @wplf made their first contribution in #247

Full Changelog: v3.0.1...v3.0.2

v3.0.1

06 Feb 12:08
ee9f706

What's Changed

Full Changelog: v3.0.0...v3.0.1

v3.0.0

22 Jan 10:08
7f35ae1

What's Changed

Full Changelog: v2.6.15...v3.0.0

🔥FFPA L1 release

08 Jan 03:38
62cb712

What's Changed

Full Changelog: v2.6.14...v2.6.15

QKV Fine-grained Tiling

03 Jan 08:51
82f1d04

What's Changed

  • [ELU] support ELU F32/F16 kernel✔️ by @southkarl in #194
  • [HARDSHRINK][FP16] support HARDSHRINK F32/FP16 kernel by @southkarl in #195
  • [swizzle] update smem swizzle layout tools✔️ by @DefTruth in #196
  • [swizzle] update smem swizzle layout tools✔️ by @DefTruth in #197
  • [swizzle] add padding -> swizzle layout tools🎉 by @DefTruth in #198
  • [HGEMM] HGEMM TN A&B SMEM Swizzle✔️ by @DefTruth in #199
  • [HGEMM] HGEMM TN A&B SMEM Swizzle✔️ by @DefTruth in #200
  • [FA2] shared-qkv + HMMA F32F16F16F32✔️ by @DefTruth in #201
  • [HARDSWISH] HARDSWISH F32/F16 kernel✔️ by @southkarl in #202
  • [FA2] kOStorageAccFloat32 flag -> shared-qkv✔️ by @DefTruth in #203
  • [FA2] kOStorageAccFloat32 -> share-qkv✔️ by @DefTruth in #204
  • [FA2] kOStorageAccFloat32 -> share-qkv✔️ by @DefTruth in #205
  • [FA2] share-kv + MMA F32F16F16F16F32✔️ by @DefTruth in #206
  • [FA2] share-kv + MMA F32F16F16F16F32✔️ by @DefTruth in #207
  • [FA2] tiling-kv + MMA F32F16F16F16F32✔️ by @DefTruth in #208
  • [FA2] tiling-qkv + MMA F32F16F16F32✔️ by @DefTruth in #209
  • [FA2] tiling-qkv + MMA F32F16F16F32✔️ by @DefTruth in #210
  • [FA2] flash-attn-mma fully tiling-qkv🎉 by @DefTruth in #211
  • [FA2] flash-attn-mma fully tiling-qkv🎉 by @DefTruth in #212
  • [FA2] tiling-qkv F32/F16 + swizzle q/qk/qkv🎉 by @DefTruth in #213

New Contributors

Full Changelog: v2.6.13...v2.6.14

FA2 QKV SMEM Swizzle✔️

28 Dec 05:38
b7966f0

What's Changed

  • [Doc] Refactor README for better readability✔️ by @DefTruth in #186
  • [FA2] shared-kv + fully smem swizzle✔️ by @DefTruth in #187
  • [FA2] tiling-qk + fully smem swizzle✔️ by @DefTruth in #188
  • [FA2] shared-qkv + fully swizzle q/qk/qkv✔️ by @DefTruth in #189
  • [FA2] hotfix launch setting -> shared-kv d=256✔️ by @DefTruth in #190
  • [FA2] support shared-qkv + O s2g kernel✔️ by @DefTruth in #191
  • [FA2] Update RTX 3080 Laptop performance✔️ by @DefTruth in #192
  • [Doc] Refactor README for better readability✔️ by @DefTruth in #193

Full Changelog: v2.6.12...v2.6.13

🎉FA2/HGEMM SMEM Swizzle

25 Dec 05:52
bdd361a

What's Changed

  • [FA2] split-q + tiling-qk D=512 performance🎉 by @DefTruth in #178
  • [FA2] split-q + tiling-qk D=512 performance🎉 by @DefTruth in #179
  • [FA2] split-q + tiling-qk D=512 performance🎉 by @DefTruth in #180
  • [Doc] Refactor README.md to improve readability✔️ by @DefTruth in #181
  • [Doc] Refactor README.md for better readability✔️ by @DefTruth in #182
  • [FA2] flash-attn-mma 3080/L20/4090 bench✔️ by @DefTruth in #183
  • [FA2] flash-attn-mma 3080/L20/4090 bench✔️ by @DefTruth in #184
  • [FA2] fa2/hgemm manually smem swizzle🎉 by @DefTruth in #185

flash_attn_mma_stages_split_q_tiling_qk_swizzle_kernel

void flash_attn_mma_stages_split_q_tiling_qk_swizzle_kernel<512, 16, 8, 16, 8, 1, 8, 1, 1, 16, 1, 64, 2, 0, 0, 8>(__half *, __half *, __half *, __half *, int, int) (8, 48, 1)x(256, 1, 1), Context 1, Stream 7, Device 0, CC 8.9
    Section: Command line profiler metrics
    ------------------------------------------------------------------ ----------- ------------
    Metric Name                                                        Metric Unit Metric Value
    ------------------------------------------------------------------ ----------- ------------
    sm__sass_l1tex_data_bank_conflicts_pipe_lsu_mem_shared_op_ldsm.avg                        0
    sm__sass_l1tex_data_bank_conflicts_pipe_lsu_mem_shared_op_ldsm.max                        0
    sm__sass_l1tex_data_bank_conflicts_pipe_lsu_mem_shared_op_ldsm.min                        0
    sm__sass_l1tex_data_bank_conflicts_pipe_lsu_mem_shared_op_ldsm.sum                        0
    ------------------------------------------------------------------ ----------- ------------
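The zero ldsm bank-conflict counters above come from the manual shared-memory swizzle. Below is a minimal sketch of the kind of XOR swizzle involved, assuming a row-major half tile with 64 halves (128 bytes) per row; the helper name and constants are illustrative, not the repository's exact implementation.

// Illustrative XOR swizzle for a row-major half tile in shared memory
// (hypothetical helper, not the repository's exact code). Each 64-half row is
// 128 bytes and spans all 32 banks; XOR-ing the 16-byte chunk index with the low
// bits of the row index permutes chunks per row, so the 8 addresses of an
// ldmatrix (ldsm) access fall into 8 different banks instead of conflicting.
#include <cstdint>
#include <cstdio>

__host__ __device__ inline uint32_t swizzle_offset(uint32_t row, uint32_t col) {
  // col is in half elements; 8 halves = 16 bytes = one ldmatrix fragment per lane.
  uint32_t chunk = col >> 3;                   // 16-byte chunk index within the row
  uint32_t swizzled = chunk ^ (row & 7);       // permute chunks across groups of 8 rows
  return row * 64 + swizzled * 8 + (col & 7);  // element offset into the SMEM tile
}

int main() {
  // Host-side sanity check: for a fixed chunk column, 8 consecutive rows should
  // land in 8 distinct 16-byte chunks, i.e. 8 distinct bank groups.
  for (uint32_t row = 0; row < 8; ++row) {
    printf("row %u -> chunk %u\n", row, (swizzle_offset(row, 0) % 64) / 8);
  }
  return 0;
}

The host check only illustrates the per-row permutation; the Nsight Compute metrics above show the end result on the actual kernel.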

Full Changelog: v2.6.11...v2.6.12

📚 Split Q + QK Fine-grained Tiling

23 Dec 02:41
d474791

What's Changed

  • [FA2] split-q + tiling-qk D=512 performance🎉 by @DefTruth in #177

📚 Split Q + QK Fine-grained Tiling (O(16xd) SRAM vs FA2 O(4xBrxd) SRAM, Headdim -> 1024)

Currently, for small-scale attention (B<=4, H<=48, SeqLen<=8192), these kernels can run faster than the official FA2/SDPA on some devices. For example, on NVIDIA RTX 3080 Laptop, 📚 Split Q + Fully Shared QKV SMEM can achieve 55 TFLOPS (D=64), which is almost ~1.5x 🎉 faster than FA2. Moreover, on NVIDIA L20, 📚 Split Q + QK Fine-grained Tiling can achieve 81 TFLOPS (D=512), which is almost ~1.4x 🎉 faster than SDPA (EFFICIENT_ATTENTION). However, for large-scale attention, there remains a performance gap. Performance is continuously being optimized. Stay tuned for updates ~

  • Example: B=1, H=8, N=8192, D=64 (NVIDIA RTX 3080 Laptop), Faster than FA2~🎉🎉
python3 flash_attn_mma.py --B 1 --H 8 --D 64 --N 8192 --iters 10 --torch # NVIDIA RTX 3080 Laptop
-------------------------------------------B=1, H=8, N=8192, D=64, Warmup: 1, Iters: 10-------------------------------------------
                  torch(unfused): ['-0.00514603 ', '0.05783081  ', '-0.00026727 '], time:20.999861ms, TFLOPS:6.67 (+0.00%)
            mma(split-kv+stage1): ['-0.00511169 ', '0.05795288  ', '-0.00029612 '], time:5.120730ms, TFLOPS:27.36 (+310.10%)
            mma(split-kv+stage2): ['-0.00511169 ', '0.05795288  ', '-0.00029612 '], time:5.004287ms, TFLOPS:28.00 (+2.33%)
             mma(split-q+stage1): ['-0.00511169 ', '0.05795288  ', '-0.00029612 '], time:3.462291ms, TFLOPS:40.47 (+44.54%)
             mma(split-q+stage2): ['-0.00511169 ', '0.05795288  ', '-0.00029612 '], time:3.658915ms, TFLOPS:38.30
   mma(split-q+share-qkv+stage1): ['-0.00511169 ', '0.05795288  ', '-0.00029612 '], time:2.551699ms, TFLOPS:54.91 (+35.69%)
   mma(split-q+share-qkv+stage2): ['-0.00511169 ', '0.05795288  ', '-0.00029612 '], time:2.532172ms, TFLOPS:55.34 (+0.77%)
    mma(split-q+share-kv+stage1): ['-0.00511169 ', '0.05795288  ', '-0.00029612 '], time:2.776575ms, TFLOPS:50.46
    mma(split-q+share-kv+stage2): ['-0.00511169 ', '0.05795288  ', '-0.00029612 '], time:2.596927ms, TFLOPS:53.96
                         (flash): ['-0.00516129 ', '0.05783081  ', '-0.00027728 '], time:3.776550ms, TFLOPS:37.10
----------------------------------------------------------------------------------------------------------------------------------
  • Example: B=1, H=8, N=8192, D=512 (NVIDIA RTX 3080 Laptop), FA2 not supported, QK Tiling Faster than SDPA~🎉🎉
python3 flash_attn_mma.py --B 1 --H 8 --N 8192 --iters 10 --show-all --sdpa --D 512 # NVIDIA RTX 3080 Laptop, Faster than SDPA
------------------------------------------B=1, H=8, N=8192, D=512, Warmup: 1, Iters: 10-------------------------------------------
   mma(split-q+tiling-qk+stage1): ['-0.00433731 ', '0.02165222  ', '-0.01544189 '], time:48.775554ms, TFLOPS:22.60 (+0.00%)
   mma(split-q+tiling-qk+stage2): ['-0.00433731 ', '0.02165222  ', '-0.01544189 '], time:47.503424ms, TFLOPS:23.20 (+2.68%)
                          (sdpa): ['-0.00438309 ', '0.02174377  ', '-0.01551056 '], time:66.486573ms, TFLOPS:16.58
----------------------------------------------------------------------------------------------------------------------------------
  • Example: B=1, H=48, N=16384, D=512 (NVIDIA L20), FA2 not supported, QK Tiling Faster than SDPA~🎉🎉
python3 flash_attn_mma.py --B 1 --H 48 --D 512 --N 16384 --show-all --check --iters 10 --sdpa
-----------------------------------------B=1, H=48, N=16384, D=512, Warmup: 1, Iters: 10------------------------------------------
   mma(split-q+tiling-qk+stage1): ['0.0079422   ', '-0.02334595 ', '0.00881958  '], time:387.384224ms, TFLOPS:68.28 (+0.00%)
   mma(split-q+tiling-qk+stage2): ['0.0079422   ', '-0.02334595 ', '0.00881958  '], time:325.593209ms, TFLOPS:81.24 (+18.98%)
                          (sdpa): ['0.00790405  ', '-0.02330017 ', '0.00875854  '], time:452.067018ms, TFLOPS:58.51
----------------------------------------------------------------------------------------------------------------------------------
  • 📚 Split Q + Fully Shared QKV SMEM (1/4 SRAM vs FA2)
// Q, K and V fully share the same shared memory, and Q is prefetched from SMEM to
// registers (s2r), improving block occupancy and reducing Q SMEM IO access.
__global__ void // Q, K, V, O -> [B, H, N, D]
flash_attn_mma_stages_split_q_shared_qkv_kernel(half* Q, half* K, half* V, half* O, ...);
  • 📚 Split Q + QK Fine-grained Tiling (O(16xd) SRAM vs FA2 O(4xBrxd) SRAM, Headdim -> 1024)
// Fine-grained tiling at the MMA level for Q and K results in a constant SRAM usage of
// 64 * kMmaAtomK for Q and K. For V, the SRAM complexity is O(kMmaAtomK * d), leading to
// an overall SRAM complexity of O(kMmaAtomK * d). Consequently, this approach allows us to
// extend D (head dimension) up to 1024. Performance tuning is ongoing; stay tuned for updates ~
__global__ void // Q, K, V, O -> [B, H, N, D]
flash_attn_mma_stages_split_q_tiling_qk_kernel(half* Q, half* K, half* V, half* O, ...);
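To make the SRAM figures above concrete, here is a minimal back-of-the-envelope budget for the three layouts, assuming Br = 64, kMmaAtomK = 16 and half-precision storage; the constants and helper names are illustrative, not the kernels' actual launch configuration.

// Rough per-block shared-memory budgets in bytes (illustrative constants only;
// the real kernels pick their tile shapes per architecture and head dimension).
#include <cstdio>

constexpr long long kBr = 64, kMmaAtomK = 16, kBytesPerHalf = 2;

// FA2-style baseline: roughly O(4 * Br * d) half elements resident in SMEM.
constexpr long long fa2_smem(long long d)        { return 4 * kBr * d * kBytesPerHalf; }
// Fully shared QKV: a single Br x d buffer reused for Q, K and V (~1/4 of FA2).
constexpr long long shared_qkv_smem(long long d) { return kBr * d * kBytesPerHalf; }
// QK fine-grained tiling: Q and K slices are each 64 x kMmaAtomK (constant size),
// V staging is kMmaAtomK x d, so the total grows as O(kMmaAtomK * d).
constexpr long long tiling_qk_smem(long long d) {
  return (2 * 64 * kMmaAtomK + kMmaAtomK * d) * kBytesPerHalf;
}

int main() {
  const long long dims[] = {64, 128, 256, 512, 1024};
  for (long long d : dims) {
    printf("d=%4lld  fa2~%7lld B  shared-qkv~%7lld B  tiling-qk~%7lld B\n",
           d, fa2_smem(d), shared_qkv_smem(d), tiling_qk_smem(d));
  }
  return 0;
}

At d=512 this puts tiling-qk at roughly 20 KB of SMEM per block, which is why the head dimension can be pushed toward 1024, while an FA2-style O(4 x Br x d) layout would already exceed the per-SM shared memory at that size.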

Full Changelog: v2.6.10...v2.6.11

📚FA2: QK Fine-grained Tiling

22 Dec 07:58
697e06f

What's Changed

  • [FA2] hotfix flash-attn-mma smem size setting✔️ by @DefTruth in #170
  • [FA2] reorder grid layout, boost 5~10% TFLOPS✔️ by @DefTruth in #171
  • [FA2] optimize block tiling for headdim >= 128✔️ by @DefTruth in #172
  • [FA2] flash-attn-mma tiling-qk for large d⚡️ by @DefTruth in #173
  • [FA2] fix tiling-qk misaligned address✔️ by @DefTruth in #174
  • [README] Refactor README.md✔️ by @DefTruth in #175
  • [README] Refactor README✔️ by @DefTruth in #176

📚 Split Q + QK Fine-grained Tiling (O(16xd) SRAM vs FA2 O(4xBrxd) SRAM, Headdim -> 1024)

// Fine-grained tiling at the MMA level for Q and K results in a constant SRAM usage of
// 64 * kMmaAtomK for Q and K. For V, the SRAM complexity is O(kMmaAtomK * d), leading to
// an overall SRAM complexity of O(kMmaAtomK * d). Consequently, this approach allows us to
// extend D (head dimension) up to 1024. Performance tuning is ongoing; stay tuned for updates ~
__global__ void // Q, K, V, O -> [B, H, N, D]
flash_attn_mma_stages_split_q_tiling_qk_kernel(half* Q, half* K, half* V, half* O, ...);

Full Changelog: v2.6.9...v2.6.10

FA2 Fully Shared QKV SMEM🎉

19 Dec 08:10
4687e1d

What's Changed

  • [FA2] Update flash-attn-mma shared-kv/qkv🎉 by @DefTruth in #163
  • [FA2] Update flash-attn-mma shared-kv/qkv🎉 by @DefTruth in #164
  • [FA2] Update flash-attn-mma shared-qkv🎉 by @DefTruth in #165
  • [FA2] Update flash-attn-mma shared-kv🎉 by @DefTruth in #166
  • [FA2] Update flash-attn-mma split-kv/q🎉 by @DefTruth in #167
  • [FA2] Update flash-attn-mma shared-qkv🎉 by @DefTruth in #168
  • [FA2] flash-attn-mma get rid of transpose-k✔️ by @DefTruth in #169

Full Changelog: v2.6.8...v2.6.9