
Releases: DefTruth/CUDA-Learn-Notes

v3.0.2

24 Feb 01:30
a9e2d17

What's Changed

New Contributors

  • @wplf made their first contribution in #247

Full Changelog: v3.0.1...v3.0.2

v3.0.1

06 Feb 12:08
ee9f706

What's Changed

Full Changelog: v3.0.0...v3.0.1

v3.0.0

22 Jan 10:08
7f35ae1

What's Changed

Full Changelog: v2.6.15...v3.0.0

🔥FFPA L1 release

08 Jan 03:38
62cb712

What's Changed

Full Changelog: v2.6.14...v2.6.15

QKV Fine-grained Tiling

03 Jan 08:51
82f1d04

What's Changed

  • [ELU] support ELU F32/F16 kernel✔️ by @southkarl in #194
  • [HARDSHRINK][FP16] support HARDSHRINK F32/FP16 kernel by @southkarl in #195
  • [swizzle] update smem swizzle layout tools✔️ by @DefTruth in #196
  • [swizzle] update smem swizzle layout tools✔️ by @DefTruth in #197
  • [swizzle] add padding -> swizzle layout tools🎉 by @DefTruth in #198
  • [HGEMM] HGEMM TN A&B SMEM Swizzle✔️ by @DefTruth in #199
  • [HGEMM] HGEMM TN A&B SMEM Swizzle✔️ by @DefTruth in #200
  • [FA2] shared-qkv + HMMA F32F16F16F32✔️ by @DefTruth in #201
  • [HARDSWISH] HARDSWISH F32/F16 kernel✔️ by @southkarl in #202
  • [FA2] kOStorageAccFloat32 flag -> shared-qkv✔️ by @DefTruth in #203
  • [FA2] kOStorageAccFloat32 -> share-qkv✔️ by @DefTruth in #204
  • [FA2] kOStorageAccFloat32 -> share-qkv✔️ by @DefTruth in #205
  • [FA2] share-kv + MMA F32F16F16F16F32✔️ by @DefTruth in #206
  • [FA2] share-kv + MMA F32F16F16F16F32✔️ by @DefTruth in #207
  • [FA2] tiling-kv + MMA F32F16F16F16F32✔️ by @DefTruth in #208
  • [FA2] tiling-qkv + MMA F32F16F16F32✔️ by @DefTruth in #209
  • [FA2] tiling-qkv + MMA F32F16F16F32✔️ by @DefTruth in #210
  • [FA2] flash-attn-mma fully tiling-qkv🎉 by @DefTruth in #211
  • [FA2] flash-attn-mma fully tiling-qkv🎉 by @DefTruth in #212
  • [FA2] tiling-qkv F32/F16 + swizzle q/qk/qkv🎉 by @DefTruth in #213

New Contributors

Full Changelog: v2.6.13...v2.6.14

FA2 QKV SMEM Swizzle✔️

28 Dec 05:38
b7966f0

What's Changed

  • [Doc] Refactor README for better readability✔️ by @DefTruth in #186
  • [FA2] shared-kv + fully smem swizzle✔️ by @DefTruth in #187
  • [FA2] tiling-qk + fully smem swizzle✔️ by @DefTruth in #188
  • [FA2] shared-qkv + fully swizzle q/qk/qkv✔️ by @DefTruth in #189
  • [FA2] hotfix launch setting -> shared-kv d=256✔️ by @DefTruth in #190
  • [FA2] support shared-qkv + O s2g kernel✔️ by @DefTruth in #191
  • [FA2] Update RTX 3080 Laptop performance✔️ by @DefTruth in #192
  • [Doc] Refactor README for better readability✔️ by @DefTruth in #193

Full Changelog: v2.6.12...v2.6.13

🎉FA2/HGEMM SMEM Swizzle

25 Dec 05:52
bdd361a

What's Changed

  • [FA2] split-q + tiling-qk D=512 performance🎉 by @DefTruth in #178
  • [FA2] split-q + tiling-qk D=512 performance🎉 by @DefTruth in #179
  • [FA2] split-q + tiling-qk D=512 performance🎉 by @DefTruth in #180
  • [Doc] Refactor README.md to improve readability✔️ by @DefTruth in #181
  • [Doc] Refactor README.md for better readability✔️ by @DefTruth in #182
  • [FA2] flash-attn-mma 3080/L20/4090 bench✔️ by @DefTruth in #183
  • [FA2] flash-attn-mma 3080/L20/4090 bench✔️ by @DefTruth in #184
  • [FA2] fa2/hgemm manually smem swizzle🎉 by @DefTruth in #185

flash_attn_mma_stages_split_q_tiling_qk_swizzle_kernel

void flash_attn_mma_stages_split_q_tiling_qk_swizzle_kernel<512, 16, 8, 16, 8, 1, 8, 1, 1, 16, 1, 64, 2, 0, 0, 8>(__half *, __half *, __half *, __half *, int, int) (8, 48, 1)x(256, 1, 1), Context 1, Stream 7, Device 0, CC 8.9
    Section: Command line profiler metrics
    ------------------------------------------------------------------ ----------- ------------
    Metric Name                                                        Metric Unit Metric Value
    ------------------------------------------------------------------ ----------- ------------
    sm__sass_l1tex_data_bank_conflicts_pipe_lsu_mem_shared_op_ldsm.avg                        0
    sm__sass_l1tex_data_bank_conflicts_pipe_lsu_mem_shared_op_ldsm.max                        0
    sm__sass_l1tex_data_bank_conflicts_pipe_lsu_mem_shared_op_ldsm.min                        0
    sm__sass_l1tex_data_bank_conflicts_pipe_lsu_mem_shared_op_ldsm.sum                        0
    ------------------------------------------------------------------ ----------- ------------
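The zero ldsm bank-conflict counters above come from the manual shared-memory swizzle. Below is a minimal sketch of the kind of XOR swizzle involved, assuming a row-major half tile with 64 halves (128 bytes) per row; the helper name and constants are illustrative, not the repository's exact implementation.

// Illustrative XOR swizzle for a row-major half tile in shared memory
// (hypothetical helper, not the repository's exact code). Each 64-half row is
// 128 bytes and spans all 32 banks; XOR-ing the 16-byte chunk index with the low
// bits of the row index permutes chunks per row, so the 8 addresses of an
// ldmatrix (ldsm) access fall into 8 different banks instead of conflicting.
#include <cstdint>
#include <cstdio>

__host__ __device__ inline uint32_t swizzle_offset(uint32_t row, uint32_t col) {
  // col is in half elements; 8 halves = 16 bytes = one ldmatrix fragment per lane.
  uint32_t chunk = col >> 3;                   // 16-byte chunk index within the row
  uint32_t swizzled = chunk ^ (row & 7);       // permute chunks across groups of 8 rows
  return row * 64 + swizzled * 8 + (col & 7);  // element offset into the SMEM tile
}

int main() {
  // Host-side sanity check: for a fixed chunk column, 8 consecutive rows should
  // land in 8 distinct 16-byte chunks, i.e. 8 distinct bank groups.
  for (uint32_t row = 0; row < 8; ++row) {
    printf("row %u -> chunk %u\n", row, (swizzle_offset(row, 0) % 64) / 8);
  }
  return 0;
}

The host check only illustrates the per-row permutation; the Nsight Compute metrics above show the end result on the actual kernel.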

Full Changelog: v2.6.11...v2.6.12

📚 Split Q + QK Fine-grained Tiling

23 Dec 02:41
d474791

What's Changed

  • [FA2] split-q + tiling-qk D=512 performance🎉 by @DefTruth in #177

📚 Split Q + QK Fine-grained Tiling (O(16xd) SRAM vs FA2 O(4xBrxd) SRAM, Headdim -> 1024)

Currently, for small-scale attention (B<=4, H<=48, SeqLen<=8192), these kernels can run faster than the official FA2/SDPA on some devices. For example, on NVIDIA RTX 3080 Laptop, 📚 Split Q + Fully Shared QKV SMEM can achieve 55 TFLOPS (D=64), which is almost ~1.5x 🎉 faster than FA2. Moreover, on NVIDIA L20, 📚 Split Q + QK Fine-grained Tiling can achieve 81 TFLOPS (D=512), which is almost ~1.4x 🎉 faster than SDPA (EFFICIENT_ATTENTION). However, for large-scale attention, there remains a performance gap. Performance is continuously being optimized. Stay tuned for updates ~

  • Example: B=1, H=8, N=8192, D=64 (NVIDIA RTX 3080 Laptop), Faster than FA2~🎉🎉
python3 flash_attn_mma.py --B 1 --H 8 --D 64 --N 8192 --iters 10 --torch # NVIDIA RTX 3080 Laptop
-------------------------------------------B=1, H=8, N=8192, D=64, Warmup: 1, Iters: 10-------------------------------------------
                  torch(unfused): ['-0.00514603 ', '0.05783081  ', '-0.00026727 '], time:20.999861ms, TFLOPS:6.67 (+0.00%)
            mma(split-kv+stage1): ['-0.00511169 ', '0.05795288  ', '-0.00029612 '], time:5.120730ms, TFLOPS:27.36 (+310.10%)
            mma(split-kv+stage2): ['-0.00511169 ', '0.05795288  ', '-0.00029612 '], time:5.004287ms, TFLOPS:28.00 (+2.33%)
             mma(split-q+stage1): ['-0.00511169 ', '0.05795288  ', '-0.00029612 '], time:3.462291ms, TFLOPS:40.47 (+44.54%)
             mma(split-q+stage2): ['-0.00511169 ', '0.05795288  ', '-0.00029612 '], time:3.658915ms, TFLOPS:38.30
   mma(split-q+share-qkv+stage1): ['-0.00511169 ', '0.05795288  ', '-0.00029612 '], time:2.551699ms, TFLOPS:54.91 (+35.69%)
   mma(split-q+share-qkv+stage2): ['-0.00511169 ', '0.05795288  ', '-0.00029612 '], time:2.532172ms, TFLOPS:55.34 (+0.77%)
    mma(split-q+share-kv+stage1): ['-0.00511169 ', '0.05795288  ', '-0.00029612 '], time:2.776575ms, TFLOPS:50.46
    mma(split-q+share-kv+stage2): ['-0.00511169 ', '0.05795288  ', '-0.00029612 '], time:2.596927ms, TFLOPS:53.96
                         (flash): ['-0.00516129 ', '0.05783081  ', '-0.00027728 '], time:3.776550ms, TFLOPS:37.10
----------------------------------------------------------------------------------------------------------------------------------
  • Example: B=1, H=8, N=8192, D=512 (NVIDIA RTX 3080 Laptop), FA2 not supported, QK Tiling Faster than SDPA~🎉🎉
python3 flash_attn_mma.py --B 1 --H 8 --N 8192 --iters 10 --show-all --sdpa --D 512 # NVIDIA RTX 3080 Laptop, Faster than SDPA
------------------------------------------B=1, H=8, N=8192, D=512, Warmup: 1, Iters: 10-------------------------------------------
   mma(split-q+tiling-qk+stage1): ['-0.00433731 ', '0.02165222  ', '-0.01544189 '], time:48.775554ms, TFLOPS:22.60 (+0.00%)
   mma(split-q+tiling-qk+stage2): ['-0.00433731 ', '0.02165222  ', '-0.01544189 '], time:47.503424ms, TFLOPS:23.20 (+2.68%)
                          (sdpa): ['-0.00438309 ', '0.02174377  ', '-0.01551056 '], time:66.486573ms, TFLOPS:16.58
----------------------------------------------------------------------------------------------------------------------------------
  • Example: B=1, H=48, N=16384, D=512 (NVIDIA L20), FA2 not supported, QK Tiling Faster than SDPA~🎉🎉
python3 flash_attn_mma.py --B 1 --H 48 --D 512 --N 16384 --show-all --check --iters 10 --sdpa
-----------------------------------------B=1, H=48, N=16384, D=512, Warmup: 1, Iters: 10------------------------------------------
   mma(split-q+tiling-qk+stage1): ['0.0079422   ', '-0.02334595 ', '0.00881958  '], time:387.384224ms, TFLOPS:68.28 (+0.00%)
   mma(split-q+tiling-qk+stage2): ['0.0079422   ', '-0.02334595 ', '0.00881958  '], time:325.593209ms, TFLOPS:81.24 (+18.98%)
                          (sdpa): ['0.00790405  ', '-0.02330017 ', '0.00875854  '], time:452.067018ms, TFLOPS:58.51
----------------------------------------------------------------------------------------------------------------------------------
  • 📚 Split Q + Fully Shared QKV SMEM (1/4 SRAM vs FA2)
// Q, K and V fully share the same shared memory, and Q is prefetched from SMEM to
// registers (s2r), improving block occupancy and reducing Q SMEM IO access.
__global__ void // Q, K, V, O -> [B, H, N, D]
flash_attn_mma_stages_split_q_shared_qkv_kernel(half* Q, half* K, half* V, half* O, ...);
  • 📚 Split Q + QK Fine-grained Tiling (O(16xd) SRAM vs FA2 O(4xBrxd) SRAM, Headdim -> 1024)
// Fine-grained tiling at the MMA level for Q and K results in a constant SRAM usage of
// 64 * kMmaAtomK for Q and K. For V, the SRAM complexity is O(kMmaAtomK * d), leading to
// an overall SRAM complexity of O(kMmaAtomK * d). Consequently, this approach allows us to
// extend D (head dimension) up to 1024. Performance tuning is ongoing; stay tuned for updates ~
__global__ void // Q, K, V, O -> [B, H, N, D]
flash_attn_mma_stages_split_q_tiling_qk_kernel(half* Q, half* K, half* V, half* O, ...);
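To make the SRAM figures above concrete, here is a minimal back-of-the-envelope budget for the three layouts, assuming Br = 64, kMmaAtomK = 16 and half-precision storage; the constants and helper names are illustrative, not the kernels' actual launch configuration.

// Rough per-block shared-memory budgets in bytes (illustrative constants only;
// the real kernels pick their tile shapes per architecture and head dimension).
#include <cstdio>

constexpr long long kBr = 64, kMmaAtomK = 16, kBytesPerHalf = 2;

// FA2-style baseline: roughly O(4 * Br * d) half elements resident in SMEM.
constexpr long long fa2_smem(long long d)        { return 4 * kBr * d * kBytesPerHalf; }
// Fully shared QKV: a single Br x d buffer reused for Q, K and V (~1/4 of FA2).
constexpr long long shared_qkv_smem(long long d) { return kBr * d * kBytesPerHalf; }
// QK fine-grained tiling: Q and K slices are each 64 x kMmaAtomK (constant size),
// V staging is kMmaAtomK x d, so the total grows as O(kMmaAtomK * d).
constexpr long long tiling_qk_smem(long long d) {
  return (2 * 64 * kMmaAtomK + kMmaAtomK * d) * kBytesPerHalf;
}

int main() {
  const long long dims[] = {64, 128, 256, 512, 1024};
  for (long long d : dims) {
    printf("d=%4lld  fa2~%7lld B  shared-qkv~%7lld B  tiling-qk~%7lld B\n",
           d, fa2_smem(d), shared_qkv_smem(d), tiling_qk_smem(d));
  }
  return 0;
}

At d=512 this puts tiling-qk at roughly 20 KB of SMEM per block, which is why the head dimension can be pushed toward 1024, while an FA2-style O(4 x Br x d) layout would already exceed the per-SM shared memory at that size.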

Full Changelog: v2.6.10...v2.6.11

📚FA2: QK Fine-grained Tiling

22 Dec 07:58
697e06f

What's Changed

  • [FA2] hotfix flash-attn-mma smem size setting✔️ by @DefTruth in #170
  • [FA2] reorder grid layout, boost 5~10% TFLOPS✔️ by @DefTruth in #171
  • [FA2] optimize block tiling for headdim >= 128✔️ by @DefTruth in #172
  • [FA2] flash-attn-mma tiling-qk for large d⚡️ by @DefTruth in #173
  • [FA2] fix tiling-qk misaligned address✔️ by @DefTruth in #174
  • [README] Refactor README.md✔️ by @DefTruth in #175
  • [README] Refactor README✔️ by @DefTruth in #176

📚 Split Q + QK Fine-grained Tiling (O(16xd) SRAM vs FA2 O(4xBrxd) SRAM, Headdim -> 1024)

// Fine-grained tiling at the MMA level for Q and K results in a constant SRAM usage of
// 64 * kMmaAtomK for Q and K. For V, the SRAM complexity is O(kMmaAtomK * d), leading to
// an overall SRAM complexity of O(kMmaAtomK * d). Consequently, this approach allows us to
// extend D (head dimension) up to 1024. Performance tuning is ongoing; stay tuned for updates ~
__global__ void // Q, K, V, O -> [B, H, N, D]
flash_attn_mma_stages_split_q_tiling_qk_kernel(half* Q, half* K, half* V, half* O, ...);

Full Changelog: v2.6.9...v2.6.10

FA2 Fully Shared QKV SMEM🎉

19 Dec 08:10
4687e1d

What's Changed

  • [FA2] Update flash-attn-mma shared-kv/qkv🎉 by @DefTruth in #163
  • [FA2] Update flash-attn-mma shared-kv/qkv🎉 by @DefTruth in #164
  • [FA2] Update flash-attn-mma shared-qkv🎉 by @DefTruth in #165
  • [FA2] Update flash-attn-mma shared-kv🎉 by @DefTruth in #166
  • [FA2] Update flash-attn-mma split-kv/q🎉 by @DefTruth in #167
  • [FA2] Update flash-attn-mma shared-qkv🎉 by @DefTruth in #168
  • [FA2] flash-attn-mma get rid of transpose-k✔️ by @DefTruth in #169

Full Changelog: v2.6.8...v2.6.9