[CUDA] Enable CUDA Graph on CUDA Toolkit < 12.x #12394

gaugarg-nv · 2025-03-14T15:45:49Z

The signature of `cudaGraphExecUpdate` changed in CTK 12.x, so CUDA graph support was disabled when building with an older CUDA toolkit. This change enables CUDA graph support on CTK < 12.x by calling the older form of the API there.

Performance Gains on CUDA 11.8, RTX 4090

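This PR improves token-generation (tg128) throughput by roughly 34% to 39% on the two models benchmarked below. Prompt-processing (pp512) throughput changes by about 2% or less.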

Master

llama-bench.exe -m DeepSeek-R1-Distill-Qwen-7B-GGUF\DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 7B Q4_K - Medium         |   4.36 GiB |     7.62 B | CUDA       |  99 |         pp512 |     10987.87 ± 29.37 |
| qwen2 7B Q4_K - Medium         |   4.36 GiB |     7.62 B | CUDA       |  99 |         tg128 |        110.47 ± 0.25 |

build: 8fcb5636 (4887)

llama-bench.exe -m DeepSeek-R1-Distill-Llama-8B-GGUF\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |         pp512 |    10345.30 ± 273.76 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |         tg128 |        109.59 ± 0.16 |

build: 8fcb5636 (4887)

This PR

llama-bench.exe -m DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 7B Q4_K - Medium         |   4.36 GiB |     7.62 B | CUDA       |  99 |         pp512 |    10737.57 ± 247.49 |
| qwen2 7B Q4_K - Medium         |   4.36 GiB |     7.62 B | CUDA       |  99 |         tg128 |        153.02 ± 0.18 |

build: fc7f195c (4888)

llama-bench.exe -m DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |         pp512 |     10518.24 ± 45.75 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |         tg128 |        146.70 ± 0.26 |

build: fc7f195c (4888)
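
As a quick check of the generation-phase gain, the tg128 numbers from the tables above can be compared directly. The sketch below is illustrative only; `speedup_percent` and `print_tg128_gains` are hypothetical names, and the throughput values are copied from the benchmark output.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper: relative gain of `after` over `before`, in percent.
inline double speedup_percent(double before, double after) {
    assert(before > 0.0);
    return (after / before - 1.0) * 100.0;
}

// Prints the tg128 gain for both models, using the t/s values reported above.
inline void print_tg128_gains() {
    std::printf("qwen2 7B tg128: %.1f%%\n", speedup_percent(110.47, 153.02));
    std::printf("llama 8B tg128: %.1f%%\n", speedup_percent(109.59, 146.70));
}
```

Calling `print_tg128_gains()` from a `main` function reports a gain of about 38.5% for the Qwen model and about 33.9% for the Llama model, consistent with the "around 35%" summary.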

gaugarg-nv · 2025-03-14T15:47:15Z

FYI @ggerganov @slaren @JohannesGaessler

Enable CUDA Graph on CTK < 12.x

fc7f195

The `cudaGraphExecUpdate` API was changed in CTK 12.x. For this reason, CUDA graph support was disabled on older CUDA toolkits. This change enables CUDA graph support on CTK < 12.x by using the older API there.

github-actions bot added the `Nvidia GPU` (issues specific to Nvidia GPUs) and `ggml` (changes relating to the ggml tensor library for machine learning) labels on Mar 14, 2025

gaugarg-nv changed the title ~~Enable CUDA Graph on CUDA Toolkit < 12.x~~ [CUDA] Enable CUDA Graph on CUDA Toolkit < 12.x Mar 15, 2025
