vulkan: query register count and use it in a better split_k heuristic #12319
base: master
Conversation
I ran a few general tests on an RTX 3090 and found that these changes generally lead to a negative performance delta in the test-backend-ops perf n=512 tests. That isn't by itself an issue; the tests might not be representative.
When testing models with llama-bench, I see significant positive differences in pp for k-quants, but this change is quite negative for the legacy quants.
Any idea what's going on?
Was this with coopmat2 or coopmat1? And was it comparing this PR against master or #12312? I'll try to reproduce it.
Against master, and coopmat2. I tested only the last PR to check the combined difference; I didn't have time to narrow it down further.
Looks like the regression is from #12258, which is a surprise because I had tested Q4_0 on a smaller model. I'll do some more testing, but I may just end up restoring the large tile size to what it was. |
I'm going to update #12258 with these tile sizes:
This restores the performance for pp128,256,512, but there are some slowdowns for odd sizes; those are addressed by the other PRs. Here's what I'm seeing for the Q8_0 model when applied against this PR with all the changes:
Use VK_KHR_pipeline_executable_properties to query the register count, and use that to try to better estimate how many workgroups can fit in the SMs. Particularly with recent tile size changes (ggml-org#12258) the old heuristic is out of date.
This heuristic benefits both coopmat1 and coopmat2 paths on NVIDIA. Would be good if somebody can hook up the missing details for other hardware.
Calling getPipelineExecutableStatisticsKHR required more fully initializing Vulkan-HPP. The steps needed are documented in the Vulkan-HPP readme.
It's less pronounced now, but still very visible for the legacy quants.
In the perf tests the legacy quants are around coopmat1 performance.
@0cc4m please try this diff:
I didn't expect the N/2 and N/4 paths to slow down the N path, but I think with these larger tile sizes we're at the register limit and the additional code perturbs the scheduling in a way that's hurting performance. I was able to see regressions in the directed tests from that change and this resolves it, but I'm still not confident I'm seeing all the same issues on Ada that you are on Ampere. If this doesn't fix it, then it's probably an issue with the split_k heuristic.
That didn't change anything for me. |
I plugged in my RTX 3070 and I've still been unable to reproduce the slowdown. I'm getting very similar perf for the real models, and about a 10% speedup in the backend perf tests for Q4_0 and Q8_0. I tested with driver 572.63 from https://developer.nvidia.com/vulkan-driver. Can you confirm which commit causes the slowdown for you? And which driver are you using?
This is stacked on #12312.
Results for Phi-3-mini-4k-instruct-q4.gguf on RTX 4070: