clip : bring back GPU support #12322
Conversation
@@ -39,8 +39,15 @@ struct clip_image_f32_batch {
    size_t size;
};

CLIP_API struct clip_ctx * clip_model_load    (const char * fname, int verbosity);
CLIP_API struct clip_ctx * clip_model_load_cpu(const char * fname, int verbosity);
Please note that clip_model_load_cpu has no implementation, so I removed it.
Looks good. The GPU backend initialization could be implemented using the backend registry with new_clip->backend = ggml_backend_init_by_type(GGML_BACKEND_DEVICE_TYPE_GPU, nullptr).
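For context, a minimal sketch of that registry-based suggestion, assuming a fallback to the CPU backend when no GPU backend can be initialized (not the exact PR code):

// Pick a GPU backend through the backend registry; no extra params are needed here.
ggml_backend_t backend = ggml_backend_init_by_type(GGML_BACKEND_DEVICE_TYPE_GPU, nullptr);
if (backend == nullptr) {
    // No usable GPU backend found: fall back to the CPU backend.
    backend = ggml_backend_cpu_init();
}
new_clip->backend = backend;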
examples/llava/clip.cpp (outdated)
if (ctx_data) {
    ggml_free(ctx_data);
}
if (ctx_gguf) {
    gguf_free(ctx_gguf);
}
if (buf) {
    ggml_backend_buffer_free(buf);
}
if (backend) {
    ggml_backend_free(backend);
}
if (backend_cpu && backend_cpu != backend) {
    ggml_backend_free(backend_cpu);
}
All of these functions can be safely called with a null pointer.
Fixed in d95c01a. The only check I keep is backend_cpu != backend, to prevent a double free.
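For illustration, the cleanup then reduces to something like the sketch below (assuming the same member names as in the diff above; the exact commit may differ):

// All of these free functions are no-ops on a null pointer, so the explicit
// checks can be dropped; only guard against freeing the same backend twice.
ggml_free(ctx_data);
gguf_free(ctx_gguf);
ggml_backend_buffer_free(buf);
ggml_backend_free(backend);
if (backend_cpu != backend) {
    ggml_backend_free(backend_cpu);
}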
Tested only on Mac as well.
Hey guys, great work as always! Very happy to see GPU support back in clip. Just a friendly bump: for reference, these changes broke some model support according to my tests. I've been trying this now and it seems it breaks both …; MiniCPM works fine. Local logs:
* feat(aio): update AIO image defaults
  cpu:
  - text-to-text: llama3.1
  - embeddings: granite-embeddings
  - vision: moondream2
  gpu/intel:
  - text-to-text: localai-functioncall-qwen2.5-7b-v0.5
  - embeddings: granite-embeddings
  - vision: minicpm
* feat(aio): use minicpm as moondream2 stopped working (ggml-org/llama.cpp#12322 (comment))
Signed-off-by: Ettore Di Giacinto <[email protected]>
Is clip still broken on metal for qwen2vl or has that been fixed too?
@mudler I think the CI is killed due to a timeout, maybe you can try explicitly setting … Unfortunately I don't have the capability to debug right now. It would be nice if someone can look in depth to see what the cause of the problem is.
Mmh, that gave me some hint actually, thank you! I've tested on GPU and it works, while on CPU it fails indeed. Probably related to the fact that it now defaults to using the GPU. I'm fine specifying it by passing a context, but I wonder if it would be safer to switch back to CPU as the default?
Hmm, ok, so that means it will get stuck if there is no GPU at all. I'm not entirely sure why, but I will try to compile llama.cpp without GPU support to see.
Yup, can confirm that it now works locally: I've replaced what I was using here (…) and it now runs successfully!
I'm trying to have a look at how we can better detect whether the device in use is a GPU, for instance along the lines of Line 208 in e0dbec0, but maybe @slaren @ggerganov might have better ideas? I see other spots in this PR reading the number of GPU layers to decide whether we should use the GPU, but AFAICT that seems like something we can probably avoid, as other code is probably capable of detecting the devices in use (but I'm not sure about it, this is my speculation). llama.cpp has so far managed to detect GPU or CPU devices behind the scenes without being explicit about it. For example, I used to specify GPU layers even when running on CPU, and that does not force llama.cpp to offload to the GPU.
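In case it helps the discussion, a hedged sketch of what such a detection could look like using the ggml device registry (the helper name clip_has_gpu_device is hypothetical and this is not code from the PR):

// Return true if any device registered with ggml is a GPU.
static bool clip_has_gpu_device() {
    for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
        ggml_backend_dev_t dev = ggml_backend_dev_get(i);
        if (ggml_backend_dev_type(dev) == GGML_BACKEND_DEVICE_TYPE_GPU) {
            return true;
        }
    }
    return false;
}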
The risk of enabling a GPU backend is that if it does not support the operations, then the weights will still be stored in VRAM and will need to be copied back to the CPU during evaluation, which of course is very slow. On a system without GPUs, the value of …
Please note that the point of having this detection (the case of …). Even with … And also remember that we will soon move away from this …
Right indeed 👍
I think that's the case here: ngl is != 0 and there is no GPU.
Thanks for the heads up!
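To make the case above concrete, one way to combine the two signals discussed here (the n_gpu_layers request and actual device availability) could look like the following sketch, reusing the hypothetical clip_has_gpu_device() helper from the earlier snippet (an illustration only, not the PR's actual logic):

// Only initialize a GPU backend when offload was requested AND a GPU exists;
// otherwise stay on the CPU backend so a CPU-only system does not get stuck.
const bool want_gpu = (n_gpu_layers != 0) && clip_has_gpu_device();
new_clip->backend = want_gpu
    ? ggml_backend_init_by_type(GGML_BACKEND_DEVICE_TYPE_GPU, nullptr)
    : ggml_backend_cpu_init();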
* fix(clip): do not imply GPUs by default
  Until a better solution is found upstream, be conservative and default to GPU.
  ggml-org/llama.cpp#12322
  ggml-org/llama.cpp#12322 (comment)
* allow to override gpu via backend options
Signed-off-by: Ettore Di Giacinto <[email protected]>
* clip : bring back GPU support
* use n_gpu_layers param
* fix double free
* ggml_backend_init_by_type
* clean up
Motivation
Fix #11322
While waiting for #11292, I think it will be beneficial to look back at the clip and llava implementations and improve them a bit, as many downstream projects are depending on this. Some of my ideas:
So in this PR, I target points (2) and (3) in my list above.
Please note that this implementation may not be perfect. My knowledge about ggml's sched is quite outdated, so the current status of this PR is "just works".

How to test this
Download the GGUF from https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf (download text model + mmproj)
cmake --build build -j --target llama-minicpmv-cli
./build/bin/llama-minicpmv-cli -m ../models/minicpmv-Q2_K.gguf --mmproj ../models/minicpmv-mmproj.gguf --image ../models/bliss.png -p "what do you see?"
If you see CLIP using Metal backend, that means it's good:

To disable GPU, set -ngl 0, output will be:

Note: this is only tested on Metal