
clip : bring back GPU support #12322

Merged
merged 5 commits into ggml-org:master on Mar 11, 2025

Conversation

@ngxson (Collaborator) commented Mar 10, 2025

Motivation

Fix #11322

While waiting for #11292, I think it will be beneficial to look back at the clip and llava implementations and improve them a bit, as many downstream projects depend on this.

Some of my ideas:

  1. Format the code to make it look nicer
  2. Use more STL and cpp features to ease the memory management
  3. Optimize the speed

So in this PR, I target points (2) and (3) from the list above.

Please note that this implementation may not be perfect. My knowledge of ggml's sched is quite outdated, so the current status of this PR is "it just works".

How to test this

Download the GGUF from https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf (download text model + mmproj)

cmake --build build -j --target llama-minicpmv-cli

./build/bin/llama-minicpmv-cli -m ../models/minicpmv-Q2_K.gguf --mmproj ../models/minicpmv-mmproj.gguf --image ../models/bliss.png -p "what do you see?"

If you see "CLIP using Metal backend" in the output, it's working correctly:

clip_init: CLIP using Metal backend
...
clip_init:      Metal compute buffer size =   102.80 MiB
clip_init:        CPU compute buffer size =    16.30 MiB
...
encode_image_with_clip: step 1 of 1 encoded in  1120.92 ms
encode_image_with_clip: all 1 segments encoded in  1120.95 ms
encode_image_with_clip: load_image_size 300 241
encode_image_with_clip: image embedding created: 64 tokens

To disable the GPU, set -ngl 0; the output will be:

clip_init: CLIP using CPU backend
...
clip_init:        CPU compute buffer size =   102.80 MiB
...
encode_image_with_clip: step 1 of 1 encoded in  6653.19 ms
encode_image_with_clip: all 1 segments encoded in  6653.21 ms
encode_image_with_clip: load_image_size 300 241
encode_image_with_clip: image embedding created: 64 tokens

Note: this is only tested on Metal

@ngxson marked this pull request as ready for review March 10, 2025 22:15
@ngxson requested review from ggerganov and slaren and removed request for ggerganov March 10, 2025 22:15
@@ -39,8 +39,15 @@ struct clip_image_f32_batch {
size_t size;
};

CLIP_API struct clip_ctx * clip_model_load (const char * fname, int verbosity);
CLIP_API struct clip_ctx * clip_model_load_cpu(const char * fname, int verbosity);
Collaborator Author (@ngxson):
Please note that clip_model_load_cpu has no implementation, so I removed it

@slaren (Member) left a comment:

Looks good. The GPU backend initialization could be implemented using the backend registry with new_clip->backend = ggml_backend_init_by_type(GGML_BACKEND_DEVICE_TYPE_GPU, nullptr).
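
For illustration, a minimal hedged sketch of that suggestion (not the exact code that landed in the PR; new_clip is the clip context being initialized):

    // ask the ggml backend registry for whichever GPU backend was compiled in
    // (Metal, CUDA, Vulkan, ...); this returns nullptr if no GPU backend is available
    new_clip->backend = ggml_backend_init_by_type(GGML_BACKEND_DEVICE_TYPE_GPU, nullptr);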

@ngxson (Collaborator, Author) commented Mar 10, 2025

@slaren Thanks for the clue, I implemented it in 3e45ea4

Comment on lines 644 to 658
if (ctx_data) {
ggml_free(ctx_data);
}
if (ctx_gguf) {
gguf_free(ctx_gguf);
}
if (buf) {
ggml_backend_buffer_free(buf);
}
if (backend) {
ggml_backend_free(backend);
}
if (backend_cpu && backend_cpu != backend) {
ggml_backend_free(backend_cpu);
}
Member:
All of these functions can be safely called with a null pointer.

Collaborator Author (@ngxson):

Fixed in d95c01a; the only check I kept is backend_cpu != backend, to prevent a double free.
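
For reference, a minimal sketch of what the simplified cleanup could look like after that change (variable names taken from the snippet above; the actual commit may differ):

    // all of these free functions accept a null pointer, so the null checks are unnecessary
    ggml_free(ctx_data);
    gguf_free(ctx_gguf);
    ggml_backend_buffer_free(buf);
    ggml_backend_free(backend);
    if (backend_cpu != backend) {
        // only free the CPU backend when it is a distinct object, to avoid a double free
        ggml_backend_free(backend_cpu);
    }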

@ggerganov (Member) left a comment:

Tested only on Mac as well.

@ngxson merged commit 96e1280 into ggml-org:master Mar 11, 2025
47 checks passed
@mudler (Contributor) commented Mar 12, 2025

Hey guys, great work as always! Very happy to see GPU support back in clip. Just a friendly bump for reference: these changes broke support for some models according to my tests.

I've been trying this now, and it seems to break both moondream2 (https://huggingface.co/moondream/moondream2-gguf/) and bakllava (which is tested by the CI in LocalAI, which automatically consumes new versions of llama.cpp: mudler/LocalAI#4996).

MiniCPM works fine.

Local logs:

11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_kv_cache_init: layer 12: n_embd_k_gqa = 2048, n_embd_v_gqa = 2048                                                                                                11:56:25 [16/1945]
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_kv_cache_init: layer 13: n_embd_k_gqa = 2048, n_embd_v_gqa = 2048
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_kv_cache_init: layer 14: n_embd_k_gqa = 2048, n_embd_v_gqa = 2048
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_kv_cache_init: layer 15: n_embd_k_gqa = 2048, n_embd_v_gqa = 2048
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_kv_cache_init: layer 16: n_embd_k_gqa = 2048, n_embd_v_gqa = 2048
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_kv_cache_init: layer 17: n_embd_k_gqa = 2048, n_embd_v_gqa = 2048
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_kv_cache_init: layer 18: n_embd_k_gqa = 2048, n_embd_v_gqa = 2048
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_kv_cache_init: layer 19: n_embd_k_gqa = 2048, n_embd_v_gqa = 2048
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_kv_cache_init: layer 20: n_embd_k_gqa = 2048, n_embd_v_gqa = 2048
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_kv_cache_init: layer 21: n_embd_k_gqa = 2048, n_embd_v_gqa = 2048
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_kv_cache_init: layer 22: n_embd_k_gqa = 2048, n_embd_v_gqa = 2048
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_kv_cache_init: layer 23: n_embd_k_gqa = 2048, n_embd_v_gqa = 2048
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_kv_cache_init:        CPU KV buffer size =   384.00 MiB                                                                                                                            
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_init_from_model: KV self size  =  384.00 MiB, K (f16):  192.00 MiB, V (f16):  192.00 MiB                               
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_init_from_model:        CPU  output buffer size =     0.20 MiB                                                                             
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_init_from_model:        CPU compute buffer size =   160.01 MiB                                                                          
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_init_from_model: graph nodes  = 921
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_init_from_model: graph splits = 1
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: model name:   vikhyatk/moondream2
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: description:  image encoder for vikhyatk/moondream2
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: GGUF version: 3
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: alignment:    32
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: n_tensors:    457
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: n_kv:         19
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: ftype:        f16
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout 
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: loaded meta data with 19 key-value pairs and 457 tensors from /home/mudler/_git/LocalAI/models/moondream2-mmproj-f16.gguf
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv   0:                       general.architecture str              = clip
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv   1:                      clip.has_text_encoder bool             = false
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv   2:                    clip.has_vision_encoder bool             = true
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv   3:                   clip.has_llava_projector bool             = true
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv   4:                          general.file_type u32              = 1
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv   5:                               general.name str              = vikhyatk/moondream2
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv   6:                        general.description str              = image encoder for vikhyatk/moondream2
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv   7:                        clip.projector_type str              = mlp
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv   8:                     clip.vision.image_size u32              = 378
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv   9:                     clip.vision.patch_size u32              = 14
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv  10:               clip.vision.embedding_length u32              = 1152
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv  11:            clip.vision.feed_forward_length u32              = 4304
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv  12:                 clip.vision.projection_dim u32              = 2048
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv  13:           clip.vision.attention.head_count u32              = 16
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv  15:                    clip.vision.block_count u32              = 28
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv  16:                     clip.vision.image_mean arr[f32,3]       = [0.500000, 0.500000, 0.500000]
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv  17:                      clip.vision.image_std arr[f32,3]       = [0.500000, 0.500000, 0.500000]
11:56AM INF [llama-cpp] Loads OK
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv  18:                              clip.use_gelu bool             = true
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - type  f32:  285 tensors
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - type  f16:  172 tensors
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_ctx: CLIP using CPU backend
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: text_encoder:   0
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: vision_encoder: 1
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: llava_projector:  1
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: minicpmv_projector:  0
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: minicpmv_version:  2
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: glm_projector:  0
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: model size:     867.61 MB
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: metadata size:  0.16 MB
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: params backend buffer size =  867.61 MB (457 tensors)
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init:        CPU compute buffer size =    50.10 MiB
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout {"timestamp":1741776985,"level":"INFO","function":"initialize","line":570,"message":"initializing slots","n_slots":1}
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout {"timestamp":1741776985,"level":"INFO","function":"initialize","line":579,"message":"new slot","slot_id":0,"n_ctx_slot":2048}
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout {"timestamp":1741776985,"level":"INFO","function":"launch_slot_with_data","line":952,"message":"slot is processing task","slot_id":0,"task_id":0}
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout {"timestamp":1741776985,"level":"INFO","function":"update_slots","line":1884,"message":"kv cache rm [p0, end)","slot_id":0,"task_id":0,"p0":0}
[ Process gets killed ]

mudler added a commit to mudler/LocalAI that referenced this pull request Mar 12, 2025
* feat(aio): update AIO image defaults

cpu:
 - text-to-text: llama3.1
 - embeddings: granite-embeddings
 - vision: moondream2

gpu/intel:
 - text-to-text: localai-functioncall-qwen2.5-7b-v0.5
 - embeddings: granite-embeddings
 - vision: minicpm

Signed-off-by: Ettore Di Giacinto <[email protected]>

* feat(aio): use minicpm as moondream2 stopped working

ggml-org/llama.cpp#12322 (comment)

Signed-off-by: Ettore Di Giacinto <[email protected]>

---------

Signed-off-by: Ettore Di Giacinto <[email protected]>
ishaangandhi pushed a commit to ishaangandhi/llama.cpp that referenced this pull request Mar 12, 2025
* clip : bring back GPU support

* use n_gpu_layers param

* fix double free

* ggml_backend_init_by_type

* clean up
@LostRuins (Collaborator):
Is clip still broken on Metal for qwen2vl, or has that been fixed too?

@ngxson (Collaborator, Author) commented Mar 12, 2025

@mudler I think the CI is killed due to a timeout; maybe you can try explicitly setting use_gpu to false via the newly added clip_init call.

Unfortunately I don't have the capacity to debug right now. It would be nice if someone could look in depth to see what the cause of the problem is.

@mudler (Contributor) commented Mar 13, 2025

@mudler I think the CI is killed due to a timeout; maybe you can try explicitly setting use_gpu to false via the newly added clip_init call.

Unfortunately I don't have the capacity to debug right now. It would be nice if someone could look in depth to see what the cause of the problem is.

Hmm, that actually gave me a hint, thank you! I've tested on GPU and it works, while on CPU it indeed fails.

Probably related to the fact that

struct clip_ctx * clip_model_load(const char * fname, const int verbosity = 1) {
    return clip_init(fname, clip_context_params{
        /* use_gpu */   true,
        /* verbosity */ verbosity,
    });
}

now defaults to using the GPU. I'm fine with specifying it by passing a context, but I wonder if it would be safer to switch back to CPU as the default?

@ngxson (Collaborator, Author) commented Mar 13, 2025

Hmm, OK, so that means it will get stuck if there is no GPU at all. I'm not entirely sure why, but I will try compiling llama.cpp without GPU support to see.

use_gpu is supposed to be true by default; my idea was:

  • If there is a GPU, good, use it
  • If there is no GPU, fall back to CPU --> I think this is currently buggy and I will have a look a bit later (or if someone can fix it, please give it a try! see the sketch below)
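
A minimal sketch of that intended behaviour (hedged; not the exact code in clip.cpp, and params.use_gpu refers to the clip_context_params field discussed later in this thread):

    ggml_backend_t backend = nullptr;
    if (params.use_gpu) {
        // try to grab any GPU backend registered with ggml (Metal, CUDA, ...)
        backend = ggml_backend_init_by_type(GGML_BACKEND_DEVICE_TYPE_GPU, nullptr);
    }
    if (backend == nullptr) {
        // no GPU requested or none available: fall back to the CPU backend
        backend = ggml_backend_init_by_type(GGML_BACKEND_DEVICE_TYPE_CPU, nullptr);
    }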

@mudler (Contributor) commented Mar 13, 2025

Yup, I can confirm that it now works locally:

I've replaced what I was using here (clp_ctx = clip_model_load(params.mmproj.c_str(), /*verbosity=*/ 1);) with:

            clp_ctx = clip_init(params.mmproj.c_str(), clip_context_params {
                /* use_gpu */ false,
                /*verbosity=*/ 1,
            });

And it now runs successfully!

I'm trying to have a look at how we can better detect whether the device in use is a GPU, for instance along the lines of

LLAMA_LOG_INFO("%s: using device %s (%s) - %zu MiB free\n", __func__, ggml_backend_dev_name(dev), ggml_backend_dev_description(dev), free/1024/1024);

but maybe @slaren @ggerganov might have better ideas?

I see other spots of this PR reading the number of GPU layers to decide whether we should use the GPU or not, but AFAICT that seems like something we can probably avoid, as other code is probably capable of detecting the devices in use (but I'm not sure about it, this is my speculation). llama.cpp has so far managed to detect GPU or CPU devices behind the scenes without being explicit about it. For example, I usually specify GPU layers even when running on CPU, and that does not force llama.cpp to offload to the GPU.
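
As a hedged illustration of that idea (not code from this PR; assumes ggml-backend.h and <cstdio> are included), one could enumerate the devices known to the ggml backend registry and report which of them are GPUs:

    // list the registered ggml devices and report their type and free memory
    for (size_t i = 0; i < ggml_backend_dev_count(); i++) {
        ggml_backend_dev_t dev = ggml_backend_dev_get(i);
        size_t free, total;
        ggml_backend_dev_memory(dev, &free, &total);
        const bool is_gpu = ggml_backend_dev_type(dev) == GGML_BACKEND_DEVICE_TYPE_GPU;
        printf("device %s (%s): %s, %zu MiB free\n",
               ggml_backend_dev_name(dev), ggml_backend_dev_description(dev),
               is_gpu ? "GPU" : "non-GPU", free / 1024 / 1024);
    }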

mudler added a commit to mudler/LocalAI that referenced this pull request Mar 13, 2025
Until a better solution is found upstream, be conservative and default to CPU.

ggml-org/llama.cpp#12322
ggml-org/llama.cpp#12322 (comment)

Signed-off-by: Ettore Di Giacinto <[email protected]>
@slaren (Member) commented Mar 13, 2025

The risk of enabling a GPU backend is that if it does not support the operations, the weights will still be stored in VRAM and will need to be copied back to the CPU during evaluation, which of course is very slow. On a system without GPUs, the value of use_gpu shouldn't change anything.
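
As a hedged sketch of that point (not code from this PR; gf and dev stand in for a built clip graph and a candidate device), the situation can be anticipated by checking per-op support before choosing a backend:

    // if any op in the graph is unsupported by the device, the scheduler will
    // run it on the CPU and copy the data out of VRAM, which is very slow
    bool all_supported = true;
    for (int i = 0; i < ggml_graph_n_nodes(gf); i++) {
        if (!ggml_backend_dev_supports_op(dev, ggml_graph_node(gf, i))) {
            all_supported = false;
            break;
        }
    }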

@ngxson (Collaborator, Author) commented Mar 13, 2025

I see other spots of this PR reading the number of GPU layers to decide whether we should use the GPU or not, but AFAICT that seems like something we can probably avoid, as other code is probably capable of detecting the devices in use (but I'm not sure about it, this is my speculation). llama.cpp has so far managed to detect GPU or CPU devices behind the scenes without being explicit about it. For example, I usually specify GPU layers even when running on CPU, and that does not force llama.cpp to offload to the GPU.

Please note that the point of having this detection (the ngl == 0 case) is convenience, as a user adding -ngl 0 will often expect the model to run on the CPU.

Even with ngl != 0, if you run it without a GPU it should fall back to the CPU and run normally; I think this is currently buggy and needs to be fixed.

And also remember that we will soon move away from this clip.cpp infrastructure, so I just want to keep things simple here.

@mudler (Contributor) commented Mar 13, 2025

I see other spots of this PR reading the number of GPU layers to decide whether we should use the GPU or not, but AFAICT that seems like something we can probably avoid, as other code is probably capable of detecting the devices in use (but I'm not sure about it, this is my speculation). llama.cpp has so far managed to detect GPU or CPU devices behind the scenes without being explicit about it. For example, I usually specify GPU layers even when running on CPU, and that does not force llama.cpp to offload to the GPU.

Please note that the point of having this detection (the ngl == 0 case) is convenience, as a user adding -ngl 0 will often expect the model to run on the CPU.

Right indeed 👍

Even with ngl != 0, if you run it without a GPU it should fall back to the CPU and run normally; I think this is currently buggy and needs to be fixed.

I think that's the case here: ngl is != 0, there is no GPU, use_gpu is set to true (the default), and it crashes.

And also remember that we will soon move away from this clip.cpp infrastructure, so I just want to keep things simple here.

Thanks for the heads up!

mudler added a commit to mudler/LocalAI that referenced this pull request Mar 13, 2025
* fix(clip): do not imply GPUs by default

Until a better solution is found upstream, be conservative and default to CPU.

ggml-org/llama.cpp#12322
ggml-org/llama.cpp#12322 (comment)

Signed-off-by: Ettore Di Giacinto <[email protected]>

* allow to override gpu via backend options

Signed-off-by: Ettore Di Giacinto <[email protected]>

---------

Signed-off-by: Ettore Di Giacinto <[email protected]>
jpohhhh pushed a commit to Telosnex/llama.cpp that referenced this pull request Mar 14, 2025
* clip : bring back GPU support

* use n_gpu_layers param

* fix double free

* ggml_backend_init_by_type

* clean up
Development

Successfully merging this pull request may close these issues.

Eval bug: clip.cpp has no GPU support - a lot of work is at risk
5 participants