
clip : bring back GPU support #12322

Merged
merged 5 commits into ggml-org:master on Mar 11, 2025

Conversation

@ngxson (Collaborator) commented Mar 10, 2025

Motivation

Fix #11322

While waiting for #11292, I think it will be beneficial to look back at the clip and llava implementations and improve them a bit, as many downstream projects depend on this.

Some of my ideas:

  1. Format the code to make it look nicer
  2. Use more STL and cpp features to ease the memory management
  3. Optimize the speed

So in this PR, I target points (2) and (3) from the list above.

Please note that this implementation may not be perfect. My knowledge of ggml's sched is quite outdated, so the current status of this PR is "it just works".

How to test this

Download the GGUF from https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf (download text model + mmproj)

cmake --build build -j --target llama-minicpmv-cli

./build/bin/llama-minicpmv-cli -m ../models/minicpmv-Q2_K.gguf --mmproj ../models/minicpmv-mmproj.gguf --image ../models/bliss.png -p "what do you see?"

If you see "CLIP using Metal backend" in the output, it's working correctly:

clip_init: CLIP using Metal backend
...
clip_init:      Metal compute buffer size =   102.80 MiB
clip_init:        CPU compute buffer size =    16.30 MiB
...
encode_image_with_clip: step 1 of 1 encoded in  1120.92 ms
encode_image_with_clip: all 1 segments encoded in  1120.95 ms
encode_image_with_clip: load_image_size 300 241
encode_image_with_clip: image embedding created: 64 tokens

To disable the GPU, set -ngl 0; the output will be:

clip_init: CLIP using CPU backend
...
clip_init:        CPU compute buffer size =   102.80 MiB
...
encode_image_with_clip: step 1 of 1 encoded in  6653.19 ms
encode_image_with_clip: all 1 segments encoded in  6653.21 ms
encode_image_with_clip: load_image_size 300 241
encode_image_with_clip: image embedding created: 64 tokens

Note: this is only tested on Metal

@ngxson marked this pull request as ready for review March 10, 2025 22:15
@ngxson requested review from ggerganov and slaren and removed request for ggerganov March 10, 2025 22:15
@@ -39,8 +39,15 @@ struct clip_image_f32_batch {
size_t size;
};

CLIP_API struct clip_ctx * clip_model_load (const char * fname, int verbosity);
CLIP_API struct clip_ctx * clip_model_load_cpu(const char * fname, int verbosity);
Collaborator Author (@ngxson):
Please note that clip_model_load_cpu has no implementation, so I removed it

@slaren (Member) left a comment:

Looks good. The GPU backend initialization could be implemented using the backend registry with new_clip->backend = ggml_backend_init_by_type(GGML_BACKEND_DEVICE_TYPE_GPU, nullptr).
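
For illustration, a minimal hedged sketch of that suggestion (not the exact code that landed in the PR; new_clip is the clip context being initialized):

    // ask the ggml backend registry for whichever GPU backend was compiled in
    // (Metal, CUDA, Vulkan, ...); this returns nullptr if no GPU backend is available
    new_clip->backend = ggml_backend_init_by_type(GGML_BACKEND_DEVICE_TYPE_GPU, nullptr);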

@ngxson (Collaborator, Author) commented Mar 10, 2025

@slaren Thanks for the clue, I implemented it in 3e45ea4

Comment on lines 644 to 658
if (ctx_data) {
ggml_free(ctx_data);
}
if (ctx_gguf) {
gguf_free(ctx_gguf);
}
if (buf) {
ggml_backend_buffer_free(buf);
}
if (backend) {
ggml_backend_free(backend);
}
if (backend_cpu && backend_cpu != backend) {
ggml_backend_free(backend_cpu);
}
Member:
All of these functions can be safely called with a null pointer.

Collaborator Author (@ngxson):

Fixed in d95c01a; the only check I kept is backend_cpu != backend, to prevent a double free.
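
For reference, a minimal sketch of what the simplified cleanup could look like after that change (variable names taken from the snippet above; the actual commit may differ):

    // all of these free functions accept a null pointer, so the null checks are unnecessary
    ggml_free(ctx_data);
    gguf_free(ctx_gguf);
    ggml_backend_buffer_free(buf);
    ggml_backend_free(backend);
    if (backend_cpu != backend) {
        // only free the CPU backend when it is a distinct object, to avoid a double free
        ggml_backend_free(backend_cpu);
    }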

@ggerganov (Member) left a comment:

Tested only on Mac as well.

@ngxson merged commit 96e1280 into ggml-org:master Mar 11, 2025
47 checks passed
@mudler (Contributor) commented Mar 12, 2025

Hey guys, great work as always! Very happy to see GPU support back in clip. Just a friendly bump for reference: these changes broke support for some models according to my tests.

I've been trying this now, and it seems to break both moondream2 (https://huggingface.co/moondream/moondream2-gguf/) and bakllava (which is tested by the CI in LocalAI, which automatically consumes new versions of llama.cpp: mudler/LocalAI#4996).

MiniCPM works fine.

Local logs:

11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_kv_cache_init: layer 12: n_embd_k_gqa = 2048, n_embd_v_gqa = 2048                                                                                                11:56:25 [16/1945]
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_kv_cache_init: layer 13: n_embd_k_gqa = 2048, n_embd_v_gqa = 2048
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_kv_cache_init: layer 14: n_embd_k_gqa = 2048, n_embd_v_gqa = 2048
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_kv_cache_init: layer 15: n_embd_k_gqa = 2048, n_embd_v_gqa = 2048
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_kv_cache_init: layer 16: n_embd_k_gqa = 2048, n_embd_v_gqa = 2048
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_kv_cache_init: layer 17: n_embd_k_gqa = 2048, n_embd_v_gqa = 2048
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_kv_cache_init: layer 18: n_embd_k_gqa = 2048, n_embd_v_gqa = 2048
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_kv_cache_init: layer 19: n_embd_k_gqa = 2048, n_embd_v_gqa = 2048
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_kv_cache_init: layer 20: n_embd_k_gqa = 2048, n_embd_v_gqa = 2048
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_kv_cache_init: layer 21: n_embd_k_gqa = 2048, n_embd_v_gqa = 2048
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_kv_cache_init: layer 22: n_embd_k_gqa = 2048, n_embd_v_gqa = 2048
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_kv_cache_init: layer 23: n_embd_k_gqa = 2048, n_embd_v_gqa = 2048
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_kv_cache_init:        CPU KV buffer size =   384.00 MiB                                                                                                                            
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_init_from_model: KV self size  =  384.00 MiB, K (f16):  192.00 MiB, V (f16):  192.00 MiB                               
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_init_from_model:        CPU  output buffer size =     0.20 MiB                                                                             
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_init_from_model:        CPU compute buffer size =   160.01 MiB                                                                          
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_init_from_model: graph nodes  = 921
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr llama_init_from_model: graph splits = 1
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stderr common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: model name:   vikhyatk/moondream2
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: description:  image encoder for vikhyatk/moondream2
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: GGUF version: 3
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: alignment:    32
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: n_tensors:    457
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: n_kv:         19
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: ftype:        f16
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout 
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: loaded meta data with 19 key-value pairs and 457 tensors from /home/mudler/_git/LocalAI/models/moondream2-mmproj-f16.gguf
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv   0:                       general.architecture str              = clip
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv   1:                      clip.has_text_encoder bool             = false
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv   2:                    clip.has_vision_encoder bool             = true
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv   3:                   clip.has_llava_projector bool             = true
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv   4:                          general.file_type u32              = 1
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv   5:                               general.name str              = vikhyatk/moondream2
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv   6:                        general.description str              = image encoder for vikhyatk/moondream2
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv   7:                        clip.projector_type str              = mlp
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv   8:                     clip.vision.image_size u32              = 378
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv   9:                     clip.vision.patch_size u32              = 14
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv  10:               clip.vision.embedding_length u32              = 1152
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv  11:            clip.vision.feed_forward_length u32              = 4304
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv  12:                 clip.vision.projection_dim u32              = 2048
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv  13:           clip.vision.attention.head_count u32              = 16
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv  15:                    clip.vision.block_count u32              = 28
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv  16:                     clip.vision.image_mean arr[f32,3]       = [0.500000, 0.500000, 0.500000]
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv  17:                      clip.vision.image_std arr[f32,3]       = [0.500000, 0.500000, 0.500000]
11:56AM INF [llama-cpp] Loads OK
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - kv  18:                              clip.use_gelu bool             = true
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - type  f32:  285 tensors
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: - type  f16:  172 tensors
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_ctx: CLIP using CPU backend
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: text_encoder:   0
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: vision_encoder: 1
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: llava_projector:  1
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: minicpmv_projector:  0
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: minicpmv_version:  2
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: glm_projector:  0
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: model size:     867.61 MB
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: metadata size:  0.16 MB
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init: params backend buffer size =  867.61 MB (457 tensors)
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout clip_init:        CPU compute buffer size =    50.10 MiB
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout {"timestamp":1741776985,"level":"INFO","function":"initialize","line":570,"message":"initializing slots","n_slots":1}
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout {"timestamp":1741776985,"level":"INFO","function":"initialize","line":579,"message":"new slot","slot_id":0,"n_ctx_slot":2048}
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout {"timestamp":1741776985,"level":"INFO","function":"launch_slot_with_data","line":952,"message":"slot is processing task","slot_id":0,"task_id":0}
11:56AM DBG GRPC(moondream2-127.0.0.1:38099): stdout {"timestamp":1741776985,"level":"INFO","function":"update_slots","line":1884,"message":"kv cache rm [p0, end)","slot_id":0,"task_id":0,"p0":0}
[ Process gets killed ]

mudler added a commit to mudler/LocalAI that referenced this pull request Mar 12, 2025
* feat(aio): update AIO image defaults

cpu:
 - text-to-text: llama3.1
 - embeddings: granite-embeddings
 - vision: moondream2

gpu/intel:
 - text-to-text: localai-functioncall-qwen2.5-7b-v0.5
 - embeddings: granite-embeddings
 - vision: minicpm

Signed-off-by: Ettore Di Giacinto <[email protected]>

* feat(aio): use minicpm as moondream2 stopped working

ggml-org/llama.cpp#12322 (comment)

Signed-off-by: Ettore Di Giacinto <[email protected]>

---------

Signed-off-by: Ettore Di Giacinto <[email protected]>
ishaangandhi pushed a commit to ishaangandhi/llama.cpp that referenced this pull request Mar 12, 2025
* clip : bring back GPU support

* use n_gpu_layers param

* fix double free

* ggml_backend_init_by_type

* clean up
@LostRuins (Collaborator):
Is clip still broken on Metal for qwen2vl, or has that been fixed too?

@ngxson (Collaborator, Author) commented Mar 12, 2025

@mudler I think the CI is killed due to a timeout; maybe you can try explicitly setting use_gpu to false via the newly added clip_init call.

Unfortunately I don't have the capacity to debug right now. It would be nice if someone could look in depth to see what the cause of the problem is.

@mudler (Contributor) commented Mar 13, 2025

@mudler I think the CI is killed due to a timeout; maybe you can try explicitly setting use_gpu to false via the newly added clip_init call.

Unfortunately I don't have the capacity to debug right now. It would be nice if someone could look in depth to see what the cause of the problem is.

Hmm, that actually gave me a hint, thank you! I've tested on GPU and it works, while on CPU it indeed fails.

Probably related to the fact that

struct clip_ctx * clip_model_load(const char * fname, const int verbosity = 1) {
    return clip_init(fname, clip_context_params{
        /* use_gpu */   true,
        /* verbosity */ verbosity,
    });
}

now defaults to using the GPU. I'm fine with specifying it by passing a context, but I wonder if it would be safer to switch back to CPU as the default?

@ngxson (Collaborator, Author) commented Mar 13, 2025

Hmm, OK, so that means it will get stuck if there is no GPU at all. I'm not entirely sure why, but I will try compiling llama.cpp without GPU support to see.

use_gpu is supposed to be true by default; my idea was:

  • If there is a GPU, good, use it
  • If there is no GPU, fall back to CPU --> I think this is currently buggy and I will have a look a bit later (or if someone can fix it, please give it a try! see the sketch below)
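
A minimal sketch of that intended behaviour (hedged; not the exact code in clip.cpp, and params.use_gpu refers to the clip_context_params field discussed later in this thread):

    ggml_backend_t backend = nullptr;
    if (params.use_gpu) {
        // try to grab any GPU backend registered with ggml (Metal, CUDA, ...)
        backend = ggml_backend_init_by_type(GGML_BACKEND_DEVICE_TYPE_GPU, nullptr);
    }
    if (backend == nullptr) {
        // no GPU requested or none available: fall back to the CPU backend
        backend = ggml_backend_init_by_type(GGML_BACKEND_DEVICE_TYPE_CPU, nullptr);
    }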

@mudler (Contributor) commented Mar 13, 2025

Yup, I can confirm that it now works locally:

I've replaced what I was using here (clp_ctx = clip_model_load(params.mmproj.c_str(), /*verbosity=*/ 1);) with:

            clp_ctx = clip_init(params.mmproj.c_str(), clip_context_params {
                /* use_gpu */ false,
                /*verbosity=*/ 1,
            });

And it now runs successfully!

I'm trying to have a look at how we can better detect whether the device in use is a GPU, for instance along the lines of

LLAMA_LOG_INFO("%s: using device %s (%s) - %zu MiB free\n", __func__, ggml_backend_dev_name(dev), ggml_backend_dev_description(dev), free/1024/1024);

but maybe @slaren @ggerganov might have better ideas?

I see other spots of this PR reading the number of GPU layers to decide whether we should use the GPU or not, but AFAICT that seems like something we can probably avoid, as other code is probably capable of detecting the devices in use (but I'm not sure about it, this is my speculation). llama.cpp has so far managed to detect GPU or CPU devices behind the scenes without being explicit about it. For example, I usually specify GPU layers even when running on CPU, and that does not force llama.cpp to offload to the GPU.
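
As a hedged illustration of that idea (not code from this PR; assumes ggml-backend.h and <cstdio> are included), one could enumerate the devices known to the ggml backend registry and report which of them are GPUs:

    // list the registered ggml devices and report their type and free memory
    for (size_t i = 0; i < ggml_backend_dev_count(); i++) {
        ggml_backend_dev_t dev = ggml_backend_dev_get(i);
        size_t free, total;
        ggml_backend_dev_memory(dev, &free, &total);
        const bool is_gpu = ggml_backend_dev_type(dev) == GGML_BACKEND_DEVICE_TYPE_GPU;
        printf("device %s (%s): %s, %zu MiB free\n",
               ggml_backend_dev_name(dev), ggml_backend_dev_description(dev),
               is_gpu ? "GPU" : "non-GPU", free / 1024 / 1024);
    }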

mudler added a commit to mudler/LocalAI that referenced this pull request Mar 13, 2025
Until a better solution is found upstream, be conservative and default to CPU.

ggml-org/llama.cpp#12322
ggml-org/llama.cpp#12322 (comment)

Signed-off-by: Ettore Di Giacinto <[email protected]>
@slaren (Member) commented Mar 13, 2025

The risk of enabling a GPU backend is that if it does not support the operations, the weights will still be stored in VRAM and will need to be copied back to the CPU during evaluation, which of course is very slow. On a system without GPUs, the value of use_gpu shouldn't change anything.
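
As a hedged sketch of that point (not code from this PR; gf and dev stand in for a built clip graph and a candidate device), the situation can be anticipated by checking per-op support before choosing a backend:

    // if any op in the graph is unsupported by the device, the scheduler will
    // run it on the CPU and copy the data out of VRAM, which is very slow
    bool all_supported = true;
    for (int i = 0; i < ggml_graph_n_nodes(gf); i++) {
        if (!ggml_backend_dev_supports_op(dev, ggml_graph_node(gf, i))) {
            all_supported = false;
            break;
        }
    }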

@ngxson (Collaborator, Author) commented Mar 13, 2025

I see other spots of this PR reading the number of GPU layers to decide whether we should use the GPU or not, but AFAICT that seems like something we can probably avoid, as other code is probably capable of detecting the devices in use (but I'm not sure about it, this is my speculation). llama.cpp has so far managed to detect GPU or CPU devices behind the scenes without being explicit about it. For example, I usually specify GPU layers even when running on CPU, and that does not force llama.cpp to offload to the GPU.

Please note that the point of having this detection (the ngl == 0 case) is convenience, as a user adding -ngl 0 will often expect the model to run on the CPU.

Even with ngl != 0, if you run it without a GPU it should fall back to the CPU and run normally; I think this is currently buggy and needs to be fixed.

And also remember that we will soon move away from this clip.cpp infrastructure, so I just want to keep things simple here.

@mudler (Contributor) commented Mar 13, 2025

I see other spots of this PR reading the number of GPU layers to decide whether we should use the GPU or not, but AFAICT that seems like something we can probably avoid, as other code is probably capable of detecting the devices in use (but I'm not sure about it, this is my speculation). llama.cpp has so far managed to detect GPU or CPU devices behind the scenes without being explicit about it. For example, I usually specify GPU layers even when running on CPU, and that does not force llama.cpp to offload to the GPU.

Please note that the point of having this detection (the ngl == 0 case) is convenience, as a user adding -ngl 0 will often expect the model to run on the CPU.

Right indeed 👍

Even with ngl != 0, if you run it without a GPU it should fall back to the CPU and run normally; I think this is currently buggy and needs to be fixed.

I think that's the case here: ngl is != 0, there is no GPU, use_gpu is set to true (the default), and it crashes.

And also remember that we will soon move away from this clip.cpp infrastructure, so I just want to keep things simple here.

Thanks for the heads up!

mudler added a commit to mudler/LocalAI that referenced this pull request Mar 13, 2025
* fix(clip): do not imply GPUs by default

Until a better solution is found upstream, be conservative and default to CPU.

ggml-org/llama.cpp#12322
ggml-org/llama.cpp#12322 (comment)

Signed-off-by: Ettore Di Giacinto <[email protected]>

* allow to override gpu via backend options

Signed-off-by: Ettore Di Giacinto <[email protected]>

---------

Signed-off-by: Ettore Di Giacinto <[email protected]>
jpohhhh pushed a commit to Telosnex/llama.cpp that referenced this pull request Mar 14, 2025
* clip : bring back GPU support

* use n_gpu_layers param

* fix double free

* ggml_backend_init_by_type

* clean up
Development

Successfully merging this pull request may close these issues.

Eval bug: clip.cpp has no GPU support - a lot of work is at risk
5 participants