Experiments with GPU CUDA acceleration...sort of #220
Comments
It would be a big development if solved. Thanks
I have spotted in the documentation an alternative to intercepting OpenBLAS calls with the LD_PRELOAD environment variable: linking the NVBLAS library ahead of OpenBLAS. @ggerganov, can you please advise how to link the shared library libnvblas.so so that it is linked before OpenBLAS on the command line? Also, I'm not sure where to apply this: ggml.c, whisper.cpp or main.cpp? Any help would be appreciated.
For that LD_PRELOAD trick, maybe that directory also needs to be added to LD_LIBRARY_PATH? Btw the different library names that you see are mostly symlinked together, so it shouldn't matter much which one you choose.
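To put those pieces together in one place, a run might look like this (a sketch only: the paths follow the examples in this thread and will differ per system, and it has not been confirmed that LD_LIBRARY_PATH is actually required):

export NVBLAS_CONFIG_FILE=/etc/nvblas.conf
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so
./main -m models/ggml-large.bin samples/jfk.wav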
@Topping1 You got everything correct, except it seems that the cblas_ calls are not intercepted by libnvblas. Instead we have to use the native Fortran BLAS API. Initial demonstration is available on the nvblas branch, so make sure to checkout the branch and rebuild.
Here is a comparison on a machine with GeForce GTX 1660, running the large model on jfk.wav:
- Without libnvblas:
./bin/main -m ../models/ggml-large.bin ../samples/jfk.wav
whisper_model_load: loading model from '../models/ggml-large.bin'
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1280
whisper_model_load: n_text_head = 20
whisper_model_load: n_text_layer = 32
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 5
whisper_model_load: adding 1608 extra tokens
whisper_model_load: mem_required = 4712.00 MB
whisper_model_load: ggml ctx size = 2950.97 MB
whisper_model_load: memory size = 304.38 MB
whisper_model_load: model size = 2950.66 MB
system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 |
main: processing '../samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:11.000] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
whisper_print_timings: load time = 1538.72 ms
whisper_print_timings: mel time = 53.92 ms
whisper_print_timings: sample time = 8.39 ms
whisper_print_timings: encode time = 12416.90 ms / 388.03 ms per layer
whisper_print_timings: decode time = 1605.45 ms / 50.17 ms per layer
whisper_print_timings: total time = 15623.76 ms
- With libnvblas (using the LD_PRELOAD trick):
NVBLAS_CONFIG_FILE=/etc/nvblas.conf LD_PRELOAD=/usr/local/cuda-11.8/targets/x86_64-linux/lib/libnvblas.so ./bin/main -m ../models/ggml-large.bin ../samples/jfk.wav
whisper_model_load: loading model from '../models/ggml-large.bin'
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1280
whisper_model_load: n_text_head = 20
whisper_model_load: n_text_layer = 32
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 5
whisper_model_load: adding 1608 extra tokens
whisper_model_load: mem_required = 4712.00 MB
whisper_model_load: ggml ctx size = 2950.97 MB
whisper_model_load: memory size = 304.38 MB
whisper_model_load: model size = 2950.66 MB
system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 |
main: processing '../samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[NVBLAS] NVBLAS_CONFIG_FILE environment variable is set to '/etc/nvblas.conf'
[00:00:00.000 --> 00:00:11.000] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
whisper_print_timings: load time = 1535.95 ms
whisper_print_timings: mel time = 52.52 ms
whisper_print_timings: sample time = 7.75 ms
whisper_print_timings: encode time = 7362.63 ms / 230.08 ms per layer
whisper_print_timings: decode time = 1535.42 ms / 47.98 ms per layer
whisper_print_timings: total time = 10494.65 ms
This shows that the Encoder is almost x2 faster - 12416.90 ms vs 7362.63 ms.
My /etc/nvblas.conf looks like this:
NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB /usr/lib/x86_64-linux-gnu/libopenblas.so
NVBLAS_GPU_LIST ALL
NVBLAS_TRACE_LOG_ENABLED
So I think with some fine-tuning and porting the appropriate matrix multiplications to libnvblas we can get decent GPU support. Likely, it will not be optimal, but hopefully decent. Thanks for this idea!
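To make the "native Fortran BLAS API" point concrete, here is a small self-contained sketch (not code from the nvblas branch) contrasting the CBLAS wrapper that ggml.c calls today with the Fortran-style sgemm_ symbol that NVBLAS interposes. The build command and exact prototype are assumptions for a default 32-bit-integer OpenBLAS install:

/* Hedged sketch: cblas_sgemm (reportedly not intercepted) vs the Fortran
   sgemm_ symbol (the one libnvblas interposes when preloaded).
   Assumed build: gcc demo.c -lopenblas -o demo */
#include <stdio.h>
#include <cblas.h>

/* Fortran BLAS entry point: column-major layout, every argument by pointer */
extern void sgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const float *alpha, const float *a, const int *lda,
                   const float *b, const int *ldb,
                   const float *beta, float *c, const int *ldc);

int main(void) {
    int m = 2, n = 2, k = 2;
    float alpha = 1.0f, beta = 0.0f;
    float A[4] = {1, 2, 3, 4}, B[4] = {5, 6, 7, 8}, C[4] = {0};

    /* CBLAS wrapper, row-major: the style of call ggml.c makes when BLAS is
       enabled; per the discussion above, NVBLAS does not route this one */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, alpha, A, k, B, n, beta, C, n);
    printf("cblas_sgemm: C[0][0] = %.1f\n", C[0]);

    /* Fortran-style call, column-major: the symbol NVBLAS can intercept,
       so data has to be passed (or treated) as column-major */
    sgemm_("N", "N", &m, &n, &k, &alpha, A, &m, B, &k, &beta, C, &m);
    printf("sgemm_:      C[0][0] = %.1f\n", C[0]);
    return 0;
}

The point is only the calling convention: when libnvblas is preloaded, calls entering through the second form are the ones it can route to the GPU, which is presumably why the nvblas branch switches to that API.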
Any Windows executable soon for this?
@RYucel After reading some documentation, it seems that this nvblas trick is only applicable to Linux. I think it can be implemented for Windows but with more changes to the code.
@ggerganov thanks very much for your efforts! Basically you have to add the line. I ran it and the results are:
and with this nvblas.conf:
results are:
2176.08 ms vs 3044.64 ms encode time... not bad at all. On a related note, I think CLBlast has a similar way to call the matrix multiplication functions, and in this case the hardware acceleration would be via OpenCL. One downside is that you have to "tune" the installation for your particular GPU to get any decent speedup. I believe there was a mention here: #173. I got as far as optimizing for my GPU but there were errors installing the library. I will try again to see what the performance difference is compared with nvblas.
I tried using it, but in any case, I like this approach because it is very un-intrusive. Thinking about the best way to integrate it in the build. Using the LD_PRELOAD trick is not very convenient - probably better to link directly to libnvblas.
Strange that you don't get any speedup, but I guess it depends on the specific setup. I agree that linking directly to the libnvblas library would be better.
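A rough sketch of what that direct linking could look like (the Makefile variable, library directory and exact flags are illustrative assumptions, not a tested change; the one firm point, raised earlier in the thread, is that libnvblas has to come before OpenBLAS on the link line):

LDFLAGS += -L/usr/local/cuda/lib64 -lnvblas -lopenblas

make clean
make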
How can I run Whisper on GPU on Windows 11 (CUDA)?
If you are interested in GPU acceleration with minimal code changes, I suggest you try CLBlast: https://github.com/CNugteren/CLBlast (and a nice presentation by its author: https://cnugteren.github.io/downloads/CLBlastIWOCL18.pdf). All I had to do to use it in Whisper.cpp was:
- Replace the openblas include with the CLBlast one.
- Change the Makefile to link with it.
- And on my Ubuntu laptop, install some Intel driver.
I also had to install CLBlast from source with the Netlib bindings enabled (a rough sketch of what these changes might look like follows this comment).
With Intel Iris XE integrated graphics, I could match the performance of 8-core openblas with just one active core + GPU. With 4 cores, I got about 2x better performance. I am curious about what you would get with an Nvidia GPU? And this is with the "netlib" bindings of CLBlast. As the author points out (https://github.com/CNugteren/CLBlast/blob/master/doc/bindings.md) it's for people who don't want to touch OpenCL, but it comes with severe performance loss... Presumably the batched versions of gemm and proper buffer handling could be much faster.
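Since the exact lines did not survive in this copy of the thread, here is a rough reconstruction of the kind of change described above; the header, package and CMake option names come from CLBlast's documentation and are assumptions about this setup, not the poster's actual patch:

In the source, swap the BLAS header for CLBlast's Netlib-compatible one:
// #include <cblas.h>
#include <clblast_netlib_c.h>

In the Makefile, link against CLBlast (and OpenCL) instead of OpenBLAS:
LDFLAGS += -lclblast -lOpenCL

On Ubuntu, install an OpenCL driver for the Intel GPU, for example:
sudo apt install intel-opencl-icd

And build CLBlast from source with the Netlib bindings enabled:
cmake -DNETLIB=ON .. && make && sudo make install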
@Benoit9, yep I gave CLBlast a try (very briefly) with my old GTX660 a while back. |
Any more updates to this? Running it on GPU would be very cool.
Would this work on older Nvidia GPUs? For instance, one that throws errors with the PyTorch version of Whisper.
It would be nice to have an implementation that also works on AMD, like HERE. It's fast and uses only 4GB with the large model on a RX 6800XT. I'm worried this repo won't be maintained though.
Does this still work in the current master branch? Inference on my CPU is quite slow, about 0.5 tokens/s on an AMD 3900X with the 7B model.
I found in this ticket what I was talking about here: #713
Anyway I have also tested with CLBlast compiled locally and I get an exception inside CLBlast when I try to transcribe a simple wav file:
vricosti@iMac ~/Dev/Perso/jarvis/whisper.cpp/build/bin/Release
system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
main: processing '/Users/vricosti/Dev/Perso/jarvis/jarvis-open-ai/20230403_180429.wav' (80000 samples, 5.0 sec), 8 threads, 1 processors, lang = fr, task = transcribe, timestamps = 1 ...
CLBlast: OpenCL error: clBuildProgram: -11
And I am not very lucky with the stream app, because sometimes it can recognize some words while my small custom app in Python recognizes a lot more words, so something is weird.
Anyone know about this: https://github.com/ROCm-Developer-Tools/HIP
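For what it's worth, OpenCL error -11 is CL_BUILD_PROGRAM_FAILURE, i.e. the driver refused to compile a kernel (here one of CLBlast's), which usually points at the OpenCL driver or device rather than whisper.cpp itself. The standard way to see the compiler's message is to read the program build log; a generic helper using the plain OpenCL API (not something exposed by whisper.cpp or CLBlast's netlib bindings) looks like this, with the include being <OpenCL/opencl.h> on macOS:

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Print the device compiler output after clBuildProgram fails with -11. */
static void print_build_log(cl_program program, cl_device_id device) {
    size_t len = 0;
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &len);
    char *log = malloc(len + 1);
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, len, log, NULL);
    log[len] = '\0';
    fprintf(stderr, "OpenCL build log:\n%s\n", log);
    free(log);
}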
The CUDA toolkit documentation link states that NVBLAS is a drop-in BLAS replacement.
It also states: "The NVBLAS Library is a GPU-accelerated Library that implements BLAS (Basic Linear Algebra Subprograms). It can accelerate most BLAS Level-3 routines by dynamically routing BLAS calls to one or more NVIDIA GPUs present in the system, when the characteristics of the call make it speed up on a GPU." One of those Level-3 routines is sgemm (matrix multiplication), which is used extensively by ggml.c.
In theory, IF CORRECTLY CONFIGURED, NVBLAS can intercept the calls to the OpenBLAS function cblas_sgemm and accelerate them using a CUDA-compatible graphics card installed in the system.
There is not much information about the specific steps to enable it, but I could piece together this step-by-step guide:
1-Install the CUDA toolkit from the official link: https://developer.nvidia.com/cuda-downloads
2-Create the file /etc/nvblas.conf with the following contents:
NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB /usr/lib/x86_64-linux-gnu/libopenblas.so
NVBLAS_GPU_LIST ALL
/usr/lib/x86_64-linux-gnu/libopenblas.so is the location of libopenblas.so on my system; you have to point it to the correct location (it should not be that different). A note at the end of this post shows a quick way to locate it.
3-Create an environment variable pointing to nvblas.conf:
export NVBLAS_CONFIG_FILE=/etc/nvblas.conf
4-Create an environment variable pointing to the location of libnvblas.so:
export LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so.11
Here it is not clear which .so file is needed. For example, on my system I can find the following:
/usr/local/cuda/lib64/libnvblas.so
/usr/local/cuda/lib64/libnvblas.so.11
/usr/local/cuda/lib64/libnvblas.so.11.11.3.6
/usr/local/cuda-11.8/lib64/libnvblas.so
/usr/local/cuda-11.8/lib64/libnvblas.so.11
/usr/local/cuda-11.8/lib64/libnvblas.so.11.11.3.6
5-Download the source code of whisper.cpp with:
git clone https://github.com/ggerganov/whisper.cpp
6-Inside the whisper.cpp folder, execute
cmake -DWHISPER_SUPPORT_OPENBLAS=ON .
7-Inside the whisper.cpp folder, execute
make
You should now have a compiled main executable with BLAS support turned on.
8-Now, at least in my case, when I run a test transcription, the program confirms that it is using BLAS (BLAS = 1), but NVBLAS does not seem to be intercepting the calls. NVTOP does not show any GPU usage and no nvblas.log is created.
If someone can figure out how to make this work, it has the potential to substantially accelerate transcription speed on x64.
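Two small additions that may help with steps 2 and 8 (standard tooling plus a setting that appears in the nvblas.conf shown earlier in the thread; the exact paths are system-dependent). To locate the CPU BLAS library for NVBLAS_CPU_BLAS_LIB:

ldconfig -p | grep libopenblas

And enabling trace logging in /etc/nvblas.conf makes every intercepted BLAS call show up in nvblas.log, which is an easy way to confirm whether interception is happening at all:

NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB /usr/lib/x86_64-linux-gnu/libopenblas.so
NVBLAS_GPU_LIST ALL
NVBLAS_TRACE_LOG_ENABLED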