Experiments with GPU CUDA acceleration...sort of #220
Comments
It would be a big development if solved. Thanks
I have spotted in the documentation an alternative to intercepting OpenBLAS calls with the LD_PRELOAD environment variable: linking the NVBLAS library ahead of OpenBLAS. @ggerganov, can you please advise how to link the shared library libnvblas.so so that it is linked before OpenBLAS on the command line? Also, I'm not sure where to apply this: ggml.c, whisper.cpp or main.cpp? Any help would be appreciated.
For that LD_PRELOAD trick, maybe that directory also needs to be added to LD_LIBRARY_PATH? Btw the different library names that you see are mostly symlinked together, so it shouldn't matter much which one you choose.
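To put those pieces together in one place, a run might look like this (a sketch only: the paths follow the examples in this thread and will differ per system, and it has not been confirmed that LD_LIBRARY_PATH is actually required):

export NVBLAS_CONFIG_FILE=/etc/nvblas.conf
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so
./main -m models/ggml-large.bin samples/jfk.wav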
@Topping1 You got everything correct, except it seems that the cblas_ calls are not intercepted by libnvblas. Instead we have to use the native Fortran BLAS API. Initial demonstration is available on the nvblas branch, so make sure to checkout the branch and rebuild.
Here is a comparison on a machine with GeForce GTX 1660, running the large model on jfk.wav:
- Without libnvblas:
./bin/main -m ../models/ggml-large.bin ../samples/jfk.wav
whisper_model_load: loading model from '../models/ggml-large.bin'
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1280
whisper_model_load: n_text_head = 20
whisper_model_load: n_text_layer = 32
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 5
whisper_model_load: adding 1608 extra tokens
whisper_model_load: mem_required = 4712.00 MB
whisper_model_load: ggml ctx size = 2950.97 MB
whisper_model_load: memory size = 304.38 MB
whisper_model_load: model size = 2950.66 MB
system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 |
main: processing '../samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:11.000] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
whisper_print_timings: load time = 1538.72 ms
whisper_print_timings: mel time = 53.92 ms
whisper_print_timings: sample time = 8.39 ms
whisper_print_timings: encode time = 12416.90 ms / 388.03 ms per layer
whisper_print_timings: decode time = 1605.45 ms / 50.17 ms per layer
whisper_print_timings: total time = 15623.76 ms
- With libnvblas (using the LD_PRELOAD trick):
NVBLAS_CONFIG_FILE=/etc/nvblas.conf LD_PRELOAD=/usr/local/cuda-11.8/targets/x86_64-linux/lib/libnvblas.so ./bin/main -m ../models/ggml-large.bin ../samples/jfk.wav
whisper_model_load: loading model from '../models/ggml-large.bin'
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1280
whisper_model_load: n_text_head = 20
whisper_model_load: n_text_layer = 32
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 5
whisper_model_load: adding 1608 extra tokens
whisper_model_load: mem_required = 4712.00 MB
whisper_model_load: ggml ctx size = 2950.97 MB
whisper_model_load: memory size = 304.38 MB
whisper_model_load: model size = 2950.66 MB
system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 |
main: processing '../samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[NVBLAS] NVBLAS_CONFIG_FILE environment variable is set to '/etc/nvblas.conf'
[00:00:00.000 --> 00:00:11.000] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
whisper_print_timings: load time = 1535.95 ms
whisper_print_timings: mel time = 52.52 ms
whisper_print_timings: sample time = 7.75 ms
whisper_print_timings: encode time = 7362.63 ms / 230.08 ms per layer
whisper_print_timings: decode time = 1535.42 ms / 47.98 ms per layer
whisper_print_timings: total time = 10494.65 ms
This shows that the Encoder is almost x2 faster - 12416.90 ms vs 7362.63 ms.
My /etc/nvblas.conf looks like this:
NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB /usr/lib/x86_64-linux-gnu/libopenblas.so
NVBLAS_GPU_LIST ALL
NVBLAS_TRACE_LOG_ENABLED
So I think with some fine-tuning and porting the appropriate matrix multiplications to libnvblas we can get decent GPU support. Likely, it will not be optimal, but hopefully decent. Thanks for this idea!
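To make the "native Fortran BLAS API" point concrete, here is a small self-contained sketch (not code from the nvblas branch) contrasting the CBLAS wrapper that ggml.c calls today with the Fortran-style sgemm_ symbol that NVBLAS interposes. The build command and exact prototype are assumptions for a default 32-bit-integer OpenBLAS install:

/* Hedged sketch: cblas_sgemm (reportedly not intercepted) vs the Fortran
   sgemm_ symbol (the one libnvblas interposes when preloaded).
   Assumed build: gcc demo.c -lopenblas -o demo */
#include <stdio.h>
#include <cblas.h>

/* Fortran BLAS entry point: column-major layout, every argument by pointer */
extern void sgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const float *alpha, const float *a, const int *lda,
                   const float *b, const int *ldb,
                   const float *beta, float *c, const int *ldc);

int main(void) {
    int m = 2, n = 2, k = 2;
    float alpha = 1.0f, beta = 0.0f;
    float A[4] = {1, 2, 3, 4}, B[4] = {5, 6, 7, 8}, C[4] = {0};

    /* CBLAS wrapper, row-major: the style of call ggml.c makes when BLAS is
       enabled; per the discussion above, NVBLAS does not route this one */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, alpha, A, k, B, n, beta, C, n);
    printf("cblas_sgemm: C[0][0] = %.1f\n", C[0]);

    /* Fortran-style call, column-major: the symbol NVBLAS can intercept,
       so data has to be passed (or treated) as column-major */
    sgemm_("N", "N", &m, &n, &k, &alpha, A, &m, B, &k, &beta, C, &m);
    printf("sgemm_:      C[0][0] = %.1f\n", C[0]);
    return 0;
}

The point is only the calling convention: when libnvblas is preloaded, calls entering through the second form are the ones it can route to the GPU, which is presumably why the nvblas branch switches to that API.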
Any Windows executable soon for this?
@RYucel After reading some documentation, it seems that this nvblas trick is only applicable to Linux. I think it can be implemented for Windows but with more changes to the code.
@ggerganov thanks very much for your efforts! Basically you have to add the line. I ran it and the results are:
and with this nvblas.conf:
results are:
2176.08 ms vs 3044.64 ms encode time... not bad at all. On a related note, I think CLBlast has a similar way to call the matrix multiplication functions, and in this case the hardware acceleration would be via OpenCL. One downside is that you have to "tune" the installation for your particular GPU to get any decent speedup. I believe there was a mention here: #173. I got as far as optimizing for my GPU but there were errors installing the library. I will try again to see what the performance difference is compared with nvblas.
I tried using it, but in any case, I like this approach because it is very un-intrusive. Thinking about the best way to integrate it in the build. Using the LD_PRELOAD trick is not very convenient - probably better to link directly to libnvblas.
Strange that you don't get any speedup, but I guess it depends on the specific setup. I agree that linking directly to the libnvblas library would be better.
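A rough sketch of what that direct linking could look like (the Makefile variable, library directory and exact flags are illustrative assumptions, not a tested change; the one firm point, raised earlier in the thread, is that libnvblas has to come before OpenBLAS on the link line):

LDFLAGS += -L/usr/local/cuda/lib64 -lnvblas -lopenblas

make clean
make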
How can I run Whisper on GPU on Windows 11 (CUDA)?
If you are interested in GPU acceleration with minimal code changes, I suggest you try CLBlast: https://github.com/CNugteren/CLBlast (and a nice presentation by its author: https://cnugteren.github.io/downloads/CLBlastIWOCL18.pdf). All I had to do to use it in Whisper.cpp was:
- Replace the openblas include with the CLBlast one.
- Change the Makefile to link with it.
- And on my Ubuntu laptop, install some Intel driver.
I also had to install CLBlast from source with the Netlib bindings enabled (a rough sketch of what these changes might look like follows this comment).
With Intel Iris XE integrated graphics, I could match the performance of 8-core openblas with just one active core + GPU. With 4 cores, I got about 2x better performance. I am curious about what you would get with an Nvidia GPU? And this is with the "netlib" bindings of CLBlast. As the author points out (https://github.com/CNugteren/CLBlast/blob/master/doc/bindings.md) it's for people who don't want to touch OpenCL, but it comes with severe performance loss... Presumably the batched versions of gemm and proper buffer handling could be much faster.
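Since the exact lines did not survive in this copy of the thread, here is a rough reconstruction of the kind of change described above; the header, package and CMake option names come from CLBlast's documentation and are assumptions about this setup, not the poster's actual patch:

In the source, swap the BLAS header for CLBlast's Netlib-compatible one:
// #include <cblas.h>
#include <clblast_netlib_c.h>

In the Makefile, link against CLBlast (and OpenCL) instead of OpenBLAS:
LDFLAGS += -lclblast -lOpenCL

On Ubuntu, install an OpenCL driver for the Intel GPU, for example:
sudo apt install intel-opencl-icd

And build CLBlast from source with the Netlib bindings enabled:
cmake -DNETLIB=ON .. && make && sudo make install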
@Benoit9, yep I gave CLBlast a try (very briefly) with my old GTX660 a while back. |
Any more updates to this? Running it on GPU would be very cool.
Would this work on older Nvidia GPUs? For instance, one that throws errors with the PyTorch version of Whisper.
It would be nice to have an implementation that also works on AMD, like HERE. It's fast and uses only 4GB with the large model on a RX 6800XT. I'm worried this repo won't be maintained though.
Does this still work in the current master branch? Inference on my CPU is quite slow, about 0.5 tokens/s on an AMD 3900X with the 7B model.
I found in this ticket what I was talking about here: #713
Anyway I have also tested with CLBlast compiled locally and I get an exception inside CLBlast when I try to transcribe a simple wav file:
vricosti@iMac ~/Dev/Perso/jarvis/whisper.cpp/build/bin/Release
system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
main: processing '/Users/vricosti/Dev/Perso/jarvis/jarvis-open-ai/20230403_180429.wav' (80000 samples, 5.0 sec), 8 threads, 1 processors, lang = fr, task = transcribe, timestamps = 1 ...
CLBlast: OpenCL error: clBuildProgram: -11
And I am not very lucky with the stream app, because sometimes it can recognize some words while my small custom app in Python recognizes a lot more words, so something is weird.
Anyone know about this: https://github.com/ROCm-Developer-Tools/HIP
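For what it's worth, OpenCL error -11 is CL_BUILD_PROGRAM_FAILURE, i.e. the driver refused to compile a kernel (here one of CLBlast's), which usually points at the OpenCL driver or device rather than whisper.cpp itself. The standard way to see the compiler's message is to read the program build log; a generic helper using the plain OpenCL API (not something exposed by whisper.cpp or CLBlast's netlib bindings) looks like this, with the include being <OpenCL/opencl.h> on macOS:

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Print the device compiler output after clBuildProgram fails with -11. */
static void print_build_log(cl_program program, cl_device_id device) {
    size_t len = 0;
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &len);
    char *log = malloc(len + 1);
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, len, log, NULL);
    log[len] = '\0';
    fprintf(stderr, "OpenCL build log:\n%s\n", log);
    free(log);
}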
The CUDA toolkit documentation link states that NVBLAS is a drop-in BLAS replacement.
It also states: "The NVBLAS Library is a GPU-accelerated Library that implements BLAS (Basic Linear Algebra Subprograms). It can accelerate most BLAS Level-3 routines by dynamically routing BLAS calls to one or more NVIDIA GPUs present in the system, when the characteristics of the call make it speed up on a GPU." One of those Level-3 routines is sgemm (matrix multiplication), which is used extensively by ggml.c.
In theory, IF CORRECTLY CONFIGURED, NVBLAS can intercept the calls to the OpenBLAS function cblas_sgemm and accelerate them using a CUDA-compatible graphics card installed in the system.
There is not much information about the specific steps to enable it, but I could piece together this step-by-step guide:
1-Install the CUDA toolkit from the official link: https://developer.nvidia.com/cuda-downloads
2-Create the file /etc/nvblas.conf with the following contents:
NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB /usr/lib/x86_64-linux-gnu/libopenblas.so
NVBLAS_GPU_LIST ALL
/usr/lib/x86_64-linux-gnu/libopenblas.so is the location of libopenblas.so on my system; you have to point it to the correct location (it should not be that different). A note at the end of this post shows a quick way to locate it.
3-Create an environment variable pointing to nvblas.conf:
export NVBLAS_CONFIG_FILE=/etc/nvblas.conf
4-Create an environment variable pointing to the location of libnvblas.so:
export LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so.11
Here it is not clear which .so file is needed. For example, on my system I can find the following:
/usr/local/cuda/lib64/libnvblas.so
/usr/local/cuda/lib64/libnvblas.so.11
/usr/local/cuda/lib64/libnvblas.so.11.11.3.6
/usr/local/cuda-11.8/lib64/libnvblas.so
/usr/local/cuda-11.8/lib64/libnvblas.so.11
/usr/local/cuda-11.8/lib64/libnvblas.so.11.11.3.6
5-Download the source code of whisper.cpp with:
git clone https://github.com/ggerganov/whisper.cpp
6-Inside the whisper.cpp folder, execute
cmake -DWHISPER_SUPPORT_OPENBLAS=ON .
7-Inside the whisper.cpp folder, execute
make
You should now have a compiled main executable with BLAS support turned on.
8-Now, at least in my case, when I run a test transcription, the program confirms that it is using BLAS (BLAS = 1), but NVBLAS does not seem to be intercepting the calls. NVTOP does not show any GPU usage and no nvblas.log is created.
If someone can figure out how to make this work, it has the potential to substantially accelerate transcription speed on x64.
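Two small additions that may help with steps 2 and 8 (standard tooling plus a setting that appears in the nvblas.conf shown earlier in the thread; the exact paths are system-dependent). To locate the CPU BLAS library for NVBLAS_CPU_BLAS_LIB:

ldconfig -p | grep libopenblas

And enabling trace logging in /etc/nvblas.conf makes every intercepted BLAS call show up in nvblas.log, which is an easy way to confirm whether interception is happening at all:

NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB /usr/lib/x86_64-linux-gnu/libopenblas.so
NVBLAS_GPU_LIST ALL
NVBLAS_TRACE_LOG_ENABLED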