
[WIP] CUDA backend #1983

Open
wants to merge 1 commit into main

Conversation

zcbenz
Contributor

zcbenz commented Mar 21, 2025

This PR is an ongoing effort to add a CUDA backend to MLX. Very little works yet, but you can already run the tutorial example.

To build and test:

$ cmake . -Bbuild -DMLX_BUILD_CUDA=ON -DMLX_BUILD_EXAMPLES=ON
$ cmake --build build -j 16
$ ./build/examples/cpp/tutorial
array([[2, 3],
       [4, 5]], dtype=float32)
array([[1, 1],
       [1, 1]], dtype=float32)

For development I usually use:

$ cmake . -Bbuild -DMLX_BUILD_CUDA=ON -DMLX_BUILD_EXAMPLES=ON -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DCMAKE_CUDA_COMPILER_LAUNCHER=ccache -DCMAKE_BUILD_TYPE=Debug -GNinja

Only tested on Ubuntu 22.04 with CUDA 11.6; other environments should work in theory, but they have not been tested.

This PR is not updated frequently; if anyone is interested in following the development in real time, please check my forked repo.


There are two main reasons for a CUDA backend:

  • CUDA supports unified memory: some devices support it in hardware, and CUDA provides software-managed unified memory for devices without hardware support (a minimal sketch of the API is shown after this list).
  • NVIDIA hardware is widely used for academic and large-scale computation. Being able to write and test code locally on a Mac and then deploy to supercomputers would make for a good developer experience.
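
For context, here is a minimal sketch of software-managed unified memory with the plain CUDA runtime API (illustration only, not code from this PR): a single cudaMallocManaged allocation is visible to both host and device, and the driver migrates pages as needed.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void add_one(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1.0f;
}

int main() {
  const int n = 4;
  float* data = nullptr;
  // One allocation, visible to both CPU and GPU; no explicit cudaMemcpy.
  cudaMallocManaged(&data, n * sizeof(float));
  for (int i = 0; i < n; ++i) data[i] = float(i);

  add_one<<<1, n>>>(data, n);
  cudaDeviceSynchronize();  // wait for the GPU before reading on the host

  for (int i = 0; i < n; ++i) printf("%g\n", data[i]);
  cudaFree(data);
}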

This work is sponsored by Apple.

@radudiaconu0
Copy link

radudiaconu0 commented Mar 22, 2025

I wanna add ROCm support based on your CUDA pull request. Would that be ok with you?
@zcbenz

@awni
Member

awni commented Mar 22, 2025

Awesome progress so far @zcbenz !!

I'm wondering what the best way is to get this incorporated into MLX. I can think of a couple of options:

  • Once this is ready, we can turn it into a cuda branch in MLX and then send PRs to it. This will make it easier from a review / PR management standpoint.
  • Just merge the backbone infra for supporting CUDA and send more incremental PRs over time

I kind of prefer the latter, but I'm open to suggestions.

@zcbenz
Contributor Author

zcbenz commented Mar 22, 2025

I wanna add ROCm support based on your CUDA pull request. Would that be ok with you?

@radudiaconu0 Of course I'm ok with it!

Before you begin, you might want to first decide how the ROCm backend will live together with the CUDA backend. I'm not familiar with ROCm, but I have seen two patterns in projects with both backends:

  1. Both backends share the same code, with the help of #defines and name aliases (a minimal sketch of this pattern is shown after this list).
  2. Transpile the CUDA code to HIP on the fly at build time, which is what PyTorch does.
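
To illustrate pattern 1 (a sketch only, not code from this PR; the gpu* macro names are hypothetical), a shared header can alias the two runtime APIs so the same source compiles against either backend:

// gpu_api.h - hypothetical shared header for pattern 1
#ifdef __HIP_PLATFORM_AMD__
#include <hip/hip_runtime.h>
#define gpuStream_t hipStream_t
#define gpuMalloc hipMalloc
#define gpuFree hipFree
#define gpuMemcpyAsync hipMemcpyAsync
#define gpuMemcpyDeviceToDevice hipMemcpyDeviceToDevice
#else
#include <cuda_runtime.h>
#define gpuStream_t cudaStream_t
#define gpuMalloc cudaMalloc
#define gpuFree cudaFree
#define gpuMemcpyAsync cudaMemcpyAsync
#define gpuMemcpyDeviceToDevice cudaMemcpyDeviceToDevice
#endif

// Backend-agnostic code then uses only the gpu* names, e.g.:
//   float* buf;
//   gpuMalloc(&buf, n * sizeof(float));
//   gpuMemcpyAsync(dst, buf, n * sizeof(float), gpuMemcpyDeviceToDevice, stream);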

Another thing to note is that this PR is bound to change heavily in the following weeks; I'm still experimenting with what the best interface for integration is.

@angeloskath
Member

Awesome progress indeed!

Just chiming in regarding the best way to incorporate this. Imho merging often is the way to go (basically option 2). Combined with running CUDA tests in CI, it will be the easiest to live with (since we'll know when we break it even if we don't use it). Otherwise the cuda branch would have to be constantly rebased on top of main, which could be annoying.

@radudiaconu0

I would try either making a separate hip folder or using hipify on your CUDA code to port it to ROCm/HIP.

@zcbenz
Contributor Author

zcbenz commented Mar 23, 2025

I'm wondering what the best way is to get this incorporated into MLX.

I find myself repeatedly refactoring the code as I port new kernels. I think I still need to implement a few more primitives before the backbone code stabilizes, so probably a few more weeks of experimenting.

Once the code is ready for review, I can split this PR into a backbone PR and a few small PRs for each primitive. Future work would then be submitted as incremental PRs.

@zcbenz
Contributor Author

zcbenz commented Mar 24, 2025

In CUDA, the size of kernel parameters must be known at compile time, i.e. we can't pass dynamically sized shapes/strides via constant memory the way the Metal kernels do.

I'm currently passing shapes/strides to kernels via fixed-size cuda::std::array parameters, which is what PyTorch does. This comes with a limit on the maximum ndim of arrays, which PyTorch sets to 25; I'm using 8 for now, and it can easily be changed if that turns out not to be enough. A minimal sketch of the idea follows.
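
To illustrate the approach (a sketch only, not the PR's actual code; MAX_NDIM and the kernel below are hypothetical), the runtime shape and strides are copied into cuda::std::array parameters whose size is fixed at compile time, and the kernel only loops over the first ndim entries:

#include <cstddef>
#include <cstdint>
#include <cuda/std/array>

// Hypothetical compile-time limit on dimensionality (PyTorch uses 25, this PR uses 8).
constexpr int MAX_NDIM = 8;

// The parameter sizes are known at compile time even though ndim varies at runtime.
__global__ void copy_general(
    const float* in,
    float* out,
    const cuda::std::array<int32_t, MAX_NDIM> shape,
    const cuda::std::array<int64_t, MAX_NDIM> strides,
    int ndim,
    size_t size) {
  size_t index = blockIdx.x * blockDim.x + threadIdx.x;
  if (index >= size) return;
  // Map the flat output index to an input offset using the shape/strides.
  size_t elem = index;
  int64_t offset = 0;
  for (int i = ndim - 1; i >= 0; --i) {
    offset += static_cast<int64_t>(elem % shape[i]) * strides[i];
    elem /= shape[i];
  }
  out[index] = in[offset];
}

On the host side, the launcher would copy the array's shape and strides into these fixed-size parameters (leaving unused slots untouched) before launching the kernel, so raising the limit only requires changing MAX_NDIM.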

@awni
Member

awni commented Mar 24, 2025

This comes with a limit on the maximum ndim of arrays, which PyTorch sets to 25; I'm using 8 for now, and it can easily be changed if that turns out not to be enough.

Sounds great! As long as we can change it by setting one number somewhere I think that's perfectly fine.
