
[WIP] CUDA backend #1983

Open
wants to merge 1 commit into main

Conversation

zcbenz
Contributor

zcbenz commented Mar 21, 2025

This PR is an ongoing effort to add a CUDA backend to MLX. Very little works yet, but you can already run the tutorial example.

To build and test:

$ cmake . -Bbuild -DMLX_BUILD_CUDA=ON -DMLX_BUILD_EXAMPLES=ON
$ cmake --build build -j 16
$ ./build/examples/cpp/tutorial
array([[2, 3],
       [4, 5]], dtype=float32)
array([[1, 1],
       [1, 1]], dtype=float32)

For development I usually use:

$ cmake . -Bbuild -DMLX_BUILD_CUDA=ON -DMLX_BUILD_EXAMPLES=ON -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DCMAKE_CUDA_COMPILER_LAUNCHER=ccache -DCMAKE_BUILD_TYPE=Debug -GNinja

Only tested on Ubuntu 22.04 with CUDA 11.6; other environments should work in theory, but they have not been tested.

This PR is not updated frequently; if anyone is interested in following the development in real time, please check my forked repo.


There are two main reasons for a CUDA backend:

  • CUDA supports unified memory: some devices support it in hardware, and CUDA provides software-managed unified memory for devices without hardware support (a minimal sketch of the API is shown after this list).
  • NVIDIA hardware is widely used for academic and large-scale computation. Being able to write and test code locally on a Mac and then deploy to supercomputers would make for a good developer experience.
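
For context, here is a minimal sketch of software-managed unified memory with the plain CUDA runtime API (illustration only, not code from this PR): a single cudaMallocManaged allocation is visible to both host and device, and the driver migrates pages as needed.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void add_one(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1.0f;
}

int main() {
  const int n = 4;
  float* data = nullptr;
  // One allocation, visible to both CPU and GPU; no explicit cudaMemcpy.
  cudaMallocManaged(&data, n * sizeof(float));
  for (int i = 0; i < n; ++i) data[i] = float(i);

  add_one<<<1, n>>>(data, n);
  cudaDeviceSynchronize();  // wait for the GPU before reading on the host

  for (int i = 0; i < n; ++i) printf("%g\n", data[i]);
  cudaFree(data);
}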

This work is sponsored by Apple.

@radudiaconu0
Copy link

radudiaconu0 commented Mar 22, 2025

I wanna add ROCm support based on your CUDA pull request. Would that be ok with you?
@zcbenz

@awni
Member

awni commented Mar 22, 2025

Awesome progress so far @zcbenz !!

I'm wondering what the best way is to get this incorporated into MLX. I can think of a couple of options:

  • Once this is ready, we can turn it into a cuda branch in MLX and then send PRs to it. This will make it easier from a review / PR management standpoint.
  • Just merge the backbone infra for supporting CUDA and send more incremental PRs over time

I kind of prefer the latter, but I'm open to suggestions.

@zcbenz
Contributor Author

zcbenz commented Mar 22, 2025

I wanna add ROCm support based on your CUDA pull request. Would that be ok with you?

@radudiaconu0 Of course I'm ok with it!

Before you begin, you might want to first decide how the ROCm backend will live together with the CUDA backend. I'm not familiar with ROCm, but I have seen two patterns in projects with both backends:

  1. Both backends share the same code, with the help of #defines and name aliases (a minimal sketch of this pattern is shown after this list).
  2. Transpile the CUDA code to HIP on the fly at build time, which is what PyTorch does.
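
To illustrate pattern 1 (a sketch only, not code from this PR; the gpu* macro names are hypothetical), a shared header can alias the two runtime APIs so the same source compiles against either backend:

// gpu_api.h - hypothetical shared header for pattern 1
#ifdef __HIP_PLATFORM_AMD__
#include <hip/hip_runtime.h>
#define gpuStream_t hipStream_t
#define gpuMalloc hipMalloc
#define gpuFree hipFree
#define gpuMemcpyAsync hipMemcpyAsync
#define gpuMemcpyDeviceToDevice hipMemcpyDeviceToDevice
#else
#include <cuda_runtime.h>
#define gpuStream_t cudaStream_t
#define gpuMalloc cudaMalloc
#define gpuFree cudaFree
#define gpuMemcpyAsync cudaMemcpyAsync
#define gpuMemcpyDeviceToDevice cudaMemcpyDeviceToDevice
#endif

// Backend-agnostic code then uses only the gpu* names, e.g.:
//   float* buf;
//   gpuMalloc(&buf, n * sizeof(float));
//   gpuMemcpyAsync(dst, buf, n * sizeof(float), gpuMemcpyDeviceToDevice, stream);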

Another thing to note is that this PR is bound to change heavily in the following weeks; I'm still experimenting with what the best interface for integration is.

@angeloskath
Member

Awesome progress indeed!

Just chiming in regarding the best way to incorporate this. Imho merging often is the way to go (basically option 2). Combined with running CUDA tests in CI, it will be the easiest to live with (since we'll know when we break it even if we don't use it). Otherwise the cuda branch would have to be constantly rebased on top of main, which could be annoying.

@radudiaconu0

I would try either making a separate hip folder or using hipify on your CUDA code to port it to ROCm/HIP.

@zcbenz
Contributor Author

zcbenz commented Mar 23, 2025

I'm wondering what the best way is to get this incorporated into MLX.

I find myself repeatedly refactoring the code as I port new kernels. I think I still need to implement a few more primitives before the backbone code stabilizes, so probably a few more weeks of experimenting.

Once the code is ready for review, I can split this PR into a backbone PR and a few small PRs for each primitive. Future work would then be submitted as incremental PRs.

@zcbenz
Contributor Author

zcbenz commented Mar 24, 2025

In CUDA, the size of kernel parameters must be known at compile time, i.e. we can't pass dynamically sized shapes/strides via constant memory the way the Metal kernels do.

I'm currently passing shapes/strides to kernels via fixed-size cuda::std::array parameters, which is what PyTorch does. This comes with a limit on the maximum ndim of arrays, which PyTorch sets to 25; I'm using 8 for now, and it can easily be changed if that turns out not to be enough. A minimal sketch of the idea follows.
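
To illustrate the approach (a sketch only, not the PR's actual code; MAX_NDIM and the kernel below are hypothetical), the runtime shape and strides are copied into cuda::std::array parameters whose size is fixed at compile time, and the kernel only loops over the first ndim entries:

#include <cstddef>
#include <cstdint>
#include <cuda/std/array>

// Hypothetical compile-time limit on dimensionality (PyTorch uses 25, this PR uses 8).
constexpr int MAX_NDIM = 8;

// The parameter sizes are known at compile time even though ndim varies at runtime.
__global__ void copy_general(
    const float* in,
    float* out,
    const cuda::std::array<int32_t, MAX_NDIM> shape,
    const cuda::std::array<int64_t, MAX_NDIM> strides,
    int ndim,
    size_t size) {
  size_t index = blockIdx.x * blockDim.x + threadIdx.x;
  if (index >= size) return;
  // Map the flat output index to an input offset using the shape/strides.
  size_t elem = index;
  int64_t offset = 0;
  for (int i = ndim - 1; i >= 0; --i) {
    offset += static_cast<int64_t>(elem % shape[i]) * strides[i];
    elem /= shape[i];
  }
  out[index] = in[offset];
}

On the host side, the launcher would copy the array's shape and strides into these fixed-size parameters (leaving unused slots untouched) before launching the kernel, so raising the limit only requires changing MAX_NDIM.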

@awni
Member

awni commented Mar 24, 2025

This comes with a limit on the maximum ndim of arrays, which PyTorch sets to 25; I'm using 8 for now, and it can easily be changed if that turns out not to be enough.

Sounds great! As long as we can change it by setting one number somewhere I think that's perfectly fine.
