fix(xunix): improve handling of gpu library mounts #129

Merged
merged 7 commits into from
Mar 7, 2025

@johnstcn johnstcn commented Mar 6, 2025

This PR fixes some issues Bjorn found when testing the changes in #127 as well as some other lingering issues I'd noticed:

  1. The NVIDIA container runtime mounts libraries with the driver version number appended to their names, and creates symlinks to those files:
root@ad0724e97635:/# ls -l /usr/lib/x86_64-linux-gnu/ | grep -Ei 'libgl|nvidia|vulkan|cuda'
lrwxrwxrwx  1 root root       12 Mar  6 23:19 libcuda.so -> libcuda.so.1
lrwxrwxrwx  1 root root       20 Mar  6 23:19 libcuda.so.1 -> libcuda.so.550.90.07
-rwxr-xr-x  1 root root 28581024 Dec 23 16:41 libcuda.so.550.90.07
lrwxrwxrwx  1 root root       28 Mar  6 23:19 libcudadebugger.so.1 -> libcudadebugger.so.550.90.07
-rwxr-xr-x  1 root root 10524136 Dec 23 16:41 libcudadebugger.so.550.90.07
lrwxrwxrwx  1 root root       32 Mar  6 23:19 libnvidia-allocator.so.1 -> libnvidia-allocator.so.550.90.07
-rwxr-xr-x  1 root root   168776 Dec 23 16:41 libnvidia-allocator.so.550.90.07
lrwxrwxrwx  1 root root       26 Mar  6 23:19 libnvidia-cfg.so.1 -> libnvidia-cfg.so.550.90.07
-rwxr-xr-x  1 root root   398968 Dec 23 16:41 libnvidia-cfg.so.550.90.07
-rwxr-xr-x  1 root root 43659040 Dec 23 16:41 libnvidia-gpucomp.so.550.90.07
lrwxrwxrwx  1 root root       25 Mar  6 23:19 libnvidia-ml.so.1 -> libnvidia-ml.so.550.90.07
-rwxr-xr-x  1 root root  2078360 Dec 23 16:41 libnvidia-ml.so.550.90.07
lrwxrwxrwx  1 root root       27 Mar  6 23:19 libnvidia-nvvm.so.4 -> libnvidia-nvvm.so.550.90.07
-rwxr-xr-x  1 root root 86842616 Dec 23 16:41 libnvidia-nvvm.so.550.90.07
lrwxrwxrwx  1 root root       29 Mar  6 23:19 libnvidia-opencl.so.1 -> libnvidia-opencl.so.550.90.07
-rwxr-xr-x  1 root root 23494344 Dec 23 16:41 libnvidia-opencl.so.550.90.07
-rwxr-xr-x  1 root root    10176 Dec 23 16:41 libnvidia-pkcs11-openssl3.so.550.90.07
-rwxr-xr-x  1 root root    10168 Dec 23 16:41 libnvidia-pkcs11.so.550.90.07
lrwxrwxrwx  1 root root       37 Mar  6 23:19 libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.550.90.07
-rwxr-xr-x  1 root root 28674464 Dec 23 16:41 libnvidia-ptxjitcompiler.so.550.90.07

We had been mounting the versioned files (e.g. libnvidia-ml.so.550.90.07) but not the symlinks, as those do not show up as mounts:

root@ad0724e97635:/# mount | grep -Ei 'libgl|nvidia|vulkan|cuda'
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/bin/nvidia-smi type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/bin/nvidia-debugdump type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/bin/nvidia-persistenced type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/bin/nvidia-cuda-mps-control type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/bin/nvidia-cuda-mps-server type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/x86_64-linux-gnu/libcuda.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/x86_64-linux-gnu/libcudadebugger.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/x86_64-linux-gnu/libnvidia-gpucomp.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/x86_64-linux-gnu/libnvidia-pkcs11.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/x86_64-linux-gnu/libnvidia-pkcs11-openssl3.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/firmware/nvidia/550.90.07/gsp_ga10x.bin type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/firmware/nvidia/550.90.07/gsp_tu10x.bin type ext4 (ro,nosuid,nodev,relatime)
udev on /dev/nvidiactl type devtmpfs (ro,nosuid,noexec,relatime,size=32743052k,nr_inodes=8185763,mode=755,inode64)
udev on /dev/nvidia-uvm type devtmpfs (ro,nosuid,noexec,relatime,size=32743052k,nr_inodes=8185763,mode=755,inode64)
udev on /dev/nvidia-uvm-tools type devtmpfs (ro,nosuid,noexec,relatime,size=32743052k,nr_inodes=8185763,mode=755,inode64)
udev on /dev/nvidia0 type devtmpfs (ro,nosuid,noexec,relatime,size=32743052k,nr_inodes=8185763,mode=755,inode64)

This modifies the behaviour of xunix.GPUs to also return the symlinks to those driver files, so that we also mount them inside the inner container.

  2. gpuExtraRegex searches for files matching the expression (?i)(libgl|nvidia|vulkan|cuda). It turns out this also matches libglib-X.Y.so. We want to avoid that, as it can cause dnf to break when running Red Hat-based distros in the inner container. This is likely an oversight from fix(xunix): also mount shared symlinked shared object files #123, but it appears to have been present for longer than that. Shout out to Bjorn for spotting this!

More info:

libglib-2.0.so.0 from my host machine depended on libpcre.so.3:

$ ldd /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0
        linux-vdso.so.1 (0x00007ffcf7dfa000)
        libpcre.so.3 => /lib/x86_64-linux-gnu/libpcre.so.3 (0x000079c5fc7ee000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x000079c5fc707000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000079c5fc400000)
        /lib64/ld-linux-x86-64.so.2 (0x000079c5fc9a8000)

The system python in the inner image appears to have been trying to pick it up and failing due to the missing dependency:

[root@workspacecvm /]# dnf update
Traceback (most recent call last):
  File "/usr/bin/dnf", line 61, in <module>
    from dnf.cli import main
  File "/usr/lib/python3.9/site-packages/dnf/__init__.py", line 30, in <module>
    import dnf.base
  File "/usr/lib/python3.9/site-packages/dnf/base.py", line 29, in <module>
    import libdnf.transaction
  File "/usr/lib64/python3.9/site-packages/libdnf/__init__.py", line 12, in <module>
    from . import conf
  File "/usr/lib64/python3.9/site-packages/libdnf/conf.py", line 13, in <module>
    from . import _conf
ImportError: libpcre.so.3: cannot open shared object file: No such file or directory

  3. Also adds more GPU integration tests, including one that uses the sample CUDA image from NVIDIA.

NOTE: as these integration tests do not run automatically in CI, you will need to run them manually on a physical machine that has the NVIDIA container runtime installed:

CODER_TEST_INTEGRATION=1 go test -v -count=1 ./integration -test.run='^TestDocker_Nvidia/'
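
For illustration, the symlink handling described in item 1 can be sketched roughly as follows. This is a minimal sketch and not the actual xunix.GPUs code: symlinksToMounted and demoLinks are hypothetical names, and the sketch assumes the symlinks live in the same directory as the mounted files.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// symlinksToMounted is a hypothetical helper (not the real xunix code).
// It returns the symlinks in dir that resolve, possibly through a chain
// of links, to one of the paths listed in mounted.
func symlinksToMounted(dir string, mounted []string) ([]string, error) {
	// Canonicalise the mounted paths so they compare equal to resolved links.
	targets := make(map[string]struct{}, len(mounted))
	for _, m := range mounted {
		resolved, err := filepath.EvalSymlinks(m)
		if err != nil {
			return nil, err
		}
		targets[resolved] = struct{}{}
	}

	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}

	var links []string
	for _, e := range entries {
		if e.Type()&fs.ModeSymlink == 0 {
			continue // regular files already show up as mounts
		}
		p := filepath.Join(dir, e.Name())
		resolved, err := filepath.EvalSymlinks(p)
		if err != nil {
			continue // skip dangling links
		}
		if _, ok := targets[resolved]; ok {
			links = append(links, p)
		}
	}
	return links, nil
}

// demoLinks builds a throwaway directory that mimics the libcuda layout
// from the listing above and returns the base names of the links found.
func demoLinks() ([]string, error) {
	dir, err := os.MkdirTemp("", "gpulibs")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)

	versioned := filepath.Join(dir, "libcuda.so.550.90.07")
	if err := os.WriteFile(versioned, []byte("stub"), 0o644); err != nil {
		return nil, err
	}
	if err := os.Symlink("libcuda.so.550.90.07", filepath.Join(dir, "libcuda.so.1")); err != nil {
		return nil, err
	}
	if err := os.Symlink("libcuda.so.1", filepath.Join(dir, "libcuda.so")); err != nil {
		return nil, err
	}

	links, err := symlinksToMounted(dir, []string{versioned})
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(links))
	for _, l := range links {
		names = append(names, filepath.Base(l))
	}
	return names, nil
}

func main() {
	names, err := demoLinks()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(names)
}
```

Resolving both sides with filepath.EvalSymlinks means chained links such as libcuda.so -> libcuda.so.1 -> libcuda.so.550.90.07 are picked up without following each hop by hand.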

@johnstcn johnstcn self-assigned this Mar 6, 2025
Comment on lines +96 to +117
t.Run("EmptyHostUsrLibDir", func(t *testing.T) {
	t.Parallel()
	ctx, cancel := context.WithCancel(context.Background())
	t.Cleanup(cancel)
	emptyUsrLibDir := t.TempDir()

	// Start the envbox container.
	ctID := startEnvboxCmd(ctx, t, integrationtest.UbuntuImage, "root",
		"-v", emptyUsrLibDir+":/var/coder/usr/lib",
		"--env", "CODER_ADD_GPU=true",
		"--env", "CODER_USR_LIB_DIR=/var/coder/usr/lib",
		"--runtime=nvidia",
		"--gpus=all",
	)

	ofs := outerFiles(ctx, t, ctID, "/usr/lib/x86_64-linux-gnu/libnv*")
	// Assert invariant: the outer container has the files we expect.
	require.NotEmpty(t, ofs, "failed to list outer container files")
	// Assert that expected files are available in the inner container.
	assertInnerFiles(ctx, t, ctID, "/usr/lib/x86_64-linux-gnu/libnv*", ofs...)
	assertInnerNvidiaSMI(ctx, t, ctID)
})

review: this tests that we can get by with no extra files in CODER_USR_LIB_DIR

Comment on lines +125 to +129
if !gpuExtraRegex.MatchString(path) {
	return nil
}

if !sharedObjectRegex.MatchString(path) {

review: this makes the control flow a little easier to read; I accidentally removed this but it still performs an important task
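
As a rough illustration of the two-guard control flow in the snippet above, a directory walk that returns early on each check might look like the sketch below. The gpuExtraRegex pattern is the updated one from this PR; the sharedObjectRegex pattern is an assumption (its real definition is not shown here), and extraLibs and demoExtraLibs are hypothetical names.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"regexp"
)

var (
	// Updated pattern taken from this PR.
	gpuExtraRegex = regexp.MustCompile(`(?i)(libgl(e|sx|\.)|nvidia|vulkan|cuda)`)
	// Assumption: the real sharedObjectRegex is not shown in this PR.
	sharedObjectRegex = regexp.MustCompile(`\.so(\.[0-9.]+)?$`)
)

// extraLibs is a hypothetical helper that walks root and keeps only
// GPU-related shared objects, returning early at each guard.
func extraLibs(root string) ([]string, error) {
	var found []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			return nil
		}
		// Guard 1: the path must look GPU-related.
		if !gpuExtraRegex.MatchString(path) {
			return nil
		}
		// Guard 2: the path must look like a shared object.
		if !sharedObjectRegex.MatchString(path) {
			return nil
		}
		found = append(found, path)
		return nil
	})
	return found, err
}

// demoExtraLibs creates a few stub files and returns the base names kept.
func demoExtraLibs() ([]string, error) {
	dir, err := os.MkdirTemp("", "walkdemo")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)

	for _, name := range []string{
		"libGL.so.1",
		"libglib-2.0.so.0",
		"libnvidia-ml.so.550.90.07",
		"nvidia-smi",
	} {
		if err := os.WriteFile(filepath.Join(dir, name), []byte("stub"), 0o644); err != nil {
			return nil, err
		}
	}

	paths, err := extraLibs(dir)
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(paths))
	for _, p := range paths {
		names = append(names, filepath.Base(p))
	}
	return names, nil
}

func main() {
	names, err := demoExtraLibs()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(names)
}
```

In this sketch nvidia-smi passes the first guard but is dropped by the second because it is not a shared object, while libglib-2.0.so.0 is dropped by the first.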

gpuExtraRegex = regexp.MustCompile("(?i)(libgl|nvidia|vulkan|cuda)")
gpuEnvRegex = regexp.MustCompile("(?i)nvidia")
gpuMountRegex = regexp.MustCompile(`(?i)(nvidia|vulkan|cuda)`)
gpuExtraRegex = regexp.MustCompile(`(?i)(libgl(e|sx|\.)|nvidia|vulkan|cuda)`)

review: modified this regex to hopefully match the right things and not the wrong things
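
To make the effect of the regex change concrete, here is a small sketch that runs the old and new gpuExtraRegex patterns (both copied from this PR) against a handful of file names. The variable names oldExtra and newExtra and the sample file names are chosen for illustration only.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Pattern before this PR.
	oldExtra = regexp.MustCompile(`(?i)(libgl|nvidia|vulkan|cuda)`)
	// Pattern after this PR.
	newExtra = regexp.MustCompile(`(?i)(libgl(e|sx|\.)|nvidia|vulkan|cuda)`)
)

func main() {
	names := []string{
		"libglib-2.0.so.0",
		"libGL.so.1",
		"libGLESv2.so.2",
		"libnvidia-ml.so.550.90.07",
		"libvulkan.so.1",
		"libcuda.so.1",
	}
	for _, name := range names {
		fmt.Printf("%-28s old=%-5t new=%t\n",
			name, oldExtra.MatchString(name), newExtra.MatchString(name))
	}
}
```

Of these names, only libglib-2.0.so.0 changes: the old pattern matches it through the bare libgl prefix, while the new pattern requires libgl to be followed by e, sx, or a literal dot.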

@johnstcn johnstcn merged commit a09c7c2 into main Mar 7, 2025
9 checks passed