investigate issues with GPUs with redhat UBI #126

Closed · johnstcn opened this issue Feb 28, 2025 · 8 comments

@johnstcn (Member)

Follow-up from #111

There are reports of issues when using GPUs with the Red Hat 9 UBI as the inner image.

Use the same setup as in the previous issue.

johnstcn self-assigned this Feb 28, 2025
@johnstcn (Member, Author) commented Mar 4, 2025

The NVIDIA container runtime will generally try to mount GPU libraries into the correct location for the container image. Note, for example, the different library directories chosen for UBI 9 (/usr/lib64), Ubuntu 22.04 (/usr/lib/x86_64-linux-gnu), and Arch Linux (/usr/lib):

$ docker run --rm --gpus=all --runtime=nvidia redhat/ubi9:latest mount | grep -i nvidia
tmpfs on /proc/driver/nvidia type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=555,inode64)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/bin/nvidia-smi type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/bin/nvidia-debugdump type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/bin/nvidia-persistenced type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/bin/nvidia-cuda-mps-control type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/bin/nvidia-cuda-mps-server type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib64/libnvidia-ml.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib64/libnvidia-cfg.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib64/libnvidia-opencl.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib64/libnvidia-gpucomp.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib64/libnvidia-ptxjitcompiler.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib64/libnvidia-allocator.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib64/libnvidia-pkcs11.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib64/libnvidia-pkcs11-openssl3.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib64/libnvidia-nvvm.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/firmware/nvidia/550.90.07/gsp_ga10x.bin type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/firmware/nvidia/550.90.07/gsp_tu10x.bin type ext4 (ro,nosuid,nodev,relatime)
udev on /dev/nvidiactl type devtmpfs (ro,nosuid,noexec,relatime,size=32743052k,nr_inodes=8185763,mode=755,inode64)
udev on /dev/nvidia-uvm type devtmpfs (ro,nosuid,noexec,relatime,size=32743052k,nr_inodes=8185763,mode=755,inode64)
udev on /dev/nvidia-uvm-tools type devtmpfs (ro,nosuid,noexec,relatime,size=32743052k,nr_inodes=8185763,mode=755,inode64)
udev on /dev/nvidia0 type devtmpfs (ro,nosuid,noexec,relatime,size=32743052k,nr_inodes=8185763,mode=755,inode64)
proc on /proc/driver/nvidia/gpus/0000:01:00.0 type proc (ro,nosuid,nodev,noexec,relatime)

$ docker run --rm --gpus=all --runtime=nvidia ubuntu:22.04 mount | grep -i nvidia
tmpfs on /proc/driver/nvidia type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=555,inode64)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/bin/nvidia-smi type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/bin/nvidia-debugdump type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/bin/nvidia-persistenced type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/bin/nvidia-cuda-mps-control type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/bin/nvidia-cuda-mps-server type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/x86_64-linux-gnu/libnvidia-gpucomp.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/x86_64-linux-gnu/libnvidia-pkcs11.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/x86_64-linux-gnu/libnvidia-pkcs11-openssl3.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/firmware/nvidia/550.90.07/gsp_ga10x.bin type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/firmware/nvidia/550.90.07/gsp_tu10x.bin type ext4 (ro,nosuid,nodev,relatime)
udev on /dev/nvidiactl type devtmpfs (ro,nosuid,noexec,relatime,size=32743052k,nr_inodes=8185763,mode=755,inode64)
udev on /dev/nvidia-uvm type devtmpfs (ro,nosuid,noexec,relatime,size=32743052k,nr_inodes=8185763,mode=755,inode64)
udev on /dev/nvidia-uvm-tools type devtmpfs (ro,nosuid,noexec,relatime,size=32743052k,nr_inodes=8185763,mode=755,inode64)
udev on /dev/nvidia0 type devtmpfs (ro,nosuid,noexec,relatime,size=32743052k,nr_inodes=8185763,mode=755,inode64)
proc on /proc/driver/nvidia/gpus/0000:01:00.0 type proc (ro,nosuid,nodev,noexec,relatime)

$ docker run --rm --gpus=all --runtime=nvidia archlinux:base mount | grep -i nvidia
tmpfs on /proc/driver/nvidia type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=555,inode64)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/bin/nvidia-smi type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/bin/nvidia-debugdump type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/bin/nvidia-persistenced type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/bin/nvidia-cuda-mps-control type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/bin/nvidia-cuda-mps-server type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/libnvidia-ml.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/libnvidia-cfg.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/libnvidia-opencl.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/libnvidia-gpucomp.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/libnvidia-ptxjitcompiler.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/libnvidia-allocator.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/libnvidia-pkcs11.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/libnvidia-pkcs11-openssl3.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/libnvidia-nvvm.so.550.90.07 type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/firmware/nvidia/550.90.07/gsp_ga10x.bin type ext4 (ro,nosuid,nodev,relatime)
/dev/mapper/ubuntu--vg-ubuntu--lv on /usr/lib/firmware/nvidia/550.90.07/gsp_tu10x.bin type ext4 (ro,nosuid,nodev,relatime)
udev on /dev/nvidiactl type devtmpfs (ro,nosuid,noexec,relatime,size=32743052k,nr_inodes=8185763,mode=755,inode64)
udev on /dev/nvidia-uvm type devtmpfs (ro,nosuid,noexec,relatime,size=32743052k,nr_inodes=8185763,mode=755,inode64)
udev on /dev/nvidia-uvm-tools type devtmpfs (ro,nosuid,noexec,relatime,size=32743052k,nr_inodes=8185763,mode=755,inode64)
udev on /dev/nvidia0 type devtmpfs (ro,nosuid,noexec,relatime,size=32743052k,nr_inodes=8185763,mode=755,inode64)
proc on /proc/driver/nvidia/gpus/0000:01:00.0 type proc (ro,nosuid,nodev,noexec,relatime)

The logic for this appears to be split across several internal packages under github.com/NVIDIA/nvidia-container-toolkit/internal/{discover,lookup,modifier}.

These modifications are performed at the container runtime layer; the Docker daemon has no knowledge of the additional mounts.

The actual logic for determining where to place the symlinks tries candidate locations in order of precedence:

https://github.com/NVIDIA/nvidia-container-toolkit/blob/bc9ec77fdde552949022cbaf74c8b56e67702125/internal/lookup/library.go#L33-L45
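
Roughly, the idea is to probe a list of known library directories in order and use the first one that exists in the target root filesystem. A minimal Go sketch of that idea follows; the candidate list and the names candidateLibDirs / firstExistingLibDir are illustrative, not the toolkit's actual code:

package main

import (
    "fmt"
    "os"
    "path/filepath"
)

// candidateLibDirs is an illustrative precedence list, not the toolkit's
// actual search paths (those live in internal/lookup/library.go).
var candidateLibDirs = []string{
    "/usr/lib64",                // RHEL / UBI, Fedora
    "/usr/lib/x86_64-linux-gnu", // Debian / Ubuntu multiarch
    "/usr/lib",                  // Arch Linux and others
}

// firstExistingLibDir returns the first candidate directory that exists under
// the given root filesystem, mirroring an order-of-precedence lookup.
func firstExistingLibDir(root string) (string, error) {
    for _, dir := range candidateLibDirs {
        if info, err := os.Stat(filepath.Join(root, dir)); err == nil && info.IsDir() {
            return dir, nil
        }
    }
    return "", fmt.Errorf("no known library directory found under %s", root)
}

func main() {
    // For a real container this would be the inner image's root filesystem.
    dir, err := firstExistingLibDir("/")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Println("GPU libraries would land in:", dir)
}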

@johnstcn (Member, Author) commented Mar 4, 2025

@deansheather (Member) commented Mar 5, 2025

I'm happy either way. We could either match the NVIDIA behavior for determining whether to put them in /usr/lib or /usr/lib64, or we could add some sort of hodgepodge env var CODER_ENVBOX_GPU_LIB64 that just does a find/replace on the GPU mounts.

It's not immediately clear from the code you linked how they detect which one to use, but if it's easy to copy the check they're doing, then we should do the same thing. If it's difficult, then we should just add the env var.

edit: actually the last comment you posted wasn't loaded in for me, but that seems very simple to duplicate in our Go code?
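
For reference, a minimal sketch of what the env-var approach could look like, assuming CODER_ENVBOX_GPU_LIB64 acts as a simple toggle that rewrites the Ubuntu multiarch path in each GPU mount destination to /usr/lib64 (the env var and function name are only the proposal above, nothing implemented):

package main

import (
    "fmt"
    "os"
    "strings"
)

// remapGPUMountDests rewrites the container-side destination of each GPU
// library mount from the Ubuntu multiarch directory to /usr/lib64 when the
// proposed CODER_ENVBOX_GPU_LIB64 toggle is set. Purely illustrative.
func remapGPUMountDests(dests []string) []string {
    if os.Getenv("CODER_ENVBOX_GPU_LIB64") == "" {
        return dests
    }
    out := make([]string, 0, len(dests))
    for _, d := range dests {
        out = append(out, strings.Replace(d, "/usr/lib/x86_64-linux-gnu", "/usr/lib64", 1))
    }
    return out
}

func main() {
    _ = os.Setenv("CODER_ENVBOX_GPU_LIB64", "true")
    dests := []string{
        "/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.550.90.07",
        "/usr/bin/nvidia-smi", // untouched: not under the multiarch dir
    }
    fmt.Println(remapGPUMountDests(dests))
}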

@johnstcn (Member, Author) commented Mar 5, 2025

Yeah I'm leaning towards the last one as well!

@johnstcn (Member, Author) commented Mar 5, 2025

One thing I suspect is not very intuitive is determining the correct value for CODER_USR_LIB_DIR. If you're not using a fancy container runtime, you'll know this beforehand. Otherwise, you'll essentially have to guess in advance where the runtime will mount the libraries, based on the fact that the Envbox image is based on Ubuntu (/usr/lib/x86_64-linux-gnu).
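
To make that guess concrete, here is a throwaway heuristic (not envbox code, and guessUsrLibDir is an invented name) that maps an image's /etc/os-release to the library directory the runtime is likely to target:

package main

import (
    "fmt"
    "os"
    "strings"
)

// guessUsrLibDir guesses which directory the NVIDIA runtime will mount the
// driver libraries into, given the contents of an image's /etc/os-release,
// and hence what CODER_USR_LIB_DIR should point at for that image.
func guessUsrLibDir(osRelease string) string {
    lower := strings.ToLower(osRelease)
    switch {
    case strings.Contains(lower, "debian"), strings.Contains(lower, "ubuntu"):
        return "/usr/lib/x86_64-linux-gnu" // Debian/Ubuntu multiarch layout
    case strings.Contains(lower, "rhel"), strings.Contains(lower, "fedora"):
        return "/usr/lib64" // RHEL/UBI and Fedora layout
    default:
        return "/usr/lib"
    }
}

func main() {
    data, err := os.ReadFile("/etc/os-release")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Println("expected library directory:", guessUsrLibDir(string(data)))
}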

@johnstcn (Member, Author) commented Mar 6, 2025

With ghcr.io/coder/envbox-preview:0.6.2-dev-a9acf2c:

[root@workspacecvm /]# nvidia-smi
Thu Mar  6 13:26:15 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       On  |   00000000:00:1E.0 Off |                    0 |
| N/A   24C    P8              8W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
[root@workspacecvm /]# uname -a
Linux workspacecvm 6.1.124-134.200.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Jan 14 08:15:53 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
[root@workspacecvm /]# cat /etc/os-release
NAME="Red Hat Enterprise Linux"
VERSION="9.5 (Plow)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="9.5"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Red Hat Enterprise Linux 9.5 (Plow)"
ANSI_COLOR="0;31"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:redhat:enterprise_linux:9::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9"
BUG_REPORT_URL="https://issues.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 9"
REDHAT_BUGZILLA_PRODUCT_VERSION=9.5
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.5"

Notes:

  • Need to mount the host's /usr/lib64 at /var/coder/usr/lib and set CODER_USR_LIB_DIR=/var/coder/usr/lib.

@johnstcn (Member, Author) commented Mar 6, 2025

Update: looks like this approach can somehow break dnf:

[root@workspacecvm /]# dnf update
Traceback (most recent call last):
  File "/usr/bin/dnf", line 61, in <module>
    from dnf.cli import main
  File "/usr/lib/python3.9/site-packages/dnf/__init__.py", line 30, in <module>
    import dnf.base
  File "/usr/lib/python3.9/site-packages/dnf/base.py", line 29, in <module>
    import libdnf.transaction
  File "/usr/lib64/python3.9/site-packages/libdnf/__init__.py", line 12, in <module>
    from . import conf
  File "/usr/lib64/python3.9/site-packages/libdnf/conf.py", line 13, in <module>
    from . import _conf
ImportError: libpcre.so.3: cannot open shared object file: No such file or directory

@johnstcn (Member, Author) commented Mar 7, 2025

Tested with ghcr.io/coder/envbox-preview:0.6.2-dev-a09c7c2 on EKS:

  • CODER_INNER_IMAGE=registry.access.redhat.com/ubi9/ubi:9.5
  • CODER_ADD_GPU=true
  • CODER_USR_LIB_DIR=/var/coder/usr/lib (/var/coder/usr/lib is the host path /usr/lib64)

Successfully ran the MNIST example from https://github.com/pytorch/examples in the workspace (a quick sanity check of the library layout is sketched below).
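
Illustrative only, not part of envbox: the sanity check is just verifying that the directory CODER_USR_LIB_DIR points at actually contains the driver libraries before trusting GPU support.

package main

import (
    "fmt"
    "os"
    "path/filepath"
)

// Confirm that the directory named by CODER_USR_LIB_DIR contains the NVIDIA
// driver libraries (e.g. libnvidia-ml.so.*).
func main() {
    libDir := os.Getenv("CODER_USR_LIB_DIR") // e.g. /var/coder/usr/lib
    if libDir == "" {
        fmt.Fprintln(os.Stderr, "CODER_USR_LIB_DIR is not set")
        os.Exit(1)
    }
    matches, err := filepath.Glob(filepath.Join(libDir, "libnvidia-ml.so.*"))
    if err != nil || len(matches) == 0 {
        fmt.Fprintf(os.Stderr, "no NVIDIA libraries found under %s\n", libDir)
        os.Exit(1)
    }
    fmt.Printf("found %d NVIDIA ML libraries under %s\n", len(matches), libDir)
}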

johnstcn closed this as completed Mar 7, 2025