Describe the bug
After upgrading from 3.11.0 to 3.12.1 (using helm), the csi-provisioner log shows constant permission errors. I updated both rbd and cephfs, but the errors only appear in the rbd provisioner.
Environment details
Image/version of Ceph CSI driver: 3.12.1
Helm chart version: 3.12.1
Kernel version: 6.6.6
Mounter used for mounting PVC: krbd
Kubernetes cluster version: 1.30.3
Ceph cluster version: 18.2.4
Steps to reproduce
Steps to reproduce the behavior:
Install and configure rbd 3.11.0 using helm.
Set up a storage class called "rbd".
Create and mount a PVC, watch it get provisioned. (A minimal example PVC manifest is sketched after these steps.)
Unmount and delete the PVC.
Update helm chart to 3.12.1.
Create and mount a PVC, watch it never get provisioned.
Check the csi-provisioner log to see non-stop permission errors.
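For reference, steps 3 and 7 just need any PVC that uses the storage class from step 2; a minimal illustrative manifest (the name and size are placeholders) would be:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc            # illustrative name
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi          # illustrative size
  storageClassName: rbd     # the storage class from step 2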
Actual results
The rbd volume is never provisioned and the pod never starts.
Expected behavior
The provisioner provisions the volume without error, just as it did in 3.11.0.
Logs
The csi-provisioner container logs repeat these messages at some interval:
W0825 13:10:32.334012 1 reflector.go:547] k8s.io/client-go/informers/factory.go:160: failed to list *v1.CSINode: csinodes.storage.k8s.io is forbidden: User "system:serviceaccount:ceph-csi-rbd:ceph-csi-rbd-provisioner" cannot list resource "csinodes" in API group "storage.k8s.io" at the cluster scope
E0825 13:10:32.334037 1 reflector.go:150] k8s.io/client-go/informers/factory.go:160: Failed to watch *v1.CSINode: failed to list *v1.CSINode: csinodes.storage.k8s.io is forbidden: User "system:serviceaccount:ceph-csi-rbd:ceph-csi-rbd-provisioner" cannot list resource "csinodes" in API group "storage.k8s.io" at the cluster scope
W0825 13:10:32.334128 1 reflector.go:547] k8s.io/client-go/informers/factory.go:160: failed to list *v1.Node: nodes is forbidden: User "system:serviceaccount:ceph-csi-rbd:ceph-csi-rbd-provisioner" cannot list resource "nodes" in API group "" at the cluster scope
E0825 13:10:32.334140 1 reflector.go:150] k8s.io/client-go/informers/factory.go:160: Failed to watch *v1.Node: failed to list *v1.Node: nodes is forbidden: User "system:serviceaccount:ceph-csi-rbd:ceph-csi-rbd-provisioner" cannot list resource "nodes" in API group "" at the cluster scope
W0825 13:10:33.398178 1 reflector.go:547] k8s.io/client-go/informers/factory.go:160: failed to list *v1.CSINode: csinodes.storage.k8s.io is forbidden: User "system:serviceaccount:ceph-csi-rbd:ceph-csi-rbd-provisioner" cannot list resource "csinodes" in API group "storage.k8s.io" at the cluster scope
E0825 13:10:33.398197 1 reflector.go:150] k8s.io/client-go/informers/factory.go:160: Failed to watch *v1.CSINode: failed to list *v1.CSINode: csinodes.storage.k8s.io is forbidden: User "system:serviceaccount:ceph-csi-rbd:ceph-csi-rbd-provisioner" cannot list resource "csinodes" in API group "storage.k8s.io" at the cluster scope
W0825 13:10:33.480147 1 reflector.go:547] k8s.io/client-go/informers/factory.go:160: failed to list *v1.Node: nodes is forbidden: User "system:serviceaccount:ceph-csi-rbd:ceph-csi-rbd-provisioner" cannot list resource "nodes" in API group "" at the cluster scope
E0825 13:10:33.480164 1 reflector.go:150] k8s.io/client-go/informers/factory.go:160: Failed to watch *v1.Node: failed to list *v1.Node: nodes is forbidden: User "system:serviceaccount:ceph-csi-rbd:ceph-csi-rbd-provisioner" cannot list resource "nodes" in API group "" at the cluster scope
I see lots of errors in the csi-snapshotter logs too, but I think that's unrelated. (I don't have any VolumeSnapshotClasses defined.)
I see nothing of note in other ceph-csi-rbd logs.
In the application's namespace, I see events like:
0s Normal ExternalProvisioning persistentvolumeclaim/data-mariadb-0 Waiting for a volume to be created either by the external provisioner 'rbd.csi.ceph.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
Additional context
Terraform installation & configuration
This is how I installed and configured it. To update, I only changed the "version" line.
What the previous (successful) version looks like
I tested provisioning and removal using the bitnami/mariadb helm chart. I am using the same chart, configured the same way, for both the old and new versions of ceph-csi-rbd. Here's what the v3.11.0 csi-provisioner log looks like:
I0825 12:36:20.610487 1 leaderelection.go:260] successfully acquired lease ceph-csi-rbd/rbd-csi-ceph-com
I0825 12:36:20.710870 1 controller.go:811] Starting provisioner controller rbd.csi.ceph.com_ceph-csi-rbd-provisioner-66bff55c47-9p97p_4ced4187-44dc-4ba1-9921-9f12d48b958e!
I0825 12:36:20.710892 1 clone_controller.go:66] Starting CloningProtection controller
I0825 12:36:20.710899 1 volume_store.go:97] Starting save volume queue
I0825 12:36:20.710911 1 clone_controller.go:82] Started CloningProtection controller
I0825 12:36:20.811803 1 controller.go:1366] provision "test/data-mariadb-0" class "rbd": started
I0825 12:36:20.811818 1 controller.go:860] Started provisioner controller rbd.csi.ceph.com_ceph-csi-rbd-provisioner-66bff55c47-9p97p_4ced4187-44dc-4ba1-9921-9f12d48b958e!
I0825 12:36:20.812329 1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"test", Name:"data-mariadb-0", UID:"2a45a0a1-d09c-495a-9846-76eed47a0b2c", APIVersion:"v1", ResourceVersion:"65771679", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "test/data-mariadb-0"
I0825 12:36:21.247153 1 controller.go:1449] provision "test/data-mariadb-0" class "rbd": volume "pvc-2a45a0a1-d09c-495a-9846-76eed47a0b2c" provisioned
I0825 12:36:21.247168 1 controller.go:1462] provision "test/data-mariadb-0" class "rbd": succeeded
I0825 12:36:21.699252 1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"test", Name:"data-mariadb-0", UID:"2a45a0a1-d09c-495a-9846-76eed47a0b2c", APIVersion:"v1", ResourceVersion:"65771679", FieldPath:""}): type: 'Normal' reason: 'ProvisioningSucceeded' Successfully provisioned volume pvc-2a45a0a1-d09c-495a-9846-76eed47a0b2c
I0825 13:07:54.161702 1 controller.go:1509] delete "pvc-2a45a0a1-d09c-495a-9846-76eed47a0b2c": started
I0825 13:07:54.586347 1 controller.go:1524] delete "pvc-2a45a0a1-d09c-495a-9846-76eed47a0b2c": volume deleted
I0825 13:07:54.926752 1 controller.go:1561] delete "pvc-2a45a0a1-d09c-495a-9846-76eed47a0b2c": failed to remove finalizer for persistentvolume: Operation cannot be fulfilled on persistentvolumes "pvc-2a45a0a1-d09c-495a-9846-76eed47a0b2c": the object has been modified; please apply your changes to the latest version and try again
W0825 13:07:54.926775 1 controller.go:989] Retrying syncing volume "pvc-2a45a0a1-d09c-495a-9846-76eed47a0b2c", failure 0
E0825 13:07:54.926798 1 controller.go:1007] error syncing volume "pvc-2a45a0a1-d09c-495a-9846-76eed47a0b2c": Operation cannot be fulfilled on persistentvolumes "pvc-2a45a0a1-d09c-495a-9846-76eed47a0b2c": the object has been modified; please apply your changes to the latest version and try again
I0825 13:07:54.926811 1 controller.go:1509] delete "pvc-2a45a0a1-d09c-495a-9846-76eed47a0b2c": started
I0825 13:07:55.034306 1 controller.go:1524] delete "pvc-2a45a0a1-d09c-495a-9846-76eed47a0b2c": volume deleted
I0825 13:07:55.154234 1 controller.go:1569] delete "pvc-2a45a0a1-d09c-495a-9846-76eed47a0b2c": persistentvolume deleted succeeded
In comparison, v3.12.1 gets the lease, fails to create its watches, and keeps retrying forever; it doesn't even notice the PVC.
RBAC
In the installation manifest, permission to read nodes and csinodes is always granted.
In the helm chart, that permission is only granted when the domainLabels list is non-empty, and it is now empty by default (#4776). The provisioner still tries to read nodes and csinodes, and apparently can't finish its setup phase without them. That seems to be why it's failing now.
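To illustrate, this is roughly the values knob involved; the exact key path is an assumption on my part (older chart versions exposed it under a topology block), so check the chart's values.yaml:

topology:
  # Leaving this empty is the new default (#4776) and, per the chart
  # templates, is what now skips the node/csinode RBAC.
  domainLabels: []
  # Setting one or more labels, e.g. the commented line below, would
  # presumably bring the RBAC back, but it also enables topology-aware
  # provisioning, which I don't want.
  # domainLabels:
  #   - topology.kubernetes.io/zone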
At this point, I'm feeling a little lost. I feel like an enabled feature with an empty configuration should do nothing. But this is doing a little too much nothing 😁. Should the provisioner have permission to look at nodes, regardless of whether domain labels are defined in the helm chart?
I saw the discussion of command line arguments in #4777 and #4790. Was that intended to fix this issue? I checked and my provisioner is indeed being passed the --immediate-topology=false flag.
For now, I've worked around it by adding the necessary permissions to the clusterrole. Once I did that, the provisioner finished its startup phase, saw the PVC, provisioned the PV and life is good.
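Concretely, the permissions in question look roughly like this (a sketch based on the errors above; I added equivalent rules to the provisioner ClusterRole created by the chart, whose exact name depends on the release):

- apiGroups: [""]
  resources: ["nodes"]
  # list and watch are what the errors complain about; get is what the
  # upstream external-provisioner RBAC normally includes as well.
  verbs: ["get", "list", "watch"]
- apiGroups: ["storage.k8s.io"]
  resources: ["csinodes"]
  verbs: ["get", "list", "watch"]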
Hmm. With the above workaround, the pod is started and apparently running. But I see this event on the pod that mounted the new PVC:
Warning FailedScheduling 2m37s default-scheduler 0/8 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/8 nodes are available: 8 Preemption is not helpful for scheduling.
Other than that event, I don't see the word "immediate" anywhere in the pod, PVC, or PV. I don't really know what it means.