Trident operator and controller are deployed on a worker node #981

Open
kimuraos opened this issue Feb 21, 2025 · 4 comments

@kimuraos

Describe the bug
Operators should be deployed on control-plane nodes, but Trident operator and controller are deployed on a worker node.

$ oc get pod -n trident -o wide
NAME                                  READY   STATUS    RESTARTS         AGE   IP             NODE               NOMINATED NODE   READINESS GATES
trident-controller-575f5cb7d4-j8lvh   6/6     Running   0                22d   10.131.1.191   worker-y05by       <none>           <none>
trident-node-linux-k6rgf              2/2     Running   1 (22d ago)      22d   192.168.30.2   worker-y05by       <none>           <none>
trident-node-linux-kw6n4              2/2     Running   1493 (20h ago)   21d   192.168.30.3   worker-btov3       <none>           <none>
trident-node-linux-qml8t              2/2     Running   1 (22d ago)      22d   192.168.30.7   controller-pagau   <none>           <none>
trident-node-linux-tq2sf              2/2     Running   1 (22d ago)      22d   192.168.30.5   controller-t2o32   <none>           <none>
trident-node-linux-vpfvc              2/2     Running   1 (22d ago)      22d   192.168.30.6   controller-ke1nf   <none>           <none>
trident-operator-7b46b5b986-lx4jh     1/1     Running   0                21d   10.131.1.192   worker-y05by       <none>           <none>

Environment

  • Trident version: 24.10.0
  • Kubernetes orchestrator: OpenShift v4.16

To Reproduce
This always happens after installing trident-operator on OpenShift.

I tried specifying controllerPluginNodeSelector in the TridentOrchestrator custom resource, but that leaves the trident-controller pod in the Pending state.

apiVersion: trident.netapp.io/v1
kind: TridentOrchestrator
metadata:
  name: trident
spec:
  debug: false
  namespace: trident
  silenceAutosupport: true
  controllerPluginNodeSelector:
    nodetype: master

$ oc get pod -n trident
NAME                                 READY   STATUS    RESTARTS       AGE
trident-controller-95bc5645f-ljql7   0/6     Pending   0              108s
...
$ oc describe pod trident-controller-95bc5645f-ljql7 -n trident
...
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  50s   default-scheduler  0/5 nodes are available: 2 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/5 nodes are available: 5 Preemption is not helpful for scheduling.

Expected behavior
Most other operators are deployed on control-plane nodes without specifying a nodeSelector, for example:

$ oc get pod -n nvidia-network-operator -o wide
NAME                                                         READY   STATUS    RESTARTS   AGE     IP            NODE               NOMINATED NODE   READINESS GATES
nvidia-network-operator-controller-manager-9fccfb57f-d8jsd   1/1     Running   0          3d22h   10.129.0.43   controller-pagau   <none>           <none>

The following documentation page explains that trident-controller can be deployed on master (i.e., control-plane) nodes.
https://docs.netapp.com/us-en/trident/trident-get-started/kubernetes-customize-deploy.html#sample-configurations

apiVersion: trident.netapp.io/v1
kind: TridentOrchestrator
metadata:
  name: trident
spec:
  debug: true
  namespace: trident
  controllerPluginNodeSelector:
    nodetype: master
  nodePluginNodeSelector:
    storage: netapp

kimuraos added the bug label Feb 21, 2025

@enneitex

Hi,
For the trident-operator deployment, you need to set these two values (judging by your error message, it may also need an etcd toleration):

nodeSelector:
  nodetype: master

tolerations:
- key: "node-role.kubernetes.io/master"
  operator: "Exists"
  effect: "NoSchedule"

The same applies to the trident-controller deployment, using tridentControllerPluginNodeSelector and tridentControllerPluginTolerations.
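
Put together, the four settings named in this comment might look like the fragment below. This is only a sketch: it assumes these are value names for a Helm-based install of the operator (the thread does not say which install method the comment refers to, and the reporter's cluster installs through OLM), and it reuses the `nodetype: master` label from the sample, which only works if that label actually exists on the target nodes.

```yaml
# Sketch of a values fragment; assumes a Helm-based install (not confirmed in this thread).

# Placement of the trident-operator pod
nodeSelector:
  nodetype: master            # must match a label that exists on the control-plane nodes
tolerations:
  - key: "node-role.kubernetes.io/master"
    operator: "Exists"
    effect: "NoSchedule"

# Placement of the trident-controller pod
tridentControllerPluginNodeSelector:
  nodetype: master            # same caveat as above
tridentControllerPluginTolerations:
  - key: "node-role.kubernetes.io/master"
    operator: "Exists"
    effect: "NoSchedule"
```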

@kimuraos
Author

Hi enneitex,
Thanks for your comment.

I changed the TridentOrchestrator custom resource to:

apiVersion: trident.netapp.io/v1
kind: TridentOrchestrator
metadata:
  name: trident
spec:
  debug: false
  namespace: trident
  silenceAutosupport: true
  controllerPluginNodeSelector:
    nodetype: master
  controllerPluginTolerations:
  - key: "node-role.kubernetes.io/master"
    operator: "Exists"
    effect: "NoSchedule"

ref: Customize Trident operator installation

As a result, the controller pod is still in the Pending state.

$ oc describe pod trident-controller-56c4b48cc8-wmjhr -n trident 
Name:             trident-controller-56c4b48cc8-wmjhr
Namespace:        trident
Priority:         0
Service Account:  trident-controller
Node:             <none>
Labels:           app=controller.csi.trident.netapp.io
                  pod-template-hash=56c4b48cc8
Annotations:      openshift.io/required-scc: trident-controller
                  openshift.io/scc: trident-controller
Status:           Pending
...
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/master:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  34s   default-scheduler  0/5 nodes are available: 5 node(s) didn't match Pod's node affinity/selector. preemption: 0/5 nodes are available: 5 Preemption is not helpful for scheduling.

The control-plane nodes have only one taint, node-role.kubernetes.io/master:NoSchedule.

$ oc describe node controller-ke1nf
Name:               controller-ke1nf
Roles:              control-plane,master
...
Taints:             node-role.kubernetes.io/master:NoSchedule
...

On the other hand, on OpenShift the Trident operator is deployed automatically by OLM through a Subscription resource.
The nvidia-network-operator listed above, for example, specifies node affinity in its ClusterServiceVersion (related to #980).
The Trident operator does not specify any affinity in its CSV.

$ oc describe csv nvidia-network-operator -n nvidia-network-operator
Name:         nvidia-network-operator.v24.10.1
...
      Deployments:
        Label:
          Control - Plane:  nvidia-network-operator-controller
        Name:               nvidia-network-operator-controller-manager
        Spec:
          Replicas:  1
          Selector:
            Match Labels:
              Control - Plane:  nvidia-network-operator-controller
          Strategy:
          Template:
            Metadata:
              Annotations:
                kubectl.kubernetes.io/default-container:  manager
              Creation Timestamp:                         <nil>
              Labels:
                Control - Plane:                            nvidia-network-operator-controller
                nvidia.com/ofed-driver-upgrade-drain.skip:  true
            Spec:
              Affinity:
                Node Affinity:
                  Preferred During Scheduling Ignored During Execution:
                    Preference:
                      Match Expressions:
                        Key:       node-role.kubernetes.io/master
                        Operator:  In
                        Values:
                          
                    Weight:  1
                    Preference:
                      Match Expressions:
                        Key:       node-role.kubernetes.io/control-plane
                        Operator:  In
                        Values:
                          
                    Weight:  1
...
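
For an OLM-managed install, one place where placement of the operator pod can be expressed is the Subscription's `spec.config`, which OLM applies to the operator Deployment it creates. The fragment below is a minimal sketch, not something tested in this thread: the metadata values are placeholders, the other Subscription fields are assumed to stay as they are on the cluster, and the label and taint key are taken from the node output in this issue. It would affect only the trident-operator pod, not trident-controller.

```yaml
# Fragment to merge into the existing Subscription; not a complete manifest.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: trident-operator        # placeholder: use the name of the existing Subscription
  namespace: trident            # placeholder: use the namespace of the existing Subscription
spec:
  # channel, name, source, sourceNamespace: unchanged from the existing Subscription
  config:
    nodeSelector:
      node-role.kubernetes.io/master: ""    # label with an empty value, as shown on the control-plane nodes
    tolerations:
      - key: "node-role.kubernetes.io/master"
        operator: "Exists"
        effect: "NoSchedule"
```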

@enneitex

Are you sure the label nodetype: master exists on your control-plane nodes? If not, change controllerPluginNodeSelector and nodeSelector to match an existing label.

@kimuraos
Author

There is no nodetype: master label. I hadn't checked; I just followed the sample.
Even when controllerPluginNodeSelector is omitted, the result is the same.
I deleted the previous TridentOrchestrator and re-applied the following YAML, but the result was the same.

apiVersion: trident.netapp.io/v1
kind: TridentOrchestrator
metadata:
  name: trident
spec:
  debug: false
  namespace: trident
  silenceAutosupport: true
  controllerPluginTolerations:
  - key: "node-role.kubernetes.io/master"
    operator: "Exists"
    effect: "NoSchedule"

$ oc describe pod trident-controller-56c4b48cc8-4khff -n trident
...
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/master:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  2m35s  default-scheduler  0/5 nodes are available: 5 node(s) didn't match Pod's node affinity/selector. preemption: 0/5 nodes are available: 5 Preemption is not helpful for scheduling.

$ oc describe node controller-ke1nf
Name:               controller-ke1nf
Roles:              control-plane,master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=controller-ke1nf
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=
                    node-role.kubernetes.io/master=
                    node.openshift.io/os_id=rhcos
...
Taints:             node-role.kubernetes.io/master:NoSchedule
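
Given the labels listed above, a controller selector that targets a label actually present on the control-plane nodes could be written as in the sketch below. It has not been tested in this thread, and because the pod stays Pending even with no selector configured, the selector may not be the only cause of the scheduling failure.

```yaml
apiVersion: trident.netapp.io/v1
kind: TridentOrchestrator
metadata:
  name: trident
spec:
  debug: false
  namespace: trident
  silenceAutosupport: true
  controllerPluginNodeSelector:
    # Label shown on controller-ke1nf; its value is the empty string, so it must be quoted.
    node-role.kubernetes.io/master: ""
  controllerPluginTolerations:
  # Tolerates the only taint reported on the control-plane nodes.
  - key: "node-role.kubernetes.io/master"
    operator: "Exists"
    effect: "NoSchedule"
```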
