Enhancement: Multi-Ceph Cluster Support for Topology-Aware Volume Provisioning with ConfigMap-Based Management #5177

Open
ahnuyh opened this issue Feb 25, 2025 · 0 comments

Describe the feature you'd like to have

I would like to propose an enhancement to the existing topology-aware volume provisioning in ceph-csi to support multi-Ceph-cluster environments. Currently, topology-aware provisioning assumes volumes are created within a single Ceph cluster. I'd like to extend this functionality so that specific zones can be mapped to different Ceph clusters, allowing the provisioner to select the appropriate Ceph cluster based on the zone where a pod is scheduled.

What is the value to the end user? (why is it a priority?)

This feature would provide several benefits to end users:

  1. Elimination of a single point of failure: By distributing storage across multiple Ceph clusters aligned with Kubernetes zones, we avoid making the entire regional Kubernetes setup dependent on a single Ceph cluster.

  2. Improved data locality: Volumes would be created in the Ceph cluster that corresponds to the zone where pods are running, potentially reducing network latency.

  3. Better isolation and fault tolerance: Storage failures would be contained within specific zones/clusters rather than affecting the entire environment.

  4. Enhanced scalability: Organizations can scale their storage infrastructure horizontally by adding new Ceph clusters for new zones.

How will we know we have a good solution? (acceptance criteria)

The solution should meet the following criteria:

  1. StorageClass should support specifying multiple Ceph clusters with their corresponding topology information (zones).

  2. When a PVC is created, the provisioner should be able to identify the appropriate Ceph cluster based on the pod's scheduling constraints or node affinity rules.

  3. The solution should seamlessly integrate with existing topology-aware scheduling in Kubernetes.

  4. No changes should be required in applications using the PVCs.

  5. The feature should include documentation on how to configure and use multi-cluster topology-aware provisioning.

  6. Existing deployments using single-cluster topology should continue to work without modification.

  7. The solution should provide clear error messages when no suitable Ceph cluster can be found for a given topology constraint.

Additional context

Here's a sequence diagram showing the proposed workflow:

sequenceDiagram
    participant User
    participant K8s as Kubernetes API
    participant CM as ConfigMap
    participant SC as StorageClass
    participant CSI as CSI Controller
    participant Scheduler as K8s Scheduler
    participant Node as K8s Node
    participant CephA as Ceph Cluster A (Zone A)
    participant CephB as Ceph Cluster B (Zone B)

    User->>K8s: Create cluster topology ConfigMap
    K8s-->>User: ConfigMap created
    User->>K8s: Create StorageClass with volumeBindingMode: WaitForFirstConsumer
    K8s-->>User: StorageClass created
    
    User->>K8s: Create StatefulSet with PVCs using StorageClass
    K8s-->>User: StatefulSet created
    K8s->>K8s: Create unbound PVCs
    
    Note over K8s, Scheduler: For each pod in StatefulSet
    K8s->>Scheduler: Schedule pod
    Scheduler->>K8s: Pod assigned to specific node in Zone A
    K8s->>CSI: CreateVolumeRequest with selected-node and zone info
    CSI->>CSI: pickZoneFromNode() extracts zone from node
    CSI->>CM: Get cluster topology configuration
    CM-->>CSI: Return topology mapping
    CSI->>CSI: Match zone to appropriate Ceph cluster
    Note over CSI: Determine that Zone A maps to Ceph Cluster A
    CSI->>CephA: Create volume
    CephA-->>CSI: Volume created
    CSI->>K8s: Create PV with node affinity for Zone A
    K8s->>K8s: Bind PVC to PV
    K8s->>Node: Start pod with bound volume
    Node->>CSI: Stage and publish volume
    CSI->>CM: Get cluster info for volume
    CM-->>CSI: Return cluster A connection details
    CSI->>CephA: Connect to volume
    CephA-->>Node: Volume mounted
    
    Note over User,K8s: Later - Update topology (no disruption to existing volumes)
    User->>K8s: Update cluster topology ConfigMap
    K8s-->>CM: ConfigMap updated
    Note over CSI: New volumes use updated topology mapping
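
To make the pickZoneFromNode() step above more concrete, here is a minimal Go sketch (ceph-csi is written in Go) of how the controller could derive the zone from the accessibility requirements that the external-provisioner builds from the selected node. The function name and the topology.kubernetes.io/zone key are illustrative assumptions; the key actually used would follow the driver's domain-label configuration.

package multicluster

import (
	"fmt"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// zoneTopologyKey is an assumption for this sketch; the real key follows
// the topology domain labels the driver is configured to advertise.
const zoneTopologyKey = "topology.kubernetes.io/zone"

// zoneFromAccessibilityRequirements walks the topology segments passed in
// the CreateVolumeRequest and returns the first zone it finds, preferring
// the "preferred" list (derived from the selected node) over "requisite".
func zoneFromAccessibilityRequirements(req *csi.CreateVolumeRequest) (string, error) {
	tr := req.GetAccessibilityRequirements()
	if tr == nil {
		return "", fmt.Errorf("CreateVolumeRequest carries no accessibility requirements")
	}
	for _, topologies := range [][]*csi.Topology{tr.GetPreferred(), tr.GetRequisite()} {
		for _, t := range topologies {
			if zone, ok := t.GetSegments()[zoneTopologyKey]; ok && zone != "" {
				return zone, nil
			}
		}
	}
	return "", fmt.Errorf("no %q segment found in accessibility requirements", zoneTopologyKey)
}

With volumeBindingMode: WaitForFirstConsumer, the external-provisioner fills these requirements from the labels of the node the pod was scheduled to, so the zone returned here is the pod's zone.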

  • StorageClass example:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-rbd-multi-cluster
provisioner: rbd.csi.ceph.com
parameters:
  clusterTopologyConfigMap: ceph-cluster-topology
volumeBindingMode: WaitForFirstConsumer
  • ConfigMap example:
apiVersion: v1
kind: ConfigMap
metadata:
  name: ceph-cluster-topology
  namespace: ceph-csi
data:
  config.json: |
    {
      "clusterTopology": [
        {
          "clusterID": "cluster-a",
          "monitors": "mon1:port,mon2:port,mon3:port",
          "zones": ["us-east-1a", "us-east-1b"],
          "pool": "replicapool",
          "cephfs": {
            "subvolumePath": "/volumes"
          }
        },
        {
          "clusterID": "cluster-b",
          "monitors": "mon4:port,mon5:port,mon6:port",
          "zones": ["us-east-1c", "us-east-1d"],
          "pool": "replicapool",
          "cephfs": {
            "subvolumePath": "/volumes"
          }
        }
      ]
    }
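
For illustration, here is a minimal Go sketch of how the provisioner could consume the config.json payload above and map a zone to a cluster entry, returning a descriptive error when no cluster matches (acceptance criterion 7). The type and function names are hypothetical, and only a subset of the fields shown above is modeled.

package multicluster

import (
	"encoding/json"
	"fmt"
)

// clusterTopology mirrors one entry of the proposed config.json; the field
// set here is trimmed to what the lookup needs.
type clusterTopology struct {
	ClusterID string   `json:"clusterID"`
	Monitors  string   `json:"monitors"`
	Zones     []string `json:"zones"`
	Pool      string   `json:"pool"`
}

type topologyConfig struct {
	ClusterTopology []clusterTopology `json:"clusterTopology"`
}

// clusterForZone parses the ConfigMap payload and returns the cluster whose
// zone list contains the requested zone.
func clusterForZone(configJSON []byte, zone string) (*clusterTopology, error) {
	var cfg topologyConfig
	if err := json.Unmarshal(configJSON, &cfg); err != nil {
		return nil, fmt.Errorf("parsing cluster topology config: %w", err)
	}
	for i := range cfg.ClusterTopology {
		for _, z := range cfg.ClusterTopology[i].Zones {
			if z == zone {
				return &cfg.ClusterTopology[i], nil
			}
		}
	}
	return nil, fmt.Errorf("no Ceph cluster configured for zone %q in clusterTopology", zone)
}

Keeping the lookup keyed purely on the zone list also keeps existing single-cluster deployments working unchanged: a config with one entry covering every zone behaves like today's single-cluster setup (acceptance criterion 6).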