rbd: add capability to automatically enable read affinity #3639
Conversation
force-pushed from e995494 to 33a4325
How can this be tested? Some steps/guidance and complete examples for that should probably be documented too.
internal/csi-common/driver.go
Outdated
topology map[string]string
// crushLocationMap that will be used to enable read affinity.
CrushLocationMap map[string]string
does this need to be exported? Ideally it isn't.
does this need to be exported? Ideally it isn't.
Yes, it needs to be exported so that it is accessible to the individual drivers for use in the RPC function calls.
Why is this then different from topology?
Why is this then different from topology?
Currently, the processing required for topology is identical for CephFS and RBD (the topology feature itself is still in beta with a lot of TODOs), whereas the processing required for the CRUSH location may differ between CephFS and RBD (only RBD supports read affinity at the moment). Therefore, it is better to let the individual drivers handle it.
The CSIDriver struct does not expose any attributes at the moment. It would be better to provide access/get function(s) for it instead, sort of like what you have in internal/util/crushlocation.go, but maybe go even further and generate the crush_location= rbd commandline option?
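For illustration, a minimal sketch of what keeping the field unexported behind an accessor could look like; the method name and the copy semantics here are assumptions, not the actual implementation:

```go
package csicommon

// CSIDriver is trimmed down to the one field relevant for this sketch.
type CSIDriver struct {
	crushLocationMap map[string]string // kept unexported
}

// CrushLocationMap returns a copy of the detected CRUSH location labels,
// so the individual drivers can read them without being able to mutate
// the shared driver state.
func (cd *CSIDriver) CrushLocationMap() map[string]string {
	out := make(map[string]string, len(cd.crushLocationMap))
	for k, v := range cd.crushLocationMap {
		out[k] = v
	}

	return out
}
```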
force-pushed from 33a4325 to 3792527
I've documented it here https://github.com/ceph/ceph-csi/pull/3639/files#diff-ffd3d77379c6dad83a2a35da3df2161ac6e62ee31396ce0657ba06662ef0b965 . We just need to add labels to the kubernetes node and set those cmdline args for the rbd nodeplugin daemonset container; the rendered form will look like this: https://github.com/ceph/ceph-csi/blob/37925274b7de1150ca0adc79f703698d89d539ad/docs/deploy-rbd.md#read-affinity-using-crush-locations-for-rbd-volumes
force-pushed from 3792527 to 4eb1778
Good to add an e2e test for this one, to make sure the labels are set on the rbd mapped device.
@@ -81,6 +81,13 @@ func init() {
"",
"list of kubernetes node labels, that determines the topology"+
" domain the node belongs to, separated by ','")
flag.BoolVar(&conf.EnableReadAffinity, "enable-read-affinity", false, "enable read affinity")
flag.StringVar(
Please document that setting this will apply the same affinity to all the Ceph clusters that cephcsi is connected to.
Please document that setting this will apply the same affinity to all the Ceph clusters that cephcsi is connected to.
I think that is implied. There's no statement informing users otherwise.
Having a note doesn't hurt and it helps users to understand the impact of it.
A single Ceph-CSI deployment supports multiple Ceph clusters... What is the reason to have this as a commandline parameter, and not a setting in the StorageClass or the ConfigMap?
A single Ceph-CSI deployment supports multiple Ceph clusters... What is the reason to have this as a commandline parameter, and not a setting in the StorageClass or the ConfigMap?
It's simpler to have it as a command-line parameter.
In most cases, clusters will share the same CRUSH map characteristics, and this design should be sufficient.
StorageClasses are immutable, therefore existing rbd volumes would not be able to use this feature.
The current implementation covers most cases, and there should be no need to have this in a configmap with unique labels per cluster. If such a need arises in the future, it will be easy to override the cmdline labels and prefer per-cluster labels in the configmap.
One more reason is performance: using cmdline args lets us fetch the labels only when the container (re)starts, instead of having to fetch them for each map/mount request.
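A rough sketch of the startup-time label fetch described above, using client-go; the helper name and wiring are illustrative, not the actual cephcsi code:

```go
package nodeplugin

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// getNodeLabels fetches the labels of the node this plugin pod runs on.
// It is meant to be called once when the container (re)starts, so that
// map/mount requests never pay the cost of a Kubernetes API call.
func getNodeLabels(ctx context.Context, nodeName string) (map[string]string, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}

	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return nil, err
	}

	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return nil, err
	}

	return node.GetLabels(), nil
}
```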
It's simpler to have it as a command-line parameter.
For who? You as a developer, or administrators deploying Ceph-CSI? I'd guess a StorageClass+ConfigMap option would be much easier for administrators.
I have no idea if CRUSH maps are commonly the same on different clusters. My assumption would be that it is highly unlikely, and they would be different per cluster.
IMHO, putting this change in the StorageClass would be suitable, as a SC already refers to a particular cluster.
In case of users that upgrade and want to benefit from this feature, placing it in the ConfigMap should work too?
Fetching labels of nodes needs to be dynamic anyway, I guess? What if a cluster changes, nodes get removed, added, relocated? Autoscaling a Kubernetes cluster is a thing, so fetching the labels only at startup of the provisioner isn't correct.
It's simpler to have it as a command-line parameter.
For who? You as a developer, or administrators deploying Ceph-CSI? I'd guess a StorageClass+ConfigMap option would be much easier for administrators.
Both; I'd like to think of this feature as set and forget.
The topology of a cluster with existing nodes is rarely changed. OSDs are assigned using topology labels on the node, and rbd volumes use the same labels (with cephcsi's help) to figure out their nearest OSD for read affinity.
I have no idea if CRUSH maps are commonly the same on different clusters. My assumption would be that it is highly unlikely, and they would be different per cluster.
https://docs.ceph.com/en/quincy/rados/operations/crush-map/#types-and-buckets
We really only care about the CRUSH map type and value, not the exact structure.
If the cluster is situated in Kubernetes, then OSDs are assigned according to the topology labels of the nodes. This is the same set of topology labels that cephcsi will use to extract the CRUSH location and pass it on to the device while mapping. CRUSH maps will not be very different from each other among Ceph clusters running on the same Kubernetes cluster.
IMHO, putting this change in the StorageClass would be suitable, as a SC already refers to a particular cluster.
Existing pvc/storageclasses will not be able to take advantage of this feature.
In case of users that upgrade and want to benefit from this feature, placing it in the ConfigMap should work too?
Fetching labels of nodes needs to be dynamic anyway, I guess? What if a cluster changes, nodes get removed, added, relocated? Autoscaling a Kubernetes cluster is a thing, so fetching the labels only at startup of the provisioner isn't correct.
This is happening in the nodeplugin pod and will be used during mapping.
In any of the scenarios described above, the cephcsi nodeplugin pod will get restarted and we will fetch the labels again.
Each nodeplugin pod cares only about its own node's labels.
- It also avoids repeating the same labels in each StorageClass / per-cluster configmap section.
I guess we can always move it to the ConfigMap if more flexibility is required.
Right, of course it is only used in the node-plugin, so changes to the number of nodes in the Kubernetes cluster is handled automatically.
force-pushed from 4eb1778 to b409c25
internal/rbd/driver/driver.go
Outdated
@@ -168,7 +182,7 @@ func (r *Driver) Run(conf *util.Config) {
if err != nil {
@Madhu-1 @nixpanic
I am removing this segment of code.
ceph-csi/internal/rbd/driver/driver.go
Lines 166 to 176 in 650c522
if !conf.IsControllerServer && !conf.IsNodeServer {
	topology, err = util.GetTopologyFromDomainLabels(conf.DomainLabels, conf.NodeID, conf.DriverName)
	if err != nil {
		log.FatalLogMsg(err.Error())
	}
	r.ns, err = NewNodeServer(r.cd, conf.Vtype, topology)
	if err != nil {
		log.FatalLogMsg("failed to start node server, err %v\n", err)
	}
	r.cs = NewControllerServer(r.cd)
}
Changes to the standalone controllerserver and nodeserver if statements have not been propagated to this part for a long time (see the krbd feature and the replication server), and I don't think this section is necessary.
The complexity of this function will go down and it'll be much clearer.
wdyt?
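For illustration, the simplified flow could look something like the fragment below, with independent branches for the node and controller servers; this is a sketch based on the snippet above, not the actual patch:

```go
// Sketch: independent branches, so a feature added to one server no
// longer has to be mirrored into a combined "both servers" branch.
if conf.IsNodeServer {
	topology, err := util.GetTopologyFromDomainLabels(conf.DomainLabels, conf.NodeID, conf.DriverName)
	if err != nil {
		log.FatalLogMsg(err.Error())
	}

	r.ns, err = NewNodeServer(r.cd, conf.Vtype, topology)
	if err != nil {
		log.FatalLogMsg("failed to start node server, err %v\n", err)
	}
}

if conf.IsControllerServer {
	r.cs = NewControllerServer(r.cd)
}
```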
Sounds good to remove it if it's not required and make it less complex.
force-pushed from 4b902d3 to 3796e54
LGTM, good to see if you can add an E2E test for this one to make sure the functionality works and never breaks in the future.
if !conf.IsControllerServer && !conf.IsNodeServer {
	topology, err = util.GetTopologyFromDomainLabels(conf.DomainLabels, conf.NodeID, conf.DriverName)
	if err != nil {
		log.FatalLogMsg(err.Error())
	}
	r.ns, err = NewNodeServer(r.cd, conf.Vtype, topology)
	if err != nil {
		log.FatalLogMsg("failed to start node server, err %v\n", err)
	}
	r.cs = NewControllerServer(r.cd)
}
This takes out the possibility of running cephcsi as both controller and node server in a single instance; good to add this to the release notes.
Done, added a note in tracker issue #3336 (comment).
Opened a todo issue for the e2e tests and helm options: #3642
internal/rbd/rbd_attach.go
Outdated
}

var b strings.Builder
b.WriteString("read_from_replica=localize,crush_location=")
Maybe use the input string as a const?
Maybe use the input string as a const?
This is currently used in only one place; let's do so when we use it in multiple places.
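As a follow-up sketch only: if the prefix were pulled into a const later, the option construction discussed above might look roughly like this (the function and const names are illustrative, not the actual cephcsi code):

```go
package rbd

import "strings"

// readAffinityMapOptions is the fixed prefix from the snippet above,
// pulled into a const as suggested in the review.
const readAffinityMapOptions = "read_from_replica=localize,crush_location="

// crushLocationOption renders a CRUSH location map such as
// {"zone": "east"} into
// "read_from_replica=localize,crush_location=zone:east".
// Go map iteration order is not deterministic, so the key:value pairs
// may come out in any order.
func crushLocationOption(crushLocationMap map[string]string) string {
	if len(crushLocationMap) == 0 {
		return ""
	}

	var b strings.Builder
	b.WriteString(readAffinityMapOptions)

	first := true
	for key, value := range crushLocationMap {
		if !first {
			b.WriteByte('|')
		}
		b.WriteString(key)
		b.WriteByte(':')
		b.WriteString(value)
		first = false
	}

	return b.String()
}
```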
@@ -68,6 +68,15 @@ spec:
# and pass the label names below, for CSI to consume and advertise
# its equivalent topology domain
# - "--domainlabels=failure-domain/region,failure-domain/zone"
#
# Options to enable read affinity.
# If enabled Ceph CSI will fetch labels from kubernetes node
- # If enabled Ceph CSI will fetch labels from kubernetes node
+ # If enabled Ceph CSI will fetch labels from kubernetes node and
cmd/cephcsi.go
Outdated
"crush-location-labels", | ||
"", | ||
"list of kubernetes node labels, that determines the"+ | ||
" crush location the node belongs to, separated by ','") |
CRUSH is an abbreviation so it should be capitalized in the option descriptions and documentation:
" crush location the node belongs to, separated by ','") | |
" CRUSH location the node belongs to, separated by ','") |
internal/util/crushlocation_test.go
Outdated
args: input{
	crushLocationLabels: "topology.io/zone,topology.io/rack",
	nodeLabels: map[string]string{
		"topology.io/region": "zone1",
"topology.io/region": "zone1", | |
"topology.io/region": "region1", |
internal/util/crushlocation.go
Outdated
nameIdx := strings.IndexRune(key, keySeparator)
crushLocationType := strings.TrimSpace(key[nameIdx+1:])
// replace "." with "-" to satisfy ceph crush map.
value = strings.Replace(strings.TrimSpace(value), ".", "-", -1)
Is this really needed? Are dots in zone and region names expected?
If nothing else, these kinds of transformations should be highlighted in the documentation.
Is this really needed? Are dots in zone and region names expected?
Yes, there may be dots in the values; there's no such restriction on the Kubernetes side.
If nothing else, these kinds of transformations should be highlighted in the documentation.
This is adapted from Rook code, which said the same thing; I'll add a note in the documentation.
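A rough sketch of the label-to-CRUSH-location extraction being discussed, including the dot-to-dash normalization; the function name, separators, and overall shape are assumptions based on the snippet above, not the exact cephcsi code:

```go
package util

import "strings"

const (
	keySeparator   = '/'
	labelSeparator = ","
)

// getCrushLocationMap selects the node labels listed in crushLocationLabels
// (for example "topology.io/zone,topology.io/rack") and turns them into a
// CRUSH location map such as {"zone": "east-1"}. Dots in the values are
// replaced with dashes to satisfy the Ceph CRUSH map, as noted in the
// review thread.
func getCrushLocationMap(crushLocationLabels string, nodeLabels map[string]string) map[string]string {
	wanted := strings.Split(crushLocationLabels, labelSeparator)

	crushLocationMap := make(map[string]string, len(wanted))
	for key, value := range nodeLabels {
		for _, label := range wanted {
			if strings.TrimSpace(label) != key {
				continue
			}

			nameIdx := strings.IndexRune(key, keySeparator)
			crushLocationType := strings.TrimSpace(key[nameIdx+1:])
			// replace "." with "-" to satisfy the Ceph CRUSH map.
			crushLocationMap[crushLocationType] = strings.ReplaceAll(strings.TrimSpace(value), ".", "-")
		}
	}

	return crushLocationMap
}
```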
to Ceph CSI RBD daemonset pod "csi-rbdplugin" container, resulting in Ceph CSI adding
`"--options read_from_replica=localize,crush_location=zone:east"` krbd options during
rbd map operation.
If enabled, this option will be added to all RBD volumes mapped by Ceph CSI.
It would be nice to also note that, for this setting to be effective, the OSDs should be assigned to the same zones, regions, etc respectively on the cluster side (in the CRUSH map). Just declaring that a particular node belongs to a particular zone on the client side is insufficient.
I've modified the first sentence to elaborate on it, but users should find more info on the Ceph krbd map options page, which is linked in the second sentence.
internal/rbd/rbd_attach.go
Outdated
// appendCrushLocationMapOptions parses crushLocationMap to the format
// "--read_from_replica=localize,--crush_location=key1:value1|key3:value2"
Actual parsing happens in getCrushLocationMap:
- // appendCrushLocationMapOptions parses crushLocationMap to the format
- // "--read_from_replica=localize,--crush_location=key1:value1|key3:value2"
+ // appendCrushLocationMapOptions uses crushLocationMap to the format
+ // "read_from_replica=localize,crush_location=key1:value1|key2:value2"
Changed it to "converts" now.
internal/rbd/rbd_attach_test.go
Outdated
{
	name: "filled mapOptions and crushLocationMap",
	args: input{
		mapOptions: "read-only=true",
It doesn't matter for this unit test, but all krbd options use underscores (so read_only, not read-only). Someone grepping for mapOptions examples in the future may get confused.
https://docs.ceph.com/en/latest/man/8/rbd/#kernel-rbd-krbd-options
Changed it to "notrim" instead, to avoid confusion.
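For context, a hedged sketch of what such a table-driven test case might look like after the rename; the input/want field names and the resulting option order are assumptions for illustration only:

```go
{
	name: "filled mapOptions and crushLocationMap",
	args: input{
		mapOptions: "notrim",
		crushLocationMap: map[string]string{
			"zone": "east",
		},
	},
	want: "notrim,read_from_replica=localize,crush_location=zone:east",
},
```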
force-pushed from 3796e54 to 9ee3a66
/test ci/centos/mini-e2e-helm/k8s-1.26
/test ci/centos/mini-e2e/k8s-1.23
/test ci/centos/mini-e2e/k8s-1.24
/test ci/centos/mini-e2e/k8s-1.25
/test ci/centos/mini-e2e/k8s-1.26
/test ci/centos/upgrade-tests-cephfs
/test ci/centos/upgrade-tests-rbd
/retest ci/centos/mini-e2e/k8s-1.25
@Mergifyio refresh
✅ Pull request refreshed
/test ci/centos/k8s-e2e-external-storage/1.23
/test ci/centos/k8s-e2e-external-storage/1.24
/test ci/centos/k8s-e2e-external-storage/1.25
/test ci/centos/k8s-e2e-external-storage/1.26
/test ci/centos/mini-e2e-helm/k8s-1.23
/test ci/centos/mini-e2e-helm/k8s-1.24
/test ci/centos/mini-e2e-helm/k8s-1.25
/test ci/centos/mini-e2e-helm/k8s-1.26
/test ci/centos/mini-e2e/k8s-1.23
/test ci/centos/mini-e2e/k8s-1.24
/test ci/centos/mini-e2e/k8s-1.25
/test ci/centos/mini-e2e/k8s-1.26
/test ci/centos/upgrade-tests-cephfs
/test ci/centos/upgrade-tests-rbd
Describe what this PR does

This commit adds the new cmdline args enable-read-affinity and crush-location-labels to facilitate the new feature of using CRUSH location labels, derived from node labels, to enable read affinity via Ceph's read_from_replica=localize feature. This feature redirects reads to the nearest OSD. It will currently be used only during the rbd map cmd.

This commit makes use of CRUSH location labels from node labels to supply the crush_location and read_from_replica=localize options during the rbd map cmd. Using these options, Ceph will be able to redirect reads to the closest OSD, improving performance.

This commit adds documentation about read affinity support for rbd volumes.