Add experimental-max-learners flag #13377
Conversation
cc @ptabor @gyuho @serathius this is still very early on, but I wanted to get your input on the approach before I went any further. tl;dr: admins should have the ability to define the number of learners allowed in cluster membership.
SGTM
We'll probably need an update to the docs for the experimental flags:
https://etcd.io/docs/v3.5/op-guide/configuration/#experimental-features
https://github.com/etcd-io/website/blob/main/content/en/docs/v3.5/op-guide/configuration.md
@serathius @ptabor @chaochn47 PTAL
Nice work @hexfusion! I have a couple of comments but lgtm otherwise. Thanks!
@spzala updated based on your comments.
lgtm
Thanks for quickly addressing my comments @hexfusion
cc @ptabor @serathius any thoughts here?
Looks great. One thought about configuration that is provided as a local flag: it could be problematic when misconfigured (for example, a misconfigured max-learners could cause hard-to-debug behavior on leader change). Do we have a way for users to detect such cases? For example, I imagine we could expose a metric with a hash of the subset of configuration that is expected to match across the cluster. That way users could create an alert to detect misconfiguration. It would also help with supportability if we ask users to verify their cluster configuration in the issue template.
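To make the idea concrete, here is a minimal sketch of exposing such a config-hash metric with the Prometheus Go client. The metric name, the helper, and the configuration subset are all hypothetical, not etcd's actual API:

```go
// Illustrative sketch only: expose a constant gauge labeled with a hash of
// the configuration subset that must match on every member, so an alert can
// fire when members disagree. Metric and helper names are hypothetical.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"log"
	"net/http"
	"sort"
	"strings"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// hashConfig deterministically hashes the key/value pairs that are expected
// to be identical cluster-wide, by sorting keys before hashing.
func hashConfig(cfg map[string]string) string {
	keys := make([]string, 0, len(cfg))
	for k := range cfg {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	for _, k := range keys {
		fmt.Fprintf(&b, "%s=%s;", k, cfg[k])
	}
	sum := sha256.Sum256([]byte(b.String()))
	return hex.EncodeToString(sum[:8]) // a short prefix is enough for comparison
}

func main() {
	// Hypothetical subset of configuration that must match across members.
	mustMatch := map[string]string{
		"heartbeat-interval":        "100",
		"election-timeout":          "1000",
		"experimental-max-learners": "1",
	}

	configHash := prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "etcd_server_cluster_config_hash", // hypothetical metric name
		Help: "Constant gauge labeled with a hash of configuration expected to match across members.",
	}, []string{"hash"})
	prometheus.MustRegister(configHash)
	configHash.WithLabelValues(hashConfig(mustMatch)).Set(1)

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```

An alert could then fire whenever `count by () (count by (hash) (...))` sees more than one distinct hash value across members.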
Appreciate the input, I think that is a great idea. If you don't mind, I would like that to be a follow-up PR; I will start work on it this week.
Sure, I was treating this as a separate feature.
This PR adds support for adjusting maxLearners (currently hardcoded as 1) via the configuration flag `--experimental-max-learners`. As the value is a runtime configuration, care was taken to ensure proper validation and reduce unexpected situations where the value is not set equally among all members. While it is technically possible to bootstrap a cluster with different values, this is no different from other important runtime configurations such as the heartbeat interval. In general, I don't see a direct need for dynamic reconfiguration during runtime: while I understand a general desire to limit learner counts from a performance standpoint, I can't see a reason to change the value often enough to require persisting it to disk and exposing it via the API. A rough sketch of the flag wiring follows below.
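For illustration, the general shape of moving a hard-coded limit behind a startup flag looks roughly like this (a sketch only, not the actual etcd wiring; the validation message is made up):

```go
// Illustrative sketch: replace a hard-coded learner limit with a startup
// flag and validate it before the server boots. Not etcd's actual code.
package main

import (
	"flag"
	"fmt"
	"os"
)

const defaultMaxLearners = 1 // the previously hard-coded value

func main() {
	maxLearners := flag.Int("experimental-max-learners", defaultMaxLearners,
		"maximum number of learner members allowed in the cluster")
	flag.Parse()

	// Reject nonsensical values at startup rather than failing later.
	if *maxLearners < 1 {
		fmt.Fprintln(os.Stderr, "--experimental-max-learners must be at least 1")
		os.Exit(1)
	}
	fmt.Println("max learners:", *maxLearners)
}
```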
key points:

possible scenarios and expectations (a minimal sketch of the validation follows this list):

- An existing cluster has N learners (`--experimental-max-learners=N`) and would like to reduce the config to N-1. In this case a learner must be promoted or removed, reducing the number of learners, before etcd will start with this configuration; otherwise etcd fails to start with `ErrTooManyLearners`.
- An existing cluster has N learners (`--experimental-max-learners=N`) and a new member has just been added to the cluster with the runtime configuration `--experimental-max-learners=N-1`. etcd will not start, erroring with `ErrTooManyLearners`, until enough learners are promoted that the current learner count fits the configuration.
- An existing cluster has N learners (`--experimental-max-learners=N`) and would like to add another learner (N+1). This will result in the client returning `ErrTooManyLearners`.
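For concreteness, a minimal sketch of this validation (an assumed shape, not the exact etcd implementation; the types and error text are illustrative):

```go
// Sketch of the learner-limit check: adding a learner fails with
// ErrTooManyLearners once the configured limit is reached.
package main

import (
	"errors"
	"fmt"
)

var ErrTooManyLearners = errors.New("membership: too many learner members in cluster")

type Member struct {
	Name      string
	IsLearner bool
}

type Cluster struct {
	MaxLearners int // set from --experimental-max-learners
	Members     []Member
}

// AddLearner enforces the configured limit before admitting a new learner.
func (c *Cluster) AddLearner(m Member) error {
	learners := 0
	for _, existing := range c.Members {
		if existing.IsLearner {
			learners++
		}
	}
	if learners+1 > c.MaxLearners {
		return ErrTooManyLearners
	}
	m.IsLearner = true
	c.Members = append(c.Members, m)
	return nil
}

func main() {
	c := &Cluster{MaxLearners: 1}
	fmt.Println(c.AddLearner(Member{Name: "learner-1"})) // <nil>
	fmt.Println(c.AddLearner(Member{Name: "learner-2"})) // too many learner members
}
```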
use cases: