
Add experimental-max-learners flag #13377

Merged
merged 1 commit into from
Nov 15, 2021

Conversation

hexfusion
Contributor

@hexfusion hexfusion commented Sep 30, 2021

This PR adds support for adjusting maxLearners (currently hardcoded as 1) via the configuration flag --experimental-max-learners. Because the value is a runtime configuration, care was taken to ensure proper validation and reduce unexpected situations where the value is not set equally among all members. While it is technically possible to bootstrap a cluster with different values, this is no different from other important runtime configurations such as the heartbeat interval. In general, I don't see a direct need for dynamic reconfiguration during runtime. While I understand a general desire to limit learner counts from a performance standpoint, I can't see a reason to change this value often enough to justify persisting it to disk and exposing it via the API.

key points:

  • the default is unchanged: maxLearners=1
  • the flag is experimental

possible scenarios and expectations

  • An existing cluster has N learners (--experimental-max-learners=N) and the operator would like to reduce the configuration to N-1. In this case a learner must be promoted or removed, bringing the learner count down, before etcd will start with the new configuration; otherwise startup fails with ErrTooManyLearners.

  • An existing cluster has N learners (--experimental-max-learners=N) and a new member has just been added. The runtime configuration is then set to --experimental-max-learners=N-1. etcd will fail to start with ErrTooManyLearners until enough learners are promoted that the current learner count satisfies the configuration.

  • An existing cluster has N learners (--experimental-max-learners=N) and the operator would like to add another learner (N+1). This will result in the client receiving ErrTooManyLearners.

use cases:

  • faster and safer cluster bootstrap: parallel rather than serial addition of members during scale-up, with no quorum loss when scaling from 1 -> 2
  • horizontal and vertical scaling
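The parallel-bootstrap use case might look like the following. This is an illustrative sketch only: the --experimental-max-learners flag and etcdctl's `member add --learner` are real, but the names, URLs, and ports are example values, not a tested recipe.

```shell
# Start the first voting member with a raised learner limit (example URLs).
etcd --name infra0 \
     --experimental-max-learners=2 \
     --listen-client-urls http://127.0.0.1:2379 \
     --advertise-client-urls http://127.0.0.1:2379 &

# With max-learners=2, both learners can be registered without waiting for
# the first to be promoted, i.e. parallel rather than serial scale-up.
etcdctl member add infra1 --learner --peer-urls=http://127.0.0.1:12380 &
etcdctl member add infra2 --learner --peer-urls=http://127.0.0.1:22380 &
wait
```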

@hexfusion hexfusion marked this pull request as draft September 30, 2021 17:06
@hexfusion hexfusion added the WIP label Sep 30, 2021
@hexfusion
Contributor Author

cc @ptabor @gyuho @serathius this is still very early on, but I wanted to get your input on the approach before I went any further. tl;dr: an admin should have the ability to define the number of learners allowed in cluster membership.

@hexfusion hexfusion force-pushed the add-learner-limit-flag branch 8 times, most recently from f32a1d2 to 4bb0b51 Compare November 3, 2021 21:24
@hexfusion hexfusion removed the WIP label Nov 3, 2021
@hexfusion hexfusion marked this pull request as ready for review November 3, 2021 21:24
@hexfusion hexfusion changed the title Add max-learner flag Add experimental-max-learner flag Nov 3, 2021
Contributor

@hasbro17 hasbro17 left a comment


@hexfusion hexfusion changed the title Add experimental-max-learner flag Add experimental-max-learners flag Nov 8, 2021
@hexfusion
Contributor Author

@serathius @ptabor @chaochn47 PTAL

Member

@spzala spzala left a comment


Nice work @hexfusion, I have a couple of comments but lgtm otherwise. Thanks!

@hexfusion hexfusion force-pushed the add-learner-limit-flag branch from 4bb0b51 to 8a160dc Compare November 9, 2021 13:57
@hexfusion hexfusion force-pushed the add-learner-limit-flag branch from 8a160dc to 63a1cc3 Compare November 9, 2021 14:52
@hexfusion
Contributor Author

@spzala updated based on your comments.

Member

@spzala spzala left a comment


lgtm
Thanks for quickly addressing my comments @hexfusion

@hexfusion
Contributor Author

hexfusion commented Nov 11, 2021

cc @ptabor @serathius any thoughts here?

@serathius
Member

serathius commented Nov 15, 2021

Looks great. One thought about configuration that is provided as a local flag: it could be problematic when misconfigured (for example, a misconfigured max-learners could cause hard-to-debug behavior on leader change). Do we have a way for users to detect such cases? For example, I imagine we could expose a metric with a hash of the subset of configuration that is expected to match across the cluster. This way users can create an alert to detect misconfiguration.

This would also help with supportability if we ask users to verify their cluster configuration in Issue template.
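The metric idea could be sketched roughly as follows. Everything here is hypothetical: this PR does not implement the metric, and the flag subset, hash choice, and function names are illustrative (in practice the hash would be exposed as a Prometheus gauge).

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// configHash hashes the subset of flags that are expected to be identical
// on every member. max-learners and heartbeat-interval are used here purely
// as examples of cluster-wide settings.
func configHash(maxLearners int, heartbeatMs int) uint32 {
	s := fmt.Sprintf("max-learners=%d;heartbeat-interval=%d", maxLearners, heartbeatMs)
	return crc32.ChecksumIEEE([]byte(s))
}

func main() {
	// Two members with identical flags produce the same hash...
	fmt.Println(configHash(2, 100) == configHash(2, 100))
	// ...while a member with a different max-learners stands out, so an
	// alert can fire when the per-member hashes disagree.
	fmt.Println(configHash(2, 100) == configHash(1, 100))
}
```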

@hexfusion
Contributor Author

hexfusion commented Nov 15, 2021

> Looks great. One thought about configuration that is provided as a local flag: it could be problematic when misconfigured (for example, a misconfigured max-learners could cause hard-to-debug behavior on leader change). Do we have a way for users to detect such cases? For example, I imagine we could expose a metric with a hash of the subset of configuration that is expected to match across the cluster. This way users can create an alert to detect misconfiguration.
>
> This would also help with supportability if we ask users to verify their cluster configuration in the issue template.

Appreciate the input, I think that is a great idea. If you don't mind I would like that to be a follow-up PR; I will start work on it this week.

@serathius
Member

> Appreciate the input, I think that is a great idea. If you don't mind I would like that to be a follow-up PR; I will start work on it this week.

Sure, I was treating this as a separate feature.

@hexfusion hexfusion merged commit 29c3b0f into etcd-io:main Nov 15, 2021