Change start sequence to match StS order, prevent failing pods to block start of others #754
base: master
Conversation
Issue: This PR doesn't seem to behave as it should. When I create a new cassdc, the operator first starts pod -0, then moves to the last pod and works backwards. I'd expect the last pod to start first and the operator to work backwards from there, with pod -0 starting last.
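For example (my reading of the expected behaviour, not something this PR states explicitly): with a three-node rack containing pods -0, -1, and -2, the expected start order would be pod -2, then pod -1, then pod -0.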
You're right, I actually only modified startAllNodes; in a new cluster the first node has a special process. I'll fix that one also.
Just as a heads up (in case it's relevant to this PR): I've seen a lot of issues where Cassandra pods fail to come back up after a restart task is run on a DC. Typically, the symptoms are that no Cassandra logs exist, and it looks like the management API is running but hasn't started the Cassandra process. cass-operator then appears to continually restart and then delete the pod, but no progress is ever made. There aren't typically any signs of trouble in the management API logs.

This issue is intermittent and doesn't appear to be associated with any particular bad config. A node replacement task fixes it, as does simply force-deleting the pod (although that often then requires a few more automated restarts to avoid issues with the IP already existing without a manual replacement having been run). I mention this because the behaviour causes flakes in CI when running the encryption tests (there are a lot of restarts) and has made debugging the tests a bit tricky. Not sure if this PR might help fix it, but I wanted to put it on the radar while we're looking at the node start process.
…est index first and then go lower. At the same time, if a pod failed to start on the previous attempt, move it to the bottom of the start sequence so that one failing node can't block the start sequence of the other pods (in a non-bootstrapped / stopped-then-started situation).
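A minimal sketch of that ordering logic, assuming pod names end in the StatefulSet ordinal and that previous start failures are tracked in a set (the helper names and pod names are hypothetical, not the operator's actual code):

```go
// Sketch: order pods for startup by StatefulSet ordinal, highest first,
// pushing pods that failed a previous start attempt to the end so one
// bad pod cannot block the rest. All identifiers here are illustrative.
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// ordinal extracts the trailing StatefulSet index from a pod name
// such as "dc1-r1-sts-2".
func ordinal(podName string) int {
	parts := strings.Split(podName, "-")
	n, _ := strconv.Atoi(parts[len(parts)-1])
	return n
}

// startOrder sorts pods highest ordinal first, then moves any pod that
// failed its previous start attempt to the bottom of the sequence.
func startOrder(pods []string, failedBefore map[string]bool) []string {
	sort.Slice(pods, func(i, j int) bool {
		return ordinal(pods[i]) > ordinal(pods[j])
	})
	var ok, failed []string
	for _, p := range pods {
		if failedBefore[p] {
			failed = append(failed, p)
		} else {
			ok = append(ok, p)
		}
	}
	return append(ok, failed...)
}

func main() {
	pods := []string{"dc1-r1-sts-0", "dc1-r1-sts-1", "dc1-r1-sts-2"}
	failed := map[string]bool{"dc1-r1-sts-2": true}
	fmt.Println(startOrder(pods, failed))
	// Output: [dc1-r1-sts-1 dc1-r1-sts-0 dc1-r1-sts-2]
}
```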
What this PR does:
Which issue(s) this PR fixes:
Fixes #750
Fixes #757
Checklist