Change start sequence to match StS order, prevent failing pods to block start of others #754

Open
wants to merge 4 commits into
base: master

burmanm
Contributor

@burmanm burmanm commented Feb 10, 2025

What this PR does:

  • Prevent a single misbehaving pod from blocking the start sequence during a cluster stop/start or when new nodes are spawned.
  • Start from the highest index number first, as the StatefulSet controller does.

Which issue(s) this PR fixes:
Fixes #750
Fixes #757

Checklist

  • Changes manually tested
  • Automated Tests added/updated
  • Documentation added/updated
  • CHANGELOG.md updated (not required for documentation PRs)
  • CLA Signed: DataStax CLA

@burmanm burmanm changed the title Start sequence Change start sequence to match StS order, prevent failing pods to block start of others Mar 12, 2025
@burmanm burmanm marked this pull request as ready for review March 12, 2025 15:33
@burmanm burmanm requested a review from a team as a code owner March 12, 2025 15:33
Contributor

@adejanovski adejanovski left a comment

Issue: This PR doesn't seem to behave as it should.
When I create a new cassdc, the operator first starts pod -0, then moves to the last pod and works backwards.
I'd expect the last pod to start first and the operator to work backwards from there, with pod -0 starting last.

@burmanm
Contributor Author

burmanm commented Mar 14, 2025

You're right, I actually modified only startAllNodes; in a new cluster the first node goes through a special process.

@burmanm
Contributor Author

burmanm commented Mar 14, 2025

I'll fix that one also.

@Miles-Garnsey
Member

Just as a heads up (in case it's relevant to this PR), I've seen a lot of issues where Cassandra pods fail to come back up after a restart task is run on a DC. Typically, the symptoms are that no Cassandra logs exist, and the management API appears to be running but hasn't started the Cassandra process. cass-operator then appears to continually restart and then delete the pod, but no progress is ever made. There aren't typically any signs of trouble in the management API logs.

This issue is intermittent and doesn't appear to be associated with any particular bad config. A node replacement task fixes it, as does simply force-deleting the pod (although that often then requires a few more automated restarts to get past issues with the IP already existing when no manual replacement has been run).

I mention this because the behaviour will cause flakes in CI when running the encryption tests (there are a lot of restarts) and has made debugging the tests a bit tricky. Not sure if this PR might help fix it, but wanted to put it on the radar while we're looking at the node start process.

burmanm added 4 commits March 17, 2025 18:22

Verified

This commit was signed with the committer’s verified signature.
burmanm Michael Burman

…est index first and then go lower. At the same time, if a pod has failed to start on the previous attempt, move it to the bottom of the starting sequence to prevent one failing node from blocking the start sequence of other pods (in a non-bootstrapped / stopped/start situation)

@burmanm
Contributor Author

burmanm commented Mar 17, 2025

➜  cass-operator git:(start_sequence) ✗ kubectl get pods
NAME                    READY   STATUS    RESTARTS   AGE
cluster2-dc2-r1-sts-0   1/2     Running   0          118s
cluster2-dc2-r1-sts-1   1/2     Running   0          118s
cluster2-dc2-r1-sts-2   2/2     Running   0          118s
➜  cass-operator git:(start_sequence) ✗
➜  cass-operator git:(start_sequence) ✗ kubectl get pods
NAME                    READY   STATUS    RESTARTS   AGE
cluster2-dc2-r1-sts-0   1/2     Running   0          2m46s
cluster2-dc2-r1-sts-1   2/2     Running   0          2m46s
cluster2-dc2-r1-sts-2   2/2     Running   0          2m46s
➜  cass-operator git:(start_sequence) ✗

->

➜  cass-operator git:(start_sequence) ✗ kubectl get pods
NAME                    READY   STATUS    RESTARTS   AGE
cluster2-dc2-r1-sts-0   2/2     Running   0          4m37s
cluster2-dc2-r1-sts-1   2/2     Running   0          4m37s
cluster2-dc2-r1-sts-2   2/2     Running   0          4m37s
➜  cass-operator git:(start_sequence) ✗

3 participants