Change start sequence to match StS order, prevent failing pods to block start of others #754

burmanm · 2025-02-10T17:44:54Z

What this PR does:

Prevent a single badly behaving pod from blocking the start sequence when a cluster is stopped and started or when new nodes are spawned.
Start from the highest index number first, as the StatefulSet controller does.

Which issue(s) this PR fixes:
Fixes #750
Fixes #757

Checklist

Changes manually tested
Automated Tests added/updated
Documentation added/updated
CHANGELOG.md updated (not required for documentation PRs)
CLA Signed: DataStax CLA

adejanovski

Issue: This PR doesn't seem to behave as it should.
When I create a new cassdc, the operator first starts pod -0, then moves to the last pod and goes backwards.
I'd expect the last pod to start first and the operator to go backwards from there, with pod -0 being the last to start.

burmanm · 2025-03-14T11:18:09Z

You're right. I actually modified only startAllNodes; in a new cluster the first node goes through a special process.

burmanm · 2025-03-14T11:18:15Z

I'll fix that one also.

Miles-Garnsey · 2025-03-15T07:08:53Z

Just as a heads up (in case it's relevant to this PR), I've seen a lot of issues where Cassandra pods fail to come back up after a restart task is run on a DC. Typically, the symptoms are that no Cassandra logs exist, and it looks like the management API is running but hasn't started the Cassandra process. cass-operator then appears to continually restart and then delete the pod, but no progress is ever made. There aren't typically any signs of trouble in the management API logs.

This issue is intermittent and doesn't appear to be associated with any particular bad config. A node replacement task fixes it, as does simply force-deleting the pod (although that often then requires a few more automated restarts to avoid issues with the IP already existing and not having run a manual replacement).

I mention this because the behaviour will cause flakes in CI when running the encryption tests (there are a lot of restarts) and has made debugging the tests a bit tricky. Not sure if this PR might help fix it, but wanted to put it on the radar while we're looking at the node start process.

…est index first and then go lower. At the same time, if a pod has failed to start on the previous attempt, move it to the bottom of the starting sequence so that one failing node does not block the start sequence of other pods (in a non-bootstrapped / stop/start situation).
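
The ordering described in this commit message can be sketched in Go as a two-key sort: pods whose previous start attempt failed go to the bottom, and within each group the highest ordinal comes first. The type `startCandidate`, its `FailedStart` field, and the function `orderForStart` are hypothetical names for illustration. How the operator actually records a failed attempt is not shown in this thread, so the boolean flag is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// startCandidate is a hypothetical stand-in for a pod waiting to be started.
type startCandidate struct {
	Name        string
	Ordinal     int
	FailedStart bool // assumption: true when the previous start attempt failed
}

// orderForStart returns a copy of pods ordered so that pods without a failed
// attempt come first (highest ordinal first) and previously failed pods are
// moved to the bottom (also highest ordinal first among themselves).
func orderForStart(pods []startCandidate) []startCandidate {
	out := append([]startCandidate(nil), pods...)
	sort.SliceStable(out, func(i, j int) bool {
		if out[i].FailedStart != out[j].FailedStart {
			return !out[i].FailedStart
		}
		return out[i].Ordinal > out[j].Ordinal
	})
	return out
}

func main() {
	pods := []startCandidate{
		{Name: "sts-0", Ordinal: 0},
		{Name: "sts-1", Ordinal: 1},
		{Name: "sts-2", Ordinal: 2, FailedStart: true},
	}
	// sts-2 failed last time, so sts-1 and sts-0 are tried before it.
	for _, p := range orderForStart(pods) {
		fmt.Println(p.Name)
	}
}
```

With this ordering a failing pod is still retried, but only after the healthy candidates have had their turn, so it cannot hold up the rest of the rack.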

burmanm · 2025-03-17T16:40:40Z

➜  cass-operator git:(start_sequence) ✗ kubectl get pods
NAME                    READY   STATUS    RESTARTS   AGE
cluster2-dc2-r1-sts-0   1/2     Running   0          118s
cluster2-dc2-r1-sts-1   1/2     Running   0          118s
cluster2-dc2-r1-sts-2   2/2     Running   0          118s
➜  cass-operator git:(start_sequence) ✗

➜  cass-operator git:(start_sequence) ✗ kubectl get pods
NAME                    READY   STATUS    RESTARTS   AGE
cluster2-dc2-r1-sts-0   1/2     Running   0          2m46s
cluster2-dc2-r1-sts-1   2/2     Running   0          2m46s
cluster2-dc2-r1-sts-2   2/2     Running   0          2m46s
➜  cass-operator git:(start_sequence) ✗

->

➜  cass-operator git:(start_sequence) ✗ kubectl get pods
NAME                    READY   STATUS    RESTARTS   AGE
cluster2-dc2-r1-sts-0   2/2     Running   0          4m37s
cluster2-dc2-r1-sts-1   2/2     Running   0          4m37s
cluster2-dc2-r1-sts-2   2/2     Running   0          4m37s
➜  cass-operator git:(start_sequence) ✗

burmanm force-pushed the start_sequence branch from 3ac0fa9 to 15c32a9 Compare March 12, 2025 15:32

burmanm changed the title ~~Start sequence~~ Change start sequence to match StS order, prevent failing pods to block start of others Mar 12, 2025

burmanm marked this pull request as ready for review March 12, 2025 15:33

burmanm requested a review from a team as a code owner March 12, 2025 15:33

adejanovski requested changes Mar 14, 2025

burmanm added 4 commits March 17, 2025 18:22

Do not check the HostID of the pod if it is not ready

Verified

This commit was signed with the committer’s verified signature.

burmanm Michael Burman

SSH Key Fingerprint: JrVCvdHpTq2v5zpRa6m+SkkPHe1MjzaKYd8Y0gWbd+s

936cadf

Modify changelog to include 757

27213e1

burmanm force-pushed the start_sequence branch from 15c32a9 to d7de3d1 Compare March 17, 2025 16:32
