-
Notifications
You must be signed in to change notification settings - Fork 9.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[test] Add e2e downgrade automatic cancellation test #19399
base: main
Are you sure you want to change the base?
Conversation
995bf3e
to
60e8a40
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted filessee 25 files with indirect coverage changes @@ Coverage Diff @@
## main #19399 +/- ##
==========================================
+ Coverage 68.85% 68.90% +0.05%
==========================================
Files 420 420
Lines 35762 35762
==========================================
+ Hits 24624 24643 +19
+ Misses 9714 9702 -12
+ Partials 1424 1417 -7 Continue to review full report in Codecov by Sentry.
|
60e8a40
to
11357be
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@henrybear327 can you please rebase this PR ? I just merged #19398 |
11357be
to
c0f2170
Compare
Saw and done as you posted! |
time.Sleep(etcdserver.HealthInterval) | ||
} | ||
|
||
e2e.DowngradeAutoCancelCheck(t, epc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be flaky, it depends on when the downgrade job gets triggered. A simple way is just to monitor/wait for the log entry "the cluster has been downgraded"
etcd/server/etcdserver/version/monitor.go
Line 143 in f30cbaa
m.lg.Info("the cluster has been downgraded", zap.String("cluster-version", targetVersion)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it's better to
- monitor/wait for the log entry "the cluster has been downgraded"
- call
downgradeCancellation
to verify that it's indeed cancelled
Right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first item is required, the second one is nice to have (optional).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it! Will impl. in a bit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Theoretically, when you see the log message "the cluster has been downgraded
", the leader hasn't send out the downgradeCancel request yet. So when you try to send downgradeCancel yourself in e2e.DowngradeAutoCancelCheck
, it may succeed. I think it's easy to verify if you add a sleep in between.
But in practice, it's unlikely because there is a 5s (etcdserver.HealthInterval
) sleep after the downgrade completion.
So either just fix comment https://github.com/etcd-io/etcd/pull/19399/files#r1957283593 or just remove e2e.DowngradeAutoCancelCheck
(preferred)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest add a TODO item to followup once the downgrade query API is implemented. #19439 (comment)
c0f2170
to
43e04f9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please resolve the comments
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: henrybear327 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
43e04f9
to
a18de65
Compare
7baf245
to
f39e84e
Compare
f39e84e
to
121fa72
Compare
tests/framework/e2e/downgrade.go
Outdated
if opString == "downgrading" && len(membersToChange) == len(clus.Procs) { | ||
lg.Info("Waiting for downgrade completion log line") | ||
leader := clus.WaitLeader(t) | ||
_, err := clus.Procs[leader].Logs().ExpectWithContext(context.TODO(), expect.ExpectedResponse{Value: "the cluster has been downgraded"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please set a timeout, i.e. 30s.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done as requested!
Do you see a risk with this approach of monitoring the leader's log @ahrtr? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only concern is it will fail if the leader somehow changes.
So suggest to add a retry mechanism (i.e. up to 3 times), and let's wait 15s instead of 30s in each retry.
121fa72
to
24d1fc1
Compare
Is this PR still relevant as we are doing #19439? |
Yes, I think so. The auto downgrade cancel should still work. |
FYI. #19451 (comment) Thanks for the effort anyway. @henrybear327 |
3.5 has the auto downgrade cancellation as well, so this test case still makes sense. @fuweid @henrybear327 https://github.com/etcd-io/etcd/blob/release-3.5/server/etcdserver/server.go#L2724 |
But 3.5 doesn't have the downgrade status query API. So this PR should be independent to #19451. So there is no need to wait for that PR. @fuweid @henrybear327 @siyuanfoundation |
So ... is the conclusion to still merge this PR or no? Sorry that I am pretty lost during these exchanges. |
YES. It's nice to merge this PR before next Tuesday. But it isn't a blocker for 3.6.0-rc.1. |
24d1fc1
to
42e29c2
Compare
Verify that the downgrade can be cancelled automatically when the downgrade is completed (using `no inflight downgrade job`` as the indicator) Please see: etcd-io#19365 (comment) Reference: etcd-io#17976 Signed-off-by: Chun-Hung Tseng <[email protected]>
42e29c2
to
65ae06d
Compare
Got it! I have rebased it now, but I will likely complete addressing all comments and incorporating changes regarding cancellation status by Monday noon (as I am out on both days on the weekend). Marking the PR as a draft now. |
Verify that the downgrade can be cancelled automatically when the downgrade is completed (using
no inflight downgrade job
as the indicator)Please see #19365 (comment)
Reference: #17976
Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.