Remove the checkout dir if the checkout phase fails #812
Conversation
// This removes the checkout dir, which means the next checkout
// will be a lot slower (clone vs fetch), but hopefully will
// allow the agent to self-heal
_ = b.removeCheckoutDir()
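For context, here's a minimal sketch of what a helper like the `removeCheckoutDir` call above might look like; the `Bootstrap` struct and its `CheckoutDir` field are illustrative stand-ins, not the agent's actual types:

```go
package main

import (
	"fmt"
	"log"
	"os"
)

// Bootstrap stands in for the agent's bootstrap state; only the checkout
// directory path matters for this sketch.
type Bootstrap struct {
	CheckoutDir string
}

// removeCheckoutDir deletes the checkout directory so the next attempt
// starts from a fresh clone rather than a possibly corrupted working copy.
func (b *Bootstrap) removeCheckoutDir() error {
	log.Printf("Removing checkout directory %q", b.CheckoutDir)
	if err := os.RemoveAll(b.CheckoutDir); err != nil {
		return fmt.Errorf("failed to remove %q: %w", b.CheckoutDir, err)
	}
	return nil
}

func main() {
	b := &Bootstrap{CheckoutDir: "/buildkite/builds/my-pipeline"}
	// The diff above deliberately ignores the error: if removal fails,
	// there's nothing more the agent can usefully do at this point.
	_ = b.removeCheckoutDir()
}
```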
This LGTM.
I'd love to see the agent also maintain a bare clone elsewhere as a local mirror, to optimise away the network trip?
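For reference, a rough sketch of that idea, assuming hypothetical paths and a local bare mirror that the agent keeps up to date (this is not how the agent currently behaves):

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// run executes a command and surfaces its output.
func run(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

// freshCheckoutViaMirror keeps a bare mirror of the repository up to date and
// then clones the working checkout with --reference, so most objects come
// from local disk rather than over the network.
func freshCheckoutViaMirror(repoURL, mirrorDir, checkoutDir string) error {
	if _, err := os.Stat(mirrorDir); os.IsNotExist(err) {
		// First time: create the bare mirror (slow, network-bound).
		if err := run("git", "clone", "--mirror", repoURL, mirrorDir); err != nil {
			return fmt.Errorf("creating mirror: %w", err)
		}
	} else {
		// Subsequent runs: just refresh the mirror (cheap fetch).
		if err := run("git", "--git-dir", mirrorDir, "fetch", "--prune", "origin"); err != nil {
			return fmt.Errorf("updating mirror: %w", err)
		}
	}

	// Clone the working checkout, borrowing objects from the local mirror.
	// Adding --dissociate would copy the objects instead of referencing them.
	return run("git", "clone", "--reference", mirrorDir, repoURL, checkoutDir)
}

func main() {
	// Hypothetical paths, purely for illustration.
	err := freshCheckoutViaMirror(
		"https://github.com/buildkite/agent.git",
		"/buildkite/git-mirrors/agent.git",
		"/buildkite/builds/agent",
	)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

With a local mirror, even a from-scratch checkout stays cheap: only the refs that changed since the last mirror fetch cross the network.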
Me too, but the scope of that is a lot bigger I think.
This would be killer.
💯 totally agree. Would happily accept a PR if you'd like to see it sooner @petemounce @jgavris :) - but it's something we absolutely want to do in the future!
Hrm... yeah, I can see why this change is a bit tricky. I'm trying to think of a failure that's an "OK" failure... (one that just needs to be retried, and not a full re-clone). Maybe a transient network failure if it's a super large repository?
Yup, although that was already broken somewhat by destroying the working dir when git clone fails.
But in a transient…
@keithpitt maybe, yeah, but it's also very hard to know what operations might corrupt your git repo.
@lox that's a reassuring thing to need to say about a source control system :)
Distributed systems are hard ;)
A good example of a corrupted git checkout: a lock file left behind by a crashed git process.
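For illustration only (not agent code), one cheap signal of that particular kind of corruption is a leftover `index.lock` inside the checkout's `.git` directory:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// hasStaleGitLock reports whether the checkout still contains .git/index.lock.
// If no git process is currently running against the checkout, the lock was
// almost certainly left behind by a crashed git process, and most subsequent
// git commands in that directory will fail.
func hasStaleGitLock(checkoutDir string) bool {
	_, err := os.Stat(filepath.Join(checkoutDir, ".git", "index.lock"))
	return err == nil
}

func main() {
	if hasStaleGitLock("/buildkite/builds/my-pipeline") {
		fmt.Println("checkout looks corrupted: stale .git/index.lock found")
	}
}
```

Detecting every corruption mode this way is impractical, though, which is part of why simply deleting the checkout on failure is attractive.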
Seen this again - any update on when this might get merged?
@keithpitt and I talked about this at some length. Our concern with this was that deleting checkouts on error is potentially very expensive, since a fresh checkout requires a slow clone, and any ephemeral network error during checkout will now cause a deletion. That said, we think that correctness and self-healing trump fast checkouts. So, we're gonna merge this and then look at bare repositories and reference clones for checkout next. Thanks for your patience all ❤
Thanks! I think that's the right choice. Broken checkouts burn our agents into an unrecoverable state. I couldn't think of a way to detect this automatically that could distinguish between "the agent got into this state" and "the build script did". The former is grounds for auto-burn-and-replace, the latter not, I thought.
There are several ways a git repository checkout can become corrupted that are hard to detect (we have tried). This manifests as failures from one of the several git commands we call: `git remote set-url`, `git clean`, `git checkout`, etc. Previously we dealt with this problem by removing the checkout directory on certain failures, but we keep discovering new ones. We already retry the checkout process if it fails, so this change removes the checkout directory after the first failure, letting the retry start from a fresh clone.
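Roughly, the resulting flow is: attempt the checkout, and on any failure remove the checkout directory so the retry starts from a clean clone. A simplified, self-contained sketch of that flow (the function names, git flags and paths here are illustrative, not the agent's actual code):

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// gitRun executes a git command inside the checkout directory.
func gitRun(dir string, args ...string) error {
	cmd := exec.Command("git", args...)
	cmd.Dir = dir
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

// checkout runs the normal checkout steps against a new or existing checkout.
func checkout(repoURL, checkoutDir string) error {
	if err := os.MkdirAll(checkoutDir, 0o755); err != nil {
		return err
	}
	if _, err := os.Stat(filepath.Join(checkoutDir, ".git")); os.IsNotExist(err) {
		// Empty directory: do the slow, network-bound clone.
		return gitRun(checkoutDir, "clone", repoURL, ".")
	}
	// Existing checkout: any of these can fail if the repo is corrupted.
	for _, args := range [][]string{
		{"remote", "set-url", "origin", repoURL},
		{"clean", "-ffxdq"},
		{"fetch", "origin"},
		{"checkout", "-f", "FETCH_HEAD"},
	} {
		if err := gitRun(checkoutDir, args...); err != nil {
			return err
		}
	}
	return nil
}

// checkoutWithRetry tries once, and on any failure removes the checkout
// directory so the retry starts from a fresh clone.
func checkoutWithRetry(repoURL, checkoutDir string) error {
	if err := checkout(repoURL, checkoutDir); err == nil {
		return nil
	}
	// First attempt failed: burn the checkout and try again from scratch.
	_ = os.RemoveAll(checkoutDir)
	return checkout(repoURL, checkoutDir)
}

func main() {
	if err := checkoutWithRetry("https://github.com/buildkite/agent.git", "/buildkite/builds/agent"); err != nil {
		fmt.Fprintln(os.Stderr, "checkout failed twice:", err)
		os.Exit(1)
	}
}
```

The trade-off discussed above applies: a transient error on the first attempt now costs a full clone on the second, but the agent can recover from a corrupted checkout without human intervention.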