Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove the checkout dir if the checkout phase fails #812

Merged
merged 1 commit into from
Oct 25, 2018

Conversation

lox
Copy link
Contributor

@lox lox commented Aug 13, 2018

There are several ways a git repository checkout can become corrupted that are hard to detect (have tried). This manifests as failures from one of the several git commands we call, either the git remote set-url, or the git clean or the git checkout, etc. Previously we dealt with this problem by removing the checkout directory on certain failures, but we keep discovering new ones.

We retry the checkout process if it fails, so this removes the checkout after the first failure.

@lox lox requested a review from keithpitt August 13, 2018 02:46
// This removes the checkout dir, which means the next checkout
// will be a lot slower (clone vs fetch), but hopefully will
// allow the agent to self-heal
_ = b.removeCheckoutDir()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM.

I'd love to see the agent also maintain a bare clone elsewhere as a local mirror, to optimise away the network trip?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Me too, but the scope of that is a lot bigger I think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be killer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯 totally agree. Would happily accept a PR if you'd like to see it sooner @petemounce @jgavris :) - but it's something we absolutely want to do in the future!

@keithpitt
Copy link
Member

Hrm..yeah I can see why this change is a bit tricky. I'm trying to think of a failure that's an "OK" failure... (one that just needs to be retried, and not a full re-clone). Maybe a transient network failure if it's a super large git fetch?

@lox
Copy link
Contributor Author

lox commented Aug 20, 2018

Hrm..yeah I can see why this change is a bit tricky. I'm trying to think of a failure that's an "OK" failure... (one that just needs to be retried, and not a full re-clone). Maybe a transient network failure if it's a super large git fetch?

Yup, although that was already broken somewhat by destroying the working dir on git clone failing.

@keithpitt
Copy link
Member

But in a transient git fetch failure, the original git clone would have succeeded right? It'd just retry normally and then work the second time (or hopefully the third). Wouldn't this change mean that a regular repo that's working fine, be deleted on a future error that has nothing to do with the corruption of the repo?

@lox
Copy link
Contributor Author

lox commented Aug 20, 2018

@keithpitt maybe, yeah, but it's also very hard to know what operations might corrupt your git repo.

@petemounce
Copy link
Contributor

@lox that's a reassuring thing to need to say about a source control system :)

@lox
Copy link
Contributor Author

lox commented Aug 20, 2018

Distributed systems are hard ;)

@lox
Copy link
Contributor Author

lox commented Oct 1, 2018

A good example of a corrupted git checkout:

fatal: update_ref failed for ref 'HEAD': cannot lock ref 'HEAD': Unable to create '/var/lib/buildkite-agent/builds/buildkite-xxx-5/xxx/xxx/.git/HEAD.lock': File exists.

Another git process seems to be running in this repository, e.g.
--
  | an editor opened by 'git commit'. Please make sure all processes
  | are terminated then try again. If it still fails, a git process
  | may have crashed in this repository earlier:
  | remove the file manually to continue.

Process exited with 128.

This lock file was left by a crashed git process.

@petemounce
Copy link
Contributor

Seen this again - any update on when this might get merged?

@lox
Copy link
Contributor Author

lox commented Oct 25, 2018

@keithpitt and I talked about this at some length. Our concerns with this were that deleting checkouts on error is potentially very expensive, as a fresh checkout requires a slow clone. Any ephemeral network error that occurs in checkout will cause deletion with this change.

That said, we think that correctness and self-healing systems trumps fast checkouts. So, we're gonna merge this and then look at bare repositories and reference clones for checkout next. Thanks for your patience all ❤

@lox lox merged commit 7cee69a into master Oct 25, 2018
@lox lox deleted the always-remove-checkout-dir-on-error branch October 25, 2018 02:11
@petemounce
Copy link
Contributor

Thanks! I think that's the right choice. Broken checkouts burn or agents to an unrecoverable state. I didn't think of a way to detect this automatically, in a way that could distinguish between agent got into this state vs build script did.

The former is grounds for auto-burn-and-replace, the latter not, I thought.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants