Remove the checkout dir if the checkout phase fails #812

lox · 2018-08-13T02:45:53Z

There are several ways a git repository checkout can become corrupted that are hard to detect (we have tried). This manifests as failures from one of the several git commands we call: git remote set-url, git clean, git checkout, and so on. Previously we dealt with this problem by removing the checkout directory on certain failures, but we keep discovering new ones.

We already retry the checkout process if it fails, so with this change the checkout directory is removed after the first failure and the retry starts from a fresh clone.
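As a rough illustration of the behaviour described above, here is a minimal sketch of a retry loop that wipes the checkout directory after each failed attempt. The helper name `checkoutWithRetry`, its parameters, and `errSimulated` are hypothetical and are not taken from the agent's code; only the best-effort removal mirrors the `_ = b.removeCheckoutDir()` line in the diff.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// errSimulated stands in for a failing git command in the demo below.
var errSimulated = errors.New("simulated checkout failure")

// checkoutWithRetry is a hypothetical helper, not the agent's real code.
// It runs checkout up to attempts times. After every failed attempt it calls
// removeDir so that the next attempt starts from a clean directory.
func checkoutWithRetry(attempts int, checkout, removeDir func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 1; i <= attempts; i++ {
		if err = checkout(); err == nil {
			return nil
		}
		// Best effort: a failed removal should not mask the checkout error.
		_ = removeDir()
	}
	return fmt.Errorf("checkout failed after %d attempts: %w", attempts, err)
}

func main() {
	dir, err := os.MkdirTemp("", "checkout-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	checkoutDir := filepath.Join(dir, "repo")

	calls := 0
	err = checkoutWithRetry(3,
		func() error {
			calls++
			// Pretend the first attempt hits a corrupted checkout.
			if calls == 1 {
				return errSimulated
			}
			return os.MkdirAll(checkoutDir, 0o755)
		},
		func() error { return os.RemoveAll(checkoutDir) },
	)
	fmt.Println("attempts:", calls, "error:", err)
}
```

The trade-off is the one noted in the diff comment: the attempt after a removal is a full clone rather than a fetch, so it is slower.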

petemounce · 2018-08-13T09:37:37Z

bootstrap/bootstrap.go

+				// This removes the checkout dir, which means the next checkout
+				// will be a lot slower (clone vs fetch), but hopefully will
+				// allow the agent to self-heal
+				_ = b.removeCheckoutDir()


This LGTM.

I'd love to see the agent also maintain a bare clone elsewhere as a local mirror, to optimise away the network trip?

lox · 2018-08-13

Me too, but the scope of that is a lot bigger I think.

jgavris · 2018-08-16

This would be killer.

keithpitt · 2018-08-20

💯 totally agree. Would happily accept a PR if you'd like to see it sooner @petemounce @jgavris :) - but it's something we absolutely want to do in the future!

keithpitt · 2018-08-20T03:51:07Z

Hrm..yeah I can see why this change is a bit tricky. I'm trying to think of a failure that's an "OK" failure... (one that just needs to be retried, and not a full re-clone). Maybe a transient network failure if it's a super large git fetch?

lox · 2018-08-20T04:12:18Z

> Hrm..yeah I can see why this change is a bit tricky. I'm trying to think of a failure that's an "OK" failure... (one that just needs to be retried, and not a full re-clone). Maybe a transient network failure if it's a super large git fetch?

Yup, although that was already broken somewhat by destroying the working dir on git clone failing.

keithpitt · 2018-08-20T04:26:29Z

But in a transient git fetch failure, the original git clone would have succeeded, right? It'd just retry normally and then work the second time (or hopefully the third). Wouldn't this change mean that a regular repo that's working fine would be deleted on a future error that has nothing to do with corruption of the repo?

lox · 2018-08-20T04:37:38Z

@keithpitt maybe, yeah, but it's also very hard to know what operations might corrupt your git repo.

petemounce · 2018-08-20T07:30:53Z

@lox that's a reassuring thing to need to say about a source control system :)

lox · 2018-08-20T07:46:33Z

Distributed systems are hard ;)

lox · 2018-10-01T03:47:19Z

A good example of a corrupted git checkout:

fatal: update_ref failed for ref 'HEAD': cannot lock ref 'HEAD': Unable to create '/var/lib/buildkite-agent/builds/buildkite-xxx-5/xxx/xxx/.git/HEAD.lock': File exists.

Another git process seems to be running in this repository, e.g.
an editor opened by 'git commit'. Please make sure all processes
are terminated then try again. If it still fails, a git process
may have crashed in this repository earlier:
remove the file manually to continue.

Process exited with 128.

This lock file was left by a crashed git process.

petemounce · 2018-10-19T14:00:01Z

Seen this again - any update on when this might get merged?

lox · 2018-10-25T02:11:34Z

@keithpitt and I talked about this at some length. Our concerns with this were that deleting checkouts on error is potentially very expensive, as a fresh checkout requires a slow clone. Any ephemeral network error that occurs in checkout will cause deletion with this change.

That said, we think that correctness and self-healing systems trump fast checkouts. So, we're gonna merge this and then look at bare repositories and reference clones for checkout next. Thanks for your patience all ❤

petemounce · 2018-10-25T06:15:13Z

Thanks! I think that's the right choice. Broken checkouts burn our agents to an unrecoverable state. I couldn't think of a way to detect this automatically that could distinguish between the agent getting into this state and a build script causing it.

The former is grounds for auto-burn-and-replace, the latter not, I thought.

Remove the checkout dir if the checkout phase fails

931197c

lox requested a review from keithpitt August 13, 2018 02:46

petemounce reviewed Aug 13, 2018

lox merged commit 7cee69a into master Oct 25, 2018

lox deleted the always-remove-checkout-dir-on-error branch October 25, 2018 02:11

This was referenced Oct 25, 2018

Feature: retry on text pattern(s) matched in build log output buildkite/feedback#413

Open

Persistent "fatal: reference is not a tree: <sha>" #581

Closed
