
ArtifactUploader API calls: faster timeout & retry #2028

Merged: 2 commits merged into main from artifact-upload-timeout-and-retry-faster on Mar 23, 2023

Conversation

pda (Member) commented Mar 22, 2023

When ArtifactUploader calls UpdateArtifacts (PUT /v3/jobs/{uuid}/artifacts), in the uncommon event of something stalling server-side or network-wise, the server and the client each take 60 seconds to time out, and the retry would wait a further 5 seconds. UpdateArtifacts is idempotent and called inside a retry loop, so we're better off timing out sooner and retrying sooner.

The p99 response time for this endpoint is ~50ms, so timing out after 5 seconds (100x the p99) seems more than reasonable. And reducing the initial retry delay (but backing off exponentially) also keeps things moving faster.

This means if buildkite-agent artifact upload … is invoked with a reasonable external timeout of e.g. 30 or 60 seconds, a single server/network stall won't use up all that time, and a repeated server/network stall will be more visible due to the timeouts being logged.
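Roughly, the change has the following shape. This is a minimal sketch rather than the merged diff: the roko retrier options and the apiClient.UpdateArtifacts call are the ones quoted later in this conversation, while a, ctx and statesToUpload are assumed to come from the surrounding ArtifactUploader method.

    err := roko.NewRetrier(
        roko.WithMaxAttempts(8),
        roko.WithStrategy(roko.Exponential(2*time.Second, 0)), // ~1, 2, 4, 8, ... seconds between attempts
    ).DoWithContext(ctx, func(r *roko.Retrier) error {
        // Each attempt gets its own 5 second deadline derived from the outer ctx,
        // so one stalled request can't consume the whole external timeout.
        timeoutCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
        defer cancel()

        _, err := a.apiClient.UpdateArtifacts(timeoutCtx, a.conf.JobID, statesToUpload)
        return err
    })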

Only the verified-idempotent UpdateArtifacts API call is updated in this pull request. The same treatment could be applied to other endpoints (for example, the API call that creates the artifact batch is also idempotent), but that's out of scope for this PR.

Update: I've also applied a timeout (10 seconds) to ArtifactBatchCreator making the CreateArtifacts call; that one has a longer p99/max duration server side, hence the longer context timeout for this one. I've verified it is indeed idempotent. Also, the agent splits large numbers of artifacts into batches of up to 30, and this timeout is per-batch, so it won't scale beyond that with the total number of artifacts being uploaded.
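A minimal sketch of that per-batch shape, under assumptions: the batch loop and the CreateArtifacts signature below are illustrative, not the ArtifactBatchCreator code from this PR; only the batch size of 30 and the 10 second timeout come from the description above.

    // Hypothetical per-batch loop; the chunking and CreateArtifacts signature are assumed.
    for i := 0; i < len(artifacts); i += 30 {
        end := i + 30
        if end > len(artifacts) {
            end = len(artifacts)
        }
        batch := artifacts[i:end]

        // Each batch gets its own 10 second deadline derived from the outer ctx,
        // so total time scales with the number of batches, not a single stalled request.
        timeoutCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
        _, err := a.apiClient.CreateArtifacts(timeoutCtx, a.conf.JobID, batch)
        cancel()
        if err != nil {
            return err
        }
    }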

// TODO: roko.ExponentialSubsecond(1*time.Second) WithMaxAttempts(10)
// Meanwhile, 8 roko.Exponential(2sec) attempts is 1,2,4,8,16,32,64 seconds delay (~2 mins)
roko.WithMaxAttempts(8),
roko.WithStrategy(roko.Exponential(2*time.Second, 0)),
).DoWithContext(ctx, func(r *roko.Retrier) error {
Contributor commented:

				).DoWithContext(ctx, func(r *roko.Retrier) error {
					timeoutCtx, cancel := context.WithTimeout(ctx, 500 * time.Millisecond)
					defer cancel()

					if _, err := a.apiClient.UpdateArtifacts(timeoutCtx, a.conf.JobID, statesToUpload); err != nil {

You should be able to use context.WithTimeout to avoid having to set the timeout on the http client. Just make sure you make a new context and only use it in this request attempt!

pda (Member Author) replied:

Oh nice, thanks 👍🏼

Although I see one downside… here's the error message with HTTP timeout:

2023-03-22 23:15:43 WARN Put "https://…/v3/jobs/0187095a-5c74-4e20-be5e-1c266391fd10/artifacts": net/http: request canceled (Client.Timeout exceeded while awaiting headers) (Attempt 1/10 Retrying immediately)

And here's the error message with context timeout:

2023-03-23 11:26:12 WARN Put "https://…/v3/jobs/01870bf7-3aae-4b59-84e3-8784629d46f9/artifacts": context deadline exceeded (Attempt 1/8 Retrying in 1s)

The HTTP timeout error message is certainly nicer. But I do like how direct and close-to-the-relevant-action the context timeout is, compared to the HTTP one.

Contributor replied:

We may want to look into something like this: https://go.dev/blog/http-tracing. Instrumenting the client with http tracing when in debug mode should give an even better view of where this gets stuck.

Looks like you can also localise it using contexts, so we may want to make it super specific to start with.
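For reference, a minimal sketch of that httptrace idea (not code from this PR): net/http/httptrace lets you attach a ClientTrace to a single request's context, which keeps the instrumentation localised to that attempt. Here req is assumed to be the *http.Request for one API call, and the hooks simply log (imports: crypto/tls, log, net/http/httptrace).

    trace := &httptrace.ClientTrace{
        DNSStart:             func(i httptrace.DNSStartInfo) { log.Printf("dns start: %s", i.Host) },
        ConnectStart:         func(network, addr string) { log.Printf("connect start: %s %s", network, addr) },
        TLSHandshakeDone:     func(_ tls.ConnectionState, err error) { log.Printf("tls handshake done: err=%v", err) },
        GotConn:              func(i httptrace.GotConnInfo) { log.Printf("got conn (reused=%v)", i.Reused) },
        GotFirstResponseByte: func() { log.Print("got first response byte") },
        WroteRequest:         func(i httptrace.WroteRequestInfo) { log.Printf("wrote request: err=%v", i.Err) },
    }
    // Attach the trace to this request only, so other requests are unaffected.
    req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))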

Previously 60 sec timeout, plus 5 sec until first retry.
Now 5 sec timeout, plus 1 sec until first retry.
pda force-pushed the artifact-upload-timeout-and-retry-faster branch from 9f63770 to d042ce1 on March 23, 2023 01:05
pda changed the title from "ArtifactUploader.apiClient times out and retries faster" to "ArtifactUploader UpdateArtifacts API: faster timeout & retry" on Mar 23, 2023
pda marked this pull request as ready for review on March 23, 2023 01:13
Was previously using default HTTP client timeout of 60 seconds.
pda (Member Author) commented Mar 23, 2023

Added to PR description:

Update: I've also applied a timeout (10 seconds) to ArtifactBatchCreator making the CreateArtifacts call; that one has a longer p99/max duration server side, hence the longer context timeout for this one. I've verified it is indeed idempotent. Also, the agent splits large numbers of artifacts into batches of up to 30, and this timeout is per-batch, so it won't scale beyond that with the total number of artifacts being uploaded.

pda changed the title from "ArtifactUploader UpdateArtifacts API: faster timeout & retry" to "ArtifactUploader API calls: faster timeout & retry" on Mar 23, 2023
pda (Member Author) commented Mar 23, 2023

I've also applied a timeout (10 seconds) to ArtifactBatchCreator making the CreateArtifacts call

Here's what that looks like when running in a build job, with a simulated slow response from CreateArtifacts:

[screenshot of the build job output]

triarius (Contributor) left a comment:

LGTM. Maybe create follow-ups for the subsecond backoff and httptrace if you think you need them.

pda merged commit 98e63e8 into main on Mar 23, 2023
pda deleted the artifact-upload-timeout-and-retry-faster branch on March 23, 2023 05:55
pda (Member Author) commented May 4, 2023

These artifact retry delays have been further shortened by #2069.
