-
Notifications
You must be signed in to change notification settings - Fork 301
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ArtifactUploader API calls: faster timeout & retry #2028
Conversation
// TODO: roko.ExponentialSubsecond(1*time.Second) WithMaxAttempts(10) | ||
// Meanwhile, 8 roko.Exponential(2sec) attempts is 1,2,4,8,16,32,64 seconds delay (~2 mins) | ||
roko.WithMaxAttempts(8), | ||
roko.WithStrategy(roko.Exponential(2*time.Second, 0)), | ||
).DoWithContext(ctx, func(r *roko.Retrier) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
).DoWithContext(ctx, func(r *roko.Retrier) error {
timeoutCtx, cancel := context.WithTimeout(ctx, 500 * time.Millisecond)
defer cancel()
if _, err := a.apiClient.UpdateArtifacts(timeoutCtx, a.conf.JobID, statesToUpload); err != nil {
You should be able to use context.WithTimeout
to avoid having to set the timeout on the http client. Just make sure you make a new context and only use it in this request attempt!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh nice, thanks 👍🏼
Although I see one downside… here's the error message with HTTP timeout:
2023-03-22 23:15:43 WARN Put "https://…/v3/jobs/0187095a-5c74-4e20-be5e-1c266391fd10/artifacts": net/http: request canceled (Client.Timeout exceeded while awaiting headers) (Attempt 1/10 Retrying immediately)
And here's the error message with context timeout:
2023-03-23 11:26:12 WARN Put "https://…/v3/jobs/01870bf7-3aae-4b59-84e3-8784629d46f9/artifacts": context deadline exceeded (Attempt 1/8 Retrying in 1s)
The HTTP timeout error message is certainly nicer. But I do like how direct and close-to-the-relevant-action the context timeout is, compared to the HTTP one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want to look into something like this: https://go.dev/blog/http-tracing. Instrumenting the client with http tracing when in debug mode should give an even better view of where this gets stuck.
Looks like you can also localise it using contexts, so we may want to make it super specific to start with.
Previously 60 sec timeout, plus 5 sec until first retry. Now 5 sec timeout, plus 1 sec until first retry.
9f63770
to
d042ce1
Compare
Was previously using default HTTP client timeout of 60 seconds.
Added to PR description:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Maybe make a follow-ups for the subsecond backoff and httptrace if you think you need them.
These artifact retry delay have been further hastened by #2069 |
When ArtifactUploader calls
UpdateArtifacts
(PUT /v3/jobs/{uuid}/artifacts
), in the uncommon event of something stalling server-side or network-wise, the server and the client each take 60 seconds to time out, and the retry would wait a further 5 seconds.UpdateArtifacts
is idempotent and called inside a retry loop, so we're better off timing out sooner and retrying sooner.The p99 response time for this endpoint is ~50ms, so timing out after 5 seconds (100x the p99) seems more than reasonable. And reducing the initial retry delay (but backing off exponentially) also keeps things moving faster.
This means if
buildkite-agent artifact upload …
is invoked with a reasonable external timeout of e.g. 30 or 60 seconds, a single server/network stall won't use up all that time, and a repeated server/network stall will be more visible due to the timeouts being logged.Only the verified-idempotent
UpdateArtifacts
API call is updated in this pull request, although the same treatment could be applied to other endpoints, e.g. the API call that creates the artifact batch is also idempotent,but out of scope for this PR.Update: I've also applied a timeout (10 seconds) to
ArtifactBatchCreator
making theCreateArtifacts
call; that one has a longer p99/max duration server side, hence the longer context timeout for this one. I've verified it is indeed idempotent. Also, the agent splits large numbers of artifacts into batches of up to 30, and this timeout is per-batch, so it won't scale beyond that with the total number of artifacts being uploaded.