
Introduce resumable downloads with --resume-retries #12991

Open · wants to merge 27 commits into base: main
Conversation


@gmargaritis gmargaritis commented Oct 4, 2024

Resolves #4796

Introduced the --resume-retries option to allow resuming incomplete downloads in case of dropped or timed-out connections.

This option additionally uses the values specified for --retries and --timeout for each resume attempt, since they are passed in the session.

Used 0 as the default in order to keep backwards compatibility.

This PR is based on #11180

The downloader will make new requests and attempt to resume downloading using a Range header. If the initial response includes an ETag (preferred) or Date header, the downloader will ask the server to resume downloading only when it is safe (i.e., the file hasn't changed since the initial request) using an If-Range header.

If the server responds with a 200 (e.g. if the server doesn't support partial content or can't check if the file has changed), the downloader will restart the download (i.e. start from the very first byte); if the server responds with a 206 Partial Content, the downloader will resume the download from the partially downloaded file.
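The flow described above can be sketched roughly as follows. This is an illustrative simplification, not pip's actual implementation; the function names and the use of the standard library `urllib` are assumptions:

```python
import os
import urllib.request


def build_resume_headers(bytes_received, etag=None, date=None):
    """Build headers for a conditional range request resuming at an offset."""
    headers = {"Range": f"bytes={bytes_received}-"}
    # Prefer the ETag as the If-Range validator, falling back to Date, so the
    # server only honours the range if the file hasn't changed in between.
    validator = etag or date
    if validator:
        headers["If-Range"] = validator
    return headers


def resume_download(url, filepath, etag=None, date=None):
    """Resume (or restart) a download of url into filepath."""
    bytes_received = os.path.getsize(filepath)
    req = urllib.request.Request(
        url, headers=build_resume_headers(bytes_received, etag, date)
    )
    with urllib.request.urlopen(req) as resp:
        # 206 Partial Content: append to the partial file.
        # 200 OK: the server can't (or won't) resume, so restart from byte 0.
        mode = "ab" if resp.status == 206 else "wb"
        with open(filepath, mode) as f:
            while chunk := resp.read(64 * 1024):
                f.write(chunk)
```

Servers that ignore Range (or fail the If-Range check) simply answer 200 with the full body, which is why the 200 branch truncates and rewrites the file.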

yichi-yang and others added 3 commits September 26, 2024 21:26
- Added --resume-retries option to allow resuming incomplete downloads
- Setting --resume-retries=N allows pip to make N attempts to resume downloading in case of dropped or timed-out connections
- Each resume attempt uses the values specified for --retries and --timeout internally

Signed-off-by: gmargaritis <[email protected]>
@gmargaritis
Author

I'm guessing the CI fails because of the new linter rules introduced in 102d818

@thk686

thk686 commented Oct 4, 2024

Does this do rsync-style checksums? That would increase reliability.

@notatallshaw
Member

I'm guessing the CI fails because of the new linter rules introduced in 102d818

This is the CI fix; CI will keep failing until it's merged: #12964

@gmargaritis
Author

Hey @notatallshaw 👋

Is there anything that I can do to move this one forward?

@notatallshaw
Member

notatallshaw commented Dec 11, 2024

Is there anything that I can do to move this one forward?

A pip maintainer needs to take up the task of reviewing it, as we're all volunteers it's a matter of finding time.

I think my main concern would be the behavior when interacting with index servers that behave badly, e.g. give the wrong content length (usually 0). Your description looks good to me, but I haven't had time to look over the code yet.

@gmargaritis
Author

A pip maintainer needs to take up the task of reviewing it, as we're all volunteers it's a matter of finding time.

Yeah, I know how it goes, so no worries!

If you need any clarifications or would like me to make changes, I'd be happy to help!

@art-ignatev

Any chance it'll be merged soon?

@notatallshaw notatallshaw added this to the 25.1 milestone Feb 1, 2025
@notatallshaw
Member

I've had an initial cursory glance at this PR and it appears to be of sufficiently high quality.

I've also run the functionality locally (select a large wheel to download and then disconnect my WiFi midway through the download) and it has a good UX.

My main concern, although this is a ship that has probably sailed, is that it would be nice for pip not to have to handle HTTP intricacies directly, leaving that to a separate library.

I can’t promise a full review or that other maintainers will agree, but I am adding it to the 25.1 milestone so it can be tracked.

@pfmoore
Member

pfmoore commented Feb 1, 2025

The PR looks good, although I’m not an HTTP expert so I can’t comment on details like status and header handling. Like @notatallshaw, I wish we could leave this sort of detail to a third-party library, but that would be a major refactoring. Add this PR (along with cert handling, parallel downloads, etc.) to the list of reasons we should consider such a refactoring; in the meantime, I’m in favour of adding this.

@pfmoore
Member

pfmoore commented Feb 1, 2025

There isn’t an “approve with conditions” button, but I approve this change on the basis that someone who understands HTTP should check the header and status handling.

@ichard26
Member

ichard26 commented Feb 1, 2025

I'll tack this onto my to-do list. Not sure if I can call myself an HTTP expert, but I've done a fair bit of webdev as a hobby, so I'm decently familiar with HTTP statuses and header handling.

Sorry for taking so long to review. Large PRs like these are appreciated since they do often implement major improvements, but they're also tedious to review and pretty daunting. Not really a good excuse, but that's how it feels. Thanks @notatallshaw for the initial pass and confirming this is worth the look.

@ichard26 ichard26 self-requested a review February 1, 2025 19:23
@gmargaritis
Author

Awesome! Thank you for all your efforts!

Don’t worry about it, I know how it feels! Let me know if you need anything ✌️

@Ibrahima-prog

Hopefully this gets added soon; downloading GBs of stuff over slow internet and then having to restart from the beginning is not an experience I would recommend.

@ichard26
Member

ichard26 commented Mar 2, 2025

@Ibrahima-prog I hear ya! This is on my radar to review. I haven't gotten around to it yet. And truthfully, I probably won't find the time until at least next Thursday. This will make it into the pip 25.1 release.

Member

@notatallshaw notatallshaw left a comment


I've read through the code a couple of times and tried to educate myself on the relevant HTTP tags, as well as trying it against PyPI. Overall I'm happy with this PR, but I have left a few comments on edge cases that I would like you to address or provide thoughts on.

On the topic of edge cases, here's an example where a range request is possible on the index but the HEAD request to check returns a 405: astral-sh/uv#11379. I don't think this PR is affected by that behavior, but it's an interesting example of how edge-casey all of this can be.

Comment on lines +190 to +193
if bytes_received < total_length:
self._attempt_resume(
resp, link, content_file, total_length, bytes_received, filepath
)
Member

@notatallshaw notatallshaw Mar 9, 2025


I have one concern and one nitpick here:

Concern: It looks like pip was previously only using total_length for the progress bar; it was not validating that the download actually matched the total length. Should we be concerned that there are people using buggy HTTP servers that provide the wrong Content-Length?

While I do like erroring out when the download is clearly incomplete or wrong, and I think it's the correct default behavior, if users do complain that this breaks pip for them, what do we tell them? Should we provide an escape hatch for users of broken HTTP servers? I appreciate this is a hypothetical.

Nitpick: It would be nice if, when self._resume_retries is 0, self._attempt_resume is never called.

Member

@ichard26 ichard26 Mar 13, 2025


But if users do complain that this breaks pip for them, what do we tell them? Should we provide an escape hatch for users of broken HTTP servers? I appreciate this is a hypothetical.

I'm going to say that we can release this as-is, and if enough people complain, then we can add an escape hatch. I don't want to add flags prematurely. My gut is that at least one person is going to complain, but they really should fix their HTTP server.

Nitpick: It would be nice if, when self._resume_retries is 0, self._attempt_resume is never called.

Agreed 👍

Author


I'm going to say that we can release this as-is, and if enough people complain, then we can add an escape hatch. I don't want to add flags prematurely. My gut is that at least one person is going to complain, but they really should fix their HTTP server.

Those were my thoughts as well 👆

When self._resume_retries is 0, self._attempt_resume is not called:

https://github.com/gmargaritis/pip/blob/d7942ef1a4ac356374e995bc56e523b45f425b11/src/pip/_internal/network/download.py#L165-L167

Also, we terminate the loop when there are 0 retries left:

https://github.com/gmargaritis/pip/blob/d7942ef1a4ac356374e995bc56e523b45f425b11/src/pip/_internal/network/download.py#L255-L256

Comment on lines 818 to 823
def __init__(self, link: str, resume_retries: int) -> None:
message = (
f"Download failed after {resume_retries} attempts because not enough"
" bytes were received. The incomplete file has been cleaned up."
)
hint = "Use --resume-retries to configure resume retry limit."
Member

@notatallshaw notatallshaw Mar 9, 2025


Can you add the number of bytes downloaded, or at least special-case when 0 bytes were downloaded and let the user know that no data was received?

I have commonly seen the error when a corporate firewall allows an HTTP GET to start but blocks all the data and no data is downloaded, resulting in an empty file.
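A special case for zero bytes could look something like this (a hypothetical sketch; the class name and wording are illustrative, not the committed change):

```python
class IncompleteDownloadError(Exception):
    """Raised when fewer bytes than Content-Length promised were received."""

    def __init__(self, link, resume_retries, bytes_received):
        if bytes_received == 0:
            # Likely a firewall or proxy allowed the request to start but
            # blocked the body entirely, leaving an empty file.
            detail = "no data was received (the connection may be blocked)"
        else:
            detail = f"only {bytes_received} bytes were received"
        super().__init__(
            f"Download of {link} failed after {resume_retries} resume attempts"
            f" because {detail}. The incomplete file has been cleaned up."
        )
```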

Author


Implemented in c93c72a

except (ConnectionError, ReadTimeoutError):
continue

if total_length and bytes_received < total_length:
Member

@notatallshaw notatallshaw Mar 9, 2025


What do you think about changing the < to !=?

I have seen corporate proxies return completely different data, such as a page saying the download is not allowed with the full text of the internal policies related to network traffic.

This would change the semantics of the error, something like DownloadError instead of IncompleteDownloadError, with verbiage related to it possibly being incomplete or a blocked network request.

Member


I'd prefer if we more strongly suggest that the download was incomplete. I agree that we should mention the possibility that the response is total nonsense (yay for enterprise proxies), but the error message should emphasize the more likely culprit of an incomplete download. Perhaps we could check the response Content-Type and if it isn't what we're expecting, then we can assume the response is total nonsense?

Author


Since this logic is within the scope of resumable downloads, I’d suggest leaving this out of scope for now, as it affects pip’s overall download behavior, not just retries.

Changing < to != would alter the semantics of the error, making it more about detecting completely different responses rather than just incomplete downloads. If we want to handle cases where proxies return unexpected content (like policy pages), that should be considered holistically across all downloads, not just resumable ones.

For now, the retry mechanism should continue treating incomplete downloads as the primary concern. If the connection is stable, pip won’t crash, and a broader discussion would be needed for verifying response integrity (e.g., checking Content-Type).

Member

@ichard26 ichard26 left a comment


(This is a preliminary review consisting of feedback that immediately came to mind. I will need to review this again in more detail.)

Thank you so much for working on this! It's great to see the resumption of incomplete downloads make progress! I left some initial comments, but I have a larger question: why are the resume retries set to 0 by default?

I'd prefer if pip's behaviour emulated that of a browser, where the incomplete download is cached so it can be resumed at some later point, but I realize that would significantly increase the complexity of the implementation (and it'd also introduce some friction, as the download would have to be manually restarted).

However, defaulting to not resuming results in poor UX IMO. Imagine I download half of a $very-large-wheel (e.g., PyTorch @ ~1 GB) and then the download is dropped. I get an incomplete download error explaining what happened. Good! I try again with --resume-retries 5. What's not so good is that pip will have to download all of said $very-large-wheel again.

If I'm on a slow or metered connection, I'd be frustrated that I have to download everything again. Doubly so if the connection failed at 90% or similar. In addition, it's not immediately clear how many resumption retries I should pass to pip. 1? 2?

Would it be possible to default to allowing a few (1-3) resume attempts? That way, if the download fails halfway through, it will be given another shot. It may not be enough if the connection is so unstable that it requires a ton of resumes, but for one-off failures, it would still be a major improvement. As long as the messaging is clear, I don't think automatic resumes would be that annoying to the user. [1] I consider resumes the preferred option and opting out of resumption to be an exceptional (but still important to support!) case.

Anyway, thank you again for your patience! I also appreciate all of the tests (although I have only scanned through them very briefly). Despite the flaws and my critiques, this is a major step forward, giving users a fighting chance to download large distributions on unstable connections.

Footnotes

  1. Although if resumes are the default, perhaps we shouldn't allow the download to restart from zero (i.e., when range requests are NOT supported) multiple times? Downloading the whole file numerous times over could be very slow and surprising (especially for users on metered connections) and thus be something they need to opt into... although that would make the default special: a default of --resume-retries 3 would be treated differently from the user specifying --resume-retries 3.

Comment on lines 142 to 147
# request a partial download
if range_start:
headers["Range"] = f"bytes={range_start}-"
# make sure the file hasn't changed
if if_range:
headers["If-Range"] = if_range
Member


Can we enforce that these two parameters must be given simultaneously? While it is permissible to issue a range request without If-Range, it is generally inadvisable as then we lose the protection that the file hasn't been changed in between retries. For PyPI, this is unlikely to be a problem as the distribution files never change, and the index pages are so small that they will be rarely retried, but unless there is good reason to, I'd prefer requiring the safety net of If-Range.
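One way to enforce that pairing is to fail fast when only one of the two values is supplied (an illustrative sketch; the helper name is hypothetical):

```python
def build_range_headers(range_start, if_range):
    """Build partial-download headers, refusing an unvalidated range request.

    A Range header without If-Range would lose the guarantee that the file
    hasn't changed on the server between retries, so both must be given
    together (or neither).
    """
    if (range_start is None) != (if_range is None):
        raise ValueError("range_start and if_range must be given together")
    headers = {}
    if range_start is not None:
        headers["Range"] = f"bytes={range_start}-"
        headers["If-Range"] = if_range
    return headers
```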

Author


Implemented in 66f68ca


return bytes_received

def _attempt_resume(
Member

@ichard26 ichard26 Mar 12, 2025


Is it possible for total_length to be None and resumption to still function? AFAICT by reading the current logic, no. The annotation can be changed to int and the tests for total_length can be dropped in the function's body.

Author


It is indeed not possible for resumption to function if total_length is None, but since _attempt_resume relies on _get_http_response_size among other things, we have to define it as Optional[int]:

https://github.com/gmargaritis/pip/blob/d7942ef1a4ac356374e995bc56e523b45f425b11/src/pip/_internal/network/download.py#L25-L29
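For context, a Content-Length helper along these lines naturally returns Optional[int], since the header may be missing or unparsable (a sketch of the pattern, not pip's exact code):

```python
from typing import Optional


def get_http_response_size(headers: dict) -> Optional[int]:
    """Parse Content-Length from response headers, or None if unusable."""
    try:
        return int(headers["content-length"])
    except (KeyError, ValueError, TypeError):
        return None
```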


@ichard26
Member

Does this do rsync-style checksums? That would increase reliability.

@thk686 I'm very late to the party, but could you elaborate on how checksums come into play? AFAIK, indices don't serve the checksums of their distributions, so there is no way pip could double-check that the download wasn't corrupted unless the checksums were given by the user. This PR uses conditional range requests (via the If-Range HTTP request header), which avoids the issue of the file being changed on the server in between requests.

@thk686

thk686 commented Mar 13, 2025 via email

@gmargaritis
Author

gmargaritis commented Mar 16, 2025

@ichard26

I'd prefer if pip's behaviour emulated that of a browser, where the incomplete download is cached so it can be resumed at some later point, but I realize that would significantly increase the complexity of the implementation (and it'd also introduce some friction, as the download would have to be manually restarted).

There has been some discussion around this in the past [1] and I’d pretty much prefer it. However, I think it’s out of scope for this first step in implementing resumable downloads, considering the amount of work needed.

Would it be possible to default to allowing a few (1-3) resume attempts? That way, if the download fails halfway through, it will be given another shot. It may not be enough if the connection is so unstable that it requires a ton of resumes, but for one-off failures, it would still be a major improvement. As long as the messaging is clear, I don't think automatic resumes would be that annoying to the user. I consider resumes the preferred option and opting out of resumption to be an exceptional (but still important to support!) case.

I initially set the default --resume-retries to 0 for backward compatibility and to get the discussion going. I agree that a low default (e.g., 2–5 attempts) would provide a better UX, but I'd also be cautious about changing pip’s default install behavior.

We have two options:

  1. Set a low default right away, as you suggested.
  2. Release it as-is, monitor and fix any issues that arise, and consider making it the default in a future version.

Footnotes

  1. https://github.com/pypa/pip/issues/4796#issuecomment-1153260254

Signed-off-by: gmargaritis <[email protected]>
(cherry picked from commit f2e48c3f5885305369b88761ab74cd16a0869667)
Signed-off-by: gmargaritis <[email protected]>
(cherry picked from commit 53ce184348de1af4937dc04de7a1aedbe4ede19a)
Signed-off-by: gmargaritis <[email protected]>
(cherry picked from commit 1f8d7fe0b0a5c7b53719bd8713619f982c042dbf)
Signed-off-by: gmargaritis <[email protected]>
(cherry picked from commit af6b7ac624ebc18035d2da217c4c1850a6850cd7)
Signed-off-by: gmargaritis <[email protected]>
(cherry picked from commit 67e366aec42d913436159ca3bf877c46a0d5cd2c)
@ichard26 ichard26 self-assigned this Mar 17, 2025
@ichard26
Member

Just so everyone is on the same page, I plan on re-reviewing this PR sometime this week. I'm working on prototyping some code style changes which I'll share soon. Beyond that, I'd like to review the other parts of the resuming UX. After that, I should be happy enough with this to merge it and let any other suggestions be handled at a later date.


Successfully merging this pull request may close these issues.

[Improvement] Pip could resume download package at halfway the connection is poor