pip offline behaviour #4753

Closed · davidhyman opened this issue Oct 2, 2017 · 10 comments
Labels: auto-locked (Outdated issues that have been locked by automation)

@davidhyman

  • Pip version: 9.0.1
  • Python version: 2.7.13
  • Operating system: windows 10

Description:

Installing packages from disk when offline:

When no internet connection is available, pip should reprioritise --find-links or local URIs. Instead, it repeatedly tries and fails (with urllib3 connection errors) to fetch from online URIs, and only then falls back to the local ones. This happens for every single package.

I don't feel this is resolved by --no-index, since if a connection were available I would expect pip to make use of it.

I would expect one of these three behaviours instead:

  • attempt resolution of dependencies using indices in the order they were specified
  • attempt resolution of dependencies using local URIs first, then remote
  • attempt resolution of dependencies seemingly at random as at present, but if a URI returns a connection error don't keep trying it for every single package. Give another URI a chance!

What I've run:

# build a local repo
pip2pi cheeseshop -r requirements.txt
# install from local repo
pip install -r requirements.txt -ifile:\\C:\coding\pip_example\cheeseshop\simple\

I have also tried numerous variations on this, including pip download (i.e. using the pip download cache instead of a local repo), -i vs --extra-index-url, and so on. Whichever combination or ordering of parameters I use, pip tries to hit the internet first.

> pip install -r requirements.txt --extra-index-url=file:\\C:\coding\pip_example\cheeseshop\simple\
Collecting bottle<=1 (from -r requirements.txt (line 1))
  Retrying (Retry(total=4, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x04364410>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed',)': /simple/bottle/
<SNIP>
  Retrying (Retry(total=0, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x04364590>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed',)': /simple/bottle/
Collecting nose (from -r requirements.txt (line 2))
  Retrying (Retry(total=4, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x05042BF0>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed',)': /simple/nose/
<SNIP>
<successful install of all requirements>
@pradyunsg
Member

I'm curious what your use case/motivation here is...

Wouldn't it be more deterministic to download the required files (with a guaranteed internet connection) and then install those downloaded packages (using --no-index and --find-links)?
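Something like this (the directory name here is just an example):

# with a working connection: fetch everything into a local directory
pip download -r requirements.txt -d wheelhouse
# later, offline: install strictly from that directory, never touching an index
pip install -r requirements.txt --no-index --find-links wheelhouse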

@pradyunsg pradyunsg added the S: awaiting response Waiting for a response/more information label Oct 2, 2017
@davidhyman
Author

So, a few cases that should all work using the same command:

  • full internet connection
  • with minimal / partial connectivity
  • no internet connection (use packages already provided)

Use cases might be:

  • package provision and installation as a method of deployment for offline users, but with the opportunity to update packages when a connection becomes available
  • the same, for local development (e.g. working on a train)

I suppose it boils down to not having to provide different pip commands depending on the state of the network connection. In particular, I feel that the options already available (namely --find-links and --extra-index-url) don't behave intuitively in the case that the system is offline.

I don't think this would change the determinism of pip: if a user has explicitly specified a find-links directory without --no-index, that was intentional, since the two are otherwise always used together. The use case for having them independent (i.e. using only find-links) seems to be allowing the user to install an unpublished local package with pip, and I don't think that behaviour would change when improving the offline behaviour.

@pradyunsg pradyunsg removed the S: awaiting response Waiting for a response/more information label Oct 3, 2017
@pradyunsg
Member

ref: #4321 maybe.

(I'm not the best person to comment on this; I'll let someone else take it from here)

@dstufft
Member

dstufft commented Oct 3, 2017

This is unlikely to be something that we're going to do. It's not possible for us to know whether you mean to have a network connection or not, so the only thing we can do is what we do now: try, and eventually time out.

The use case you're talking about is probably better suited to running a local devpi mirror, which will let you work in both the offline and online cases and will automatically keep your local package cache up to date.
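A rough sketch of that setup (the exact bootstrap steps, such as an init command, vary between devpi versions, so check the devpi docs):

# one-time setup of the caching mirror
pip install devpi-server devpi-client
devpi-server   # serves a caching PyPI mirror, by default on http://localhost:3141
# point pip at the mirror's root/pypi index; packages are cached for offline reuse
pip install -r requirements.txt -i http://localhost:3141/root/pypi/+simple/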

@davidhyman
Author

davidhyman commented Oct 3, 2017

I appreciate your response, but I suspect I failed to explain what is actually causing the problem (I recommend trying the repro steps in the PR to see the pain point). In particular, it's not that pip has to detect whether there is an internet connection, just that an index is unreachable, and typically that is unlikely to change over the course of a pip install run.

As such, I've investigated a way of getting this working. I hope the PR referenced above clearly demonstrates a solution, and more adequately demonstrates the core behaviour that was causing excessively long installation times ("don't keep trying it for every single package", as above).

I would maintain that offline usage of pip is still a valid use case even when a devpi mirror is available, because the only way to avoid the excessive timeouts is to use -i rather than --extra-index-url, which means using a different command depending on the state of my connection / the availability of any given index.
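Concretely, the only fast offline form today replaces the index outright (reusing the repo path from my original report):

pip install -r requirements.txt -i file:\\C:\coding\pip_example\cheeseshop\simple\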

FWIW, there are quite a lot of questions on SO about offline usage. Without examining every one, it seems to me that offline is definitely one of the use cases for pip, even if it was not originally intended to be.
https://stackoverflow.com/search?q=pip+offline

@pfmoore
Member

pfmoore commented Oct 3, 2017

There's a dilemma here: if we mark a URL as invalid when it times out (and so improve the time for offline use), we also allow a transient problem to block any later use of the same URL, when those later uses might have worked.

And I don't think there are many people who need to use pip while offline, yet for some reason cannot simply add --no-index when doing so. Also, you could try adding --timeout 1, which reduces the default timeout from 15 seconds to 1 second, cutting any "excessive timeout" you see by a factor of 15.
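For completeness, pip also has a --retries option (default 5) that multiplies the wait per unreachable index, so a combination like this shortens the worst case further:

pip install -r requirements.txt --timeout 1 --retries 0 --extra-index-url=file:\\C:\coding\pip_example\cheeseshop\simple\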

Finally, while I haven't gone through the SO search you posted in detail, I'd imagine a lot of those users would be completely satisfied with --no-index --find-links. Do you have evidence to show otherwise?

@davidhyman
Author

Thanks for your response; I'll try to answer each of your points. I appreciate that triaging is hard and often thankless work, but I assure you it's valued and I have thought about this.

> transient problem

Yes, I agree conceptually; when corner-case hunting this pops up straight away. But I'm not sure it would have much effect in reality.
With two valid repos, A and B, I could currently get my packages x, y, z from them in this sequence: x:A y:B z:A, assuming the timeframe of the transient failure (including retries and timeouts) is such that y has to switch repos. With blacklisting: x:A y:B z:B. So the only failure mode I actually see beyond the current implementation is when the online repositories are mismatched, z is present only in the first repository, and the failure is transient across the period of running pip install (x:A, y:B, z:!!).

  1. That corner case is pretty extreme.
  2. Could some of your concerns be addressed by discussing implementation details? For example, we could:
    • Append blacklisted hosts to the end of the candidates list and try them last, for packages like z in the scenario above that didn't resolve.
    • Provide a flag or switch so that accepting this corner case is opt-in.
    • Or adjust the sort order of the candidates to match the ordering provided by the user. This would improve the situation for other use cases, as noted in Missing an option to prioritize --find-links over --index-url #4321 (preferring local to remote, avoiding name squatting/spoofing, etc.).

> simply add --no-index

So the user has to determine whether they (or any index) are offline, and then change the command given to the program? I don't agree with this approach: it makes pip harder to use, less reproducible and less scriptable. Robust programs should deal with network failure gracefully.
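To illustrate the two diverging commands (reusing the local repo path from my original report):

# online: consult PyPI as well as the local repo
pip install -r requirements.txt --extra-index-url=file:\\C:\coding\pip_example\cheeseshop\simple\
# offline: a different invocation is required to avoid the per-package retries
pip install -r requirements.txt --no-index --find-links=C:\coding\pip_example\cheeseshop\simple\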

> you could try adding

Have you tried running the example I posted in the PR? I already used a timeout there, for brevity. Two reasons this isn't ideal:

  1. Timeouts are there for a reason: the server may be genuinely slow, or the internet connection flaky. Not everyone is on a T1 backbone! Give the pixies a chance!
  2. A factor of 15 sounds like a lot, but for a large project the total wait is still O(N) in the number of packages, instead of O(1). Incidentally, I have a commit that adds a separate connect_timeout option to be passed to requests, but I don't want to confuse the issue.

> Have you evidence to show otherwise

Sorry, I don't run a programmers' collective or a polling agency! But as a user it is patently better to take no action and have the program work than to have to take some action. If we add a switch, one could replace --no-index with --skip-hosts (for example). If we manage without a switch, offline usage wouldn't even need --no-index, and users wouldn't need to change the command depending on where they are, the time of day, etc., all because of connectivity. (Without reiterating the use cases: this is just one of them; think of package distribution as well.)
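Purely as hypothetical syntax for the switch idea (no such option exists in pip today):

# hypothetical flag, not an existing pip option: stop retrying any index host
# that has already failed to connect during this install run
pip install -r requirements.txt --skip-hosts --extra-index-url=file:\\C:\coding\pip_example\cheeseshop\simple\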

@pfmoore
Member

pfmoore commented Oct 4, 2017

OK. Thanks for your detailed reply. However, I agree with @dstufft: this isn't something we're likely to implement. If someone were to raise a PR for this, it would be reviewed, but there's obviously no guarantee that it would be accepted even then.

@davidhyman
Author

I strongly suspect you're not reading my replies. There is already a PR here: #4763

@pfmoore
Member

pfmoore commented Oct 4, 2017

No, I simply missed that. Thanks for the pointer.

@lock lock bot added the auto-locked Outdated issues that have been locked by automation label Jun 3, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Jun 3, 2019