gh-80789: Implement build-time pip bundling in ensurepip
#12791
Conversation
Status update: during the EuroPython 2023 sprints in Prague, I've started working on the conflict resolution here; there have been a number of changes to reconcile.
I haven't had time to review the code fully (yet), but I do have a high-level question: what is the intended workflow for a pip release once this change has been made? Currently, when we release a new version of pip, we create a PR for CPython that updates the bundled pip version. That PR is backported to all supported versions of CPython (where "supported" means both "still accepting PRs" and "supported by the new pip version"). With the new approach, what will we need to do when we release a new version of pip, if anything? There are some specific scenarios, but they basically boil down to: can we override the default of "just take the latest pip version", and how do we communicate with the Python release team on this?
@pfmoore There's no such default, and I haven't changed it; @AA-Turner was mistaken in his observation. There's still a hard-coded pip version in the source code: https://github.com/python/cpython/pull/12791/files#diff-f5e321563d5bc98371a960662f654a9387fd24f9bacb382d5b90d417ad2569c7R243-R244. The new thing is that a binary in Git is replaced by a SHA-256 hash. Whenever the new bundler is invoked, it checks the integrity of the existing (gitignored) pip wheel on disk. If the checksum doesn't match or the artifact is absent, it downloads the wheel from PyPI at that hard-coded version and checks it against the hard-coded SHA-256 before finally writing it to disk.
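To make the described flow concrete, here is a rough sketch of the check-then-download logic; the function and constant names are illustrative placeholders, not the PR's actual API:

import hashlib
from pathlib import Path
from urllib.request import urlopen

# Placeholders standing in for the values hard-coded in the source tree.
PIP_VERSION = "<pinned pip version>"
PIP_SHA256 = "<expected sha256 of the pip wheel>"


def ensure_bundled_wheel(target: Path) -> None:
    # Reuse the gitignored wheel already on disk if its checksum still matches.
    if target.exists() and hashlib.sha256(target.read_bytes()).hexdigest() == PIP_SHA256:
        return
    # Otherwise fetch the pinned version from PyPI...
    url = ("https://files.pythonhosted.org/packages/py3/p/pip/"
           f"pip-{PIP_VERSION}-py3-none-any.whl")
    with urlopen(url) as response:
        data = response.read()
    # ...and refuse to write anything that fails the pinned SHA-256 check.
    if hashlib.sha256(data).hexdigest() != PIP_SHA256:
        raise ValueError("downloaded pip wheel failed the SHA-256 check")
    target.write_bytes(data)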
That's an interesting question. I suppose it is currently decided by a human, and that hasn't changed. But I can imagine an automation or a test that checks the metadata in the wheel too. This is not intended to change in this PR, but it's certainly an appealing future improvement.
The old process involved updating the hard-coded pip version and committing the actual wheel file. The new process is to update the hard-coded pip version and the corresponding SHA-256 hash, which can be copied from PyPI, for example, or calculated locally. There are no other changes to this workflow for the pip maintainers: the patch is smaller, and the backports are the same. UPD: clarified this in the PR description.
I am very much on board with the stated goal of this PR of removing the binary blobs from the repository. I'm not fully convinced by the approach, though: this adds quite a lot of code to download (currently, and probably never again more than) one blob, and it took me several minutes of searching to find where the actual download was happening, which turns out to be buried in a module called _structs.

ISTM that if we're reaching out to the internet anyway, we could have a much lower maintenance burden here by making use of bootstrap.pypa.io/get-pip.py, which could easily grow a --download-only option with a ~3 line implementation. Something like

get-pip.py --no-setuptools --no-wheel --download-only 'pip=={pip_version}' -w Lib/ensurepip/_bundled

would pretty much do everything we want as far as I can tell. If we used a requirements file with it instead of the versioning argument, we could even have hash verification.
PCbuild/build.bat (outdated)

set suffix=
if "%conf%"=="Debug" set suffix=_d
set built_python_exe=%prefix%\python%suffix%.exe
call "%built_python_exe%" -m ensurepip.bundle -v
There's a handy "%dir%\..\python.bat" that you can use instead.
Is it guaranteed to locate the built interpreter version? I was a bit lost and couldn't find how this file is generated, so I wasn't able to check it myself.
Yes; see cpython/PCbuild/python.vcxproj, lines 134 to 150 at 4116592:
<Target Name="GeneratePythonBat" AfterTargets="AfterBuild">
  <PropertyGroup>
    <_Content>@rem This script invokes the most recently built Python with all arguments
@rem passed through to the interpreter. This file is generated by the
@rem build process and any changes *will* be thrown away by the next
@rem rebuild.
@rem This is only meant as a convenience for developing CPython
@rem and using it outside of that context is ill-advised.
@echo Running $(Configuration)^|$(Platform) interpreter...
@setlocal
@set PYTHONHOME=$(PySourcePath)
@"$(OutDir)python$(PyDebugExt).exe" %*
</_Content>
    <_ExistingContent Condition="Exists('$(PySourcePath)python.bat')">$([System.IO.File]::ReadAllText('$(PySourcePath)python.bat'))</_ExistingContent>
  </PropertyGroup>
  <WriteLinesToFile File="$(PySourcePath)python.bat" Lines="$(_Content)" Overwrite="true" Condition="'$(_Content)' != '$(_ExistingContent)'" />
</Target>
(Note that %dir%\..\python.bat and %dir%\find_python.bat are very different :). You want the former, which is also used in the PGO section several lines up.)
I applied the suggested change. Marking this thread as resolved.
@zware could you take a look at https://dev.azure.com/Python/cpython/_build/results?buildId=136036&view=logs&j=91c152bd-7320-5194-b252-1404e56e2478&t=c7e99cd8-4756-5292-d34b-246ff5fc615f&l=764 — any idea why it might be failing? I don't see how it might be related to the PR, but OTOH, it doesn't crash on main…
That particular build is apparently trying to run an arm64 build on an amd64 runner, which isn't going to work out well :)
Note to myself: this will need a companion change @ https://github.com/python/buildmaster-config/blob/6fed7ed/master/custom/factories.py#L79.
Per suggestion at python#12791 (comment). Co-Authored-By: Zachary Ware <[email protected]>
Suggested via python#12791 (comment). Co-Authored-By: Zachary Ware <[email protected]>
I don't know if we should bring externally hosted executable code into play, as it is an additional security concern to watch out for.
Agreed, let's not do that. The PyPI API for downloading a package is simple enough; let's just stick to stdlib code to grab the wheel. I'd also like to have either command-line or environment-variable options to skip the download, so that those of us who build in isolated environments can just skip it and not have to deal with internet access.
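A minimal sketch of what such an opt-out might look like; the environment variable name and function below are hypothetical and not part of the PR:

import os


def maybe_bundle_wheel() -> None:
    # Hypothetical escape hatch for offline / isolated build environments.
    if os.environ.get("ENSUREPIP_SKIP_DOWNLOAD"):
        print("ensurepip: skipping pip wheel download (offline build)")
        return
    # ...otherwise verify the cached wheel and download it if needed...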
+1 to both the points @zooba made.
I do agree with this point, though. One of the reasons I haven't gone through the code in this PR yet is that it does seem pretty complex for something that's essentially just downloading a single file. Why not just hard-code the pip version, SHA, and URL (all of which can be copied from the PyPI download page) and do a simple HTTP download? It adds a little more work for the pip RM to get the URL, but I'm OK with that (I'm speaking as a pip RM here, in case it's not obvious!). As far as I know, PyPI download URLs are permanent, so there's no risk of the location moving.
OK, so management of pip releases remains identical, which is great, as it means our existing, working workflow can continue and we don't have to worry about edge cases like the ones I started to enumerate above 🙂
Fun fact: the very first implementation I had in this PR years ago did just that — it actually used URLs with SHA-256 in the fragment. See: e2dcb69#diff-ecfb0d1f34501dd30722053b9b39dea9139ffdd30dfcdc0964bcb18b95e443e4R16-R19.
Some of it is a byproduct of trying to have a shared place for data that's used by both the downloader and the normal runtime. I also wanted things to be testable and to produce 100% coverage; I figured the import-time evaluation previously present in some places was making that hard to achieve, which is why I pushed as much as possible from import time to run time. Admittedly, this might've resulted in some overengineering. @pfmoore, would it make sense to include getting rid of the multi-wheel leftovers here, or should that happen separately?
I think it did, a bit :-) I wasn't aware of the predictable URL scheme. I would copy the URL-construction snippet from the Warehouse docs.

The tags are all static, as is the project name. Again, a comment noting that pip wheels are always pure-Python (py3-none-any) would cover that. I really don't see, therefore, why we need anything more complex than a simple hard-coded download.
I should note that I still haven't gone through the code in the existing PR in detail, so I'm not quite sure what you mean when you say "a shared place for data that's used by the downloader and the normal runtime". I'm not sure of the benefit, as the only data the downloader needs is the pip version and the hash - are you thinking of the wheel name itself? If so, then I think you're overcomplicating again - pip wheels are always pip-{version}-py3-none-any.whl.
@pfmoore that's what's implemented @ c31bdca#diff-f5e321563d5bc98371a960662f654a9387fd24f9bacb382d5b90d417ad2569c7R33-R34. Note that the original code has a tuple of wheels (it used to be pip + setuptools), and the removal of setuptools didn't change that, so this tuple remains in the code base (I felt that was out of scope here, plus this PR was started back when there were two things in that tuple). That essentially dictates a good chunk of the loops and related strings in the code base. I know for sure that removing this will make the control flow more straightforward to deal with. What's your take on this? Should I try implementing that removal prior to this PR?
@pfmoore also note that the runtime use of the wheels is dependent on a compile-time setting.
OK, I agree with @pradyunsg that we should remove all of the code that allows for multiple wheels, and strip everything back to the core job of managing a single pip wheel. If the complexity here is because you're trying to preserve generality to allow for the possibility of multiple wheels, then I think that's unnecessary. I personally don't think there's any need to worry about having multiple PRs reflecting refactoring vs new features here. If you want to do an initial PR that switches to only caring about a pip wheel, and follow up with the download-at-build-time PR, then fair enough, but personally I'd just do it as a single PR.
Just so we're clear what I'm talking about, my suggestion (if you want to leave the refactoring for later) is to add a hash constant next to the existing hard-coded pip version:

_PIP_HASH = '7ccf472345f20d35bdc9d1841ff5f313260c2c33fe417f48c30ac46cccabf5be'

and then add a single new module along these lines:

import ensurepip
import re
import hashlib
from urllib.request import urlopen
from pathlib import Path
from importlib.resources import (
    abc as _resources_abc,
    files as _get_traversable_dir_for,
)

BUNDLED_WHEELS_PATH: _resources_abc.Traversable = (
    _get_traversable_dir_for(__package__) / "_bundled"
)

# The following code was copied from
# https://warehouse.pypa.io/api-reference/integration-guide.html#predictable-urls
#
# It generates stable download URLs for project artifacts.
#
# Note that we rely on the fact that pip is (as a matter of policy) portable code,
# and so we can guarantee that we'll always have the values
#   build_tag = ""
#   python_tag = "py3"
#   abi_tag = "none"
#   platform_tag = "any"

host = 'https://files.pythonhosted.org'


def source_url(name, version):
    return f'{host}/packages/source/{name[0]}/{name}/{name}-{version}.tar.gz'


def wheel_url(name, version, build_tag, python_tag, abi_tag, platform_tag):
    # https://www.python.org/dev/peps/pep-0491/#file-name-convention
    wheel_parts = {
        tag: re.sub(r'[^\w\d.]+', '_', part, flags=re.UNICODE)
        for tag, part in locals().items()
    }
    wheel_parts['optional_build_tag'] = f'-{wheel_parts["build_tag"]}' if build_tag else ''
    filename = '{name}-{version}{optional_build_tag}-{python_tag}-{abi_tag}-{platform_tag}.whl'\
        .format_map(wheel_parts)
    return f'{host}/packages/{python_tag}/{name[0]}/{name}/{filename}'

# End of code copied from the PyPI docs


def download_pip(version: str, hash: str):
    url = wheel_url("pip", version, "", "py3", "none", "any")
    target = BUNDLED_WHEELS_PATH / url.rpartition('/')[2]
    with urlopen(url) as f:
        content = f.read()
    if hashlib.sha256(content).hexdigest() != hash:
        raise ValueError("Downloaded file has invalid hash")
    target.write_bytes(content)


if __name__ == "__main__":
    # Do we need to delete existing `.whl` files from BUNDLED_WHEELS_PATH?
    download_pip(ensurepip._PIP_VERSION, ensurepip._PIP_HASH)

(Tidy up the formatting, add some comments, and maybe some progress reporting and better error handling as you wish.) Once that's in, and the CI stuff (which I've not looked at, because I don't really know how it works) is working, then worry about refactoring out the vestiges of the code that expected setuptools to be present as well. The above code may be less well-engineered than the code in the PR, but it's a heck of a lot easier to review (IMO).
You can look at how the clinic tool is imported in its tests:

from test import test_tools

test_tools.skip_if_missing('clinic')
with test_tools.imports_under_tool('clinic'):
    import clinic

A
I agree with Paul -- you could potentially go even further and hard-code more of this. (Where the bootstrapping tool itself should live is a separate question.)

Edit: Assuming we want to drop multiple-projects support later, the bootstrap should probably be:

def download_wheel(project: str, version: str, hash: str):
    url = wheel_url(project, version, "", "py3", "none", "any")
    target = BUNDLED_WHEELS_PATH / url.rpartition('/')[2]
    with urlopen(url) as f:
        content = f.read()
    if hashlib.sha256(content).hexdigest() != hash:
        raise ValueError("Downloaded file has invalid hash")
    target.write_bytes(content)


if __name__ == "__main__":
    # Do we need to delete existing `.whl` files from BUNDLED_WHEELS_PATH?
    for proj, ver, chksum in ensurepip._PROJECTS:
        download_wheel(proj, ver, chksum)

Assuming the third tuple entry is changed from the python tag to the SHA-256 checksum.

A
I'll echo the general sentiment that this introduces a lot of complexity, and it would be good to reduce it if we're reworking how things work -- many of the changes make more sense in a world where both "setuptools" and "pip" are needed, and IMO this definitely does not need a public module to be exposed for it.
TBH, that's the approach I like best among everything that's come up in the discussion. I don't quite follow why you think that's fragile. And, as someone who would need to do these bump PRs, it isn't particularly annoying either (copy the URL for the relevant wheel from https://pypi.org/simple/pip/ and paste it in).
try:
    first_matching_dist_wheel = sorted(dist_matching_wheels)[0]
Self note: the original code actually goes for the last matching:
first_matching_dist_wheel = sorted(dist_matching_wheels)[0]
last_matching_dist_wheel = sorted(dist_matching_wheels)[-1]
So I got around to removing the looping: #109245.
@pradyunsg communicated IRL that it's been decided to postpone the effort indefinitely due to contributor-experience concerns, which is why I'm closing the PR for now. If something changes, it can be resurrected.
Prior to this patch, pip wheels were stored in the Git repository of CPython. Git is optimized for text, but these artifacts are binary, so the unpleasant side effect is that the bare Git repository grows by the size of the zip archive every time the wheel is added, removed or modified. It's time to put a stop to this.
The patch implements an ensurepip.bundle module that is meant to be called through runpy to download the pip wheel and place it into the same location as before. It removes the wheel file from the Git repository and prevents re-adding it by defining a new .gitignore configuration file.
The idea is that the builders of CPython are supposed to invoke the following command at build time:

$ python -m ensurepip.bundle

This command will verify the existing wheel's SHA-256 hash and, if it does not match or the wheel doesn't exist, it will proceed to download the artifact from PyPI. It will confirm its SHA-256 hash before placing it into the Lib/ensurepip/_bundled/ directory.
The only change for the pip maintainers submitting pip version bumps to the CPython repo is that they will no longer need to check in the wheel file binary; instead, they will update the SHA-256 and pip version strings at the end of Lib/ensurepip/_structs.py. Previously, the pip version was hard-coded at the top of Lib/ensurepip/__init__.py and no SHA-256 was included.
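Purely as a hypothetical illustration of such a bump (the names and placeholder values below are made up and are not the actual contents of Lib/ensurepip/_structs.py):

# Bumped on every pip release; the hash can be copied from PyPI or computed locally.
_PIP_VERSION = "<new pip version>"
_PIP_SHA_256 = "<sha256 of the matching pip wheel>"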
Every single line added or modified as part of this change is covered with tests. Every new module has 100% coverage. The only uncovered lines under Lib/ensurepip/ are ones that are entirely unrelated to this effort.
Resolves #80789.
Ref: https://bugs.python.org/issue36608.