Reproducible installs #5648
Comments
Alternatively, pip could just preserve the timestamps from extracted wheel files. The
This will only help if you're installing from wheel files, but if you care about reproducibility then you should probably be doing that anyway. |
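A rough sketch of what "preserving the timestamps from extracted wheel files" could look like, using only the standard `zipfile` module. This is not pip's actual extraction code; the function name `extract_preserving_mtime` is hypothetical, and the sketch assumes the archive's recorded times should be read as local time (zip entries carry no timezone).

```python
import os
import time
import zipfile


def extract_preserving_mtime(archive_path, dest_dir):
    """Extract a zip/wheel and copy each member's recorded time onto the file.

    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(archive_path) as zf:
        for info in zf.infolist():
            target = zf.extract(info, dest_dir)
            if info.is_dir():
                continue
            # ZipInfo.date_time is (year, month, day, hour, minute, second);
            # it has no timezone, so this sketch treats it as local time.
            mtime = time.mktime(info.date_time + (0, 0, -1))
            os.utime(target, (mtime, mtime))
```

With the source mtime fixed this way, a timestamp-based .pyc compiled afterwards would embed the same value on every install of the same wheel.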
@mhsmith: True that! For wheel files this is probably the way to go. As far as I can tell the On the other hand, source installs are still important though: in my particular case many images are based on Alpine Linux, which uses musl-libc instead of glibc, so I cannot reuse |
I created a PR regarding the low-level plumbing in |
Do you mean when installing a package from a wheel? How is that relevant to this issue? I'm not completely following. |
The issue was that pip-generated .pyc files are not currently reproducible, because they contain embedded copies of the filesystem timestamps of the .py files they were built from. |
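To make the embedded timestamp visible, here is a small sketch that compiles the same source twice with different file mtimes and decodes the .pyc header. It assumes Python 3.7+ (the 16-byte header layout from PEP 552); the helper name `pyc_header` and the temporary file names are made up for the example.

```python
import os
import py_compile
import struct
import tempfile


def pyc_header(pyc_path):
    """Return (flags, mtime, size) from the 16-byte header of a 3.7+ .pyc."""
    with open(pyc_path, "rb") as f:
        header = f.read(16)
    # Layout: 4-byte magic, 4-byte flags, then 8 more bytes. With flags == 0
    # (timestamp mode) those 8 bytes are the source mtime and source size.
    return struct.unpack("<III", header[4:16])


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "mod.py")
    with open(src, "wb") as f:
        f.write(b"x = 1\n")

    headers = []
    for stamp in (1_000_000_000, 1_500_000_000):
        # Pretend the same file was installed at two different times.
        os.utime(src, (stamp, stamp))
        pyc = py_compile.compile(
            src,
            cfile=os.path.join(tmp, f"mod-{stamp}.pyc"),
            invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP,
        )
        headers.append(pyc_header(pyc))

# Same source bytes, but the mtime field in the header differs.
print(headers)
```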
So you're saying that the actual bytes in a |
It's so the interpreter can decide whether the file needs to be recompiled. Obviously this can give both false positives and false negatives, so PEP 552 specifies a replacement scheme which uses a content hash. This was implemented in Python 3.7, but is still not enabled by default. |
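For comparison, a sketch of the PEP 552 hash-based mode using the standard `py_compile` API (Python 3.7+). It only inspects the 16-byte header, so it says nothing about whether the marshalled bytecode after the header is byte-for-byte stable; file names are made up for the example.

```python
import importlib.util
import os
import py_compile
import tempfile

source_bytes = b"x = 1\n"

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "mod.py")
    with open(src, "wb") as f:
        f.write(source_bytes)

    headers = []
    for stamp in (1_000_000_000, 1_500_000_000):
        os.utime(src, (stamp, stamp))  # vary the mtime between compiles
        pyc = py_compile.compile(
            src,
            cfile=os.path.join(tmp, f"mod-{stamp}.pyc"),
            invalidation_mode=py_compile.PycInvalidationMode.CHECKED_HASH,
        )
        with open(pyc, "rb") as f:
            headers.append(f.read(16))

# Bytes 4..8 are the flags word, bytes 8..16 the hash of the source file.
flags = int.from_bytes(headers[0][4:8], "little")
embedded_hash = headers[0][8:16]

print(flags)                                                     # hash-based + check_source bits
print(embedded_hash == importlib.util.source_hash(source_bytes))
print(headers[0] == headers[1])                                  # header no longer depends on mtime
```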
@jdemeyer: I think I should point out that @mhsmith and I are talking about two separate but very closely related sub-issues with regard to reproducible installs:
My immediate use case is sdist installs, where the modification time of the installed file can indeed vary with each recompilation. Enabling installation with content hashes (what this issue is mainly about) fixes the reproducibility issue in all cases, but is only available in Python 3.7+. |
I wrote a proposal to distutils-sig for changing the timestamps of installed files. It's not directly related to this issue, but since it's about timestamps, you may find it relevant. |
@jdemeyer: In general it feels like timestamps should be retained as much as possible (they should only change if somebody, or something, actually changed the file's contents). Also the comparison with Regarding source installs, I agree that preserving the timestamps is pretty problematic, however (arbitrary files may be modified in unexpected ways). So restamping all files post-build seems like a reasonable idea; special-casing the ^ Just my thoughts after reading your proposal. Maybe it's useful to you. 🙂 |
My proposal concerns precisely the installation part, not the build part. So I don't see why wheels should be different from a from-source build. |
No because that's completely incompatible with other build systems. The point here is to make Python-installed packages more similar to other-build-system installed packages. |
@jdemeyer: Because installation is fundamentally a non-mutating operation? („Take the files from there and put them here.“) Also, IMHO a better comparison for the installation phase, in the PIP context, would be the Debian package manager: when you select a package to be installed it will simply assemble the list of packages, download them to the system, and then extract their contents to the filesystem, which is pretty much exactly what PIP does when it has wheels. (The fact that it will also run maintainer scripts is beside the point here.) The comparison with autotools doesn't even make sense unless you're talking about sdists. APT also retains the timestamps from packages it downloads. |
Again, I'm only looking at the installation part, not the build part. In that respect:
Seems pretty much the same... |
As I explained in my proposal, for dependency checking. This is a good practical reason. Note that neither |
Since you mention Note that I'm not expressing a preference for one way or the other -- I'm just noting the |
This is important for openSUSE when building from wheels. We are trying to make the builds for the whole project reproducible, and we are almost there. One of the very few areas we have left are python packages installed from wheels. So it would be really helpful for us if there was some way to make packages relying on wheels reproducible. |
Let's please follow standard library on this: set CHECKED_HASH only if SOURCE_DATE_EPOCH is defined. |
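A minimal sketch of that rule, mirroring the documented default of `py_compile.compile` (since Python 3.7.2 the `SOURCE_DATE_EPOCH` environment variable determines the default `invalidation_mode`). The function name `default_invalidation_mode` and its `environ` parameter are hypothetical and exist only to make the logic testable.

```python
import os
import py_compile


def default_invalidation_mode(environ=None):
    """Pick CHECKED_HASH when SOURCE_DATE_EPOCH is set, TIMESTAMP otherwise.

    Hypothetical helper; `environ` defaults to the real process environment.
    """
    env = os.environ if environ is None else environ
    if env.get("SOURCE_DATE_EPOCH"):
        return py_compile.PycInvalidationMode.CHECKED_HASH
    return py_compile.PycInvalidationMode.TIMESTAMP
```

A caller that simply leaves `invalidation_mode` unset when calling `py_compile.compile` gets the same behavior from the standard library itself.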
What is the consensus? I feel like the initial proposed solution (add a command line option for setting the invalidation mode passed to I also have a use-case for this: I'd like to make reproducible docker layers (never mind the timestamp of installed files after the .pyc files have been generated because they can be modified after the fact, whereas the content of the generated .pyc files can not be so easily). |
That seems fine to me. Note that SOURCE_DATE_EPOCH is the industry-standard way of saying "make this reproducible". |
@dawagner: You might want to look at my docker-image-rebuilder which uses reproducibility in generated OCI/Docker images to only push image updates when the build actually resulted in a significant change. (You can also set ignore list for files beyond reproducibility control.) |
Note bytecode with pip should already be as reproducible as it gets. Which is not fully reproducible. Python itself cannot generate fully reproducible bytecode yet. |
I'm now a bit worried: https://github.com/pypa/pip/blob/8ce5d5abbcc5bbcb8b3bf1f364ac6540a2b60b22/src/pip/_vendor/distlib/wheel.py in fact does not seem to enable hash-based validation. So possibly this is still a problem for wheel installs. (pip automatically creates wheels nowadays for everything before installing) |
It seems this is not the right repo to fix the problem though. The file is vendored from https://bitbucket.org/pypa/distlib/src/207d4599a330913d51a4c5865745a49e11e851cb/distlib/wheel.py#lines-489 |
IIRC |
Another point where reproducible installs are broken is when compiling the
(see |
Have you tried with the new --use-feature=in-tree-build? |
Nop, don't have that in |
What's the problem this feature will solve?
Reproducible installations. Currently, when running pip install, all relevant Python packages are installed into their respective locations and Python bytecode files are generated appropriately. Unfortunately these files are not reproducible (they contain the timestamps of the files they were generated from) and will therefore cause the filesystem images they were created for to be non-reproducible as well.
Describe the solution you'd like
With Python 3.7 and PEP-552 a new and clean solution for this problem is now finally visible on the horizon.
Basically the call to py_compile.compile in PIP should be enhanced like this:
Since I'm guessing that PIP devs don't want to do this for all installations (there is no good reason IMHO, just assuming), another command-line flag will be required that allows one to make the installation as reproducible as possible.
(For full reproducibility installed shared libraries would require their timestamps to be zeroed as well, but I don't see how PIP can be any help in this currently.)
If the PIP team is willing to enable this by default, then it should only be enabled for non-editable system installs. Otherwise people will be surprised that their Python source changes are ignored by the interpreter.
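As a hedged illustration of the proposal (not pip's code and not the snippet originally posted in this issue), an opt-in switch could select the invalidation mode used when byte-compiling an installed tree. The function name `compile_installed_tree` and the `reproducible` parameter are hypothetical; `compileall.compile_dir` and its `invalidation_mode` argument are standard library API since Python 3.7. This sketch uses `CHECKED_HASH`, under which the interpreter still re-validates the source against the embedded hash.

```python
import compileall
import py_compile


def compile_installed_tree(site_dir, reproducible=False):
    """Byte-compile every .py file under site_dir.

    Hypothetical helper: `reproducible=True` stands in for the proposed
    command-line flag and switches from timestamp-based to hash-based .pyc files.
    """
    mode = (
        py_compile.PycInvalidationMode.CHECKED_HASH
        if reproducible
        else py_compile.PycInvalidationMode.TIMESTAMP
    )
    # force=True recompiles even if an up-to-date .pyc already exists.
    return compileall.compile_dir(
        site_dir, quiet=1, force=True, invalidation_mode=mode
    )
```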
My main (personal) use case currently is docker-image-rebuilder: it runs a full docker build procedure, then hashes the resulting filesystem and publishes the new version if there were any changes. PEP-552 also mentions build systems like Bazel and just about any Linux distro as its use cases. Most of these likely don't use PIP for gathering packages though.
Alternative Solutions
Since there are other non-reproducible files generated as well, I currently resort to filepath filtering rules like "**/__pycache__/*.pyc" to skip this problem.
Additional context
Reproducible Builds (and, by extension, installs as well) are the future! 🙂