Reproducible installs #5648

ntninja opened this issue Jul 23, 2018 · 28 comments
Labels
type: feature request Request for a new feature

Comments

@ntninja

ntninja commented Jul 23, 2018

What's the problem this feature will solve?

Reproducible installations. Currently when running pip install all relevant Python packages are installed into their respective locations and Python bytecode files are generated appropriately. Unfortunately these files are not reproducible (they contain the timestamps of the files they were generated from) and will therefore cause filesystem images they were created for to be non-reproducible as well.

Describe the solution you'd like
With Python 3.7 and PEP-552 a new and clean solution for this problem is now finally visible on the horizon.
Basically the call to py_compile.compile in PIP should be enhanced like this:

py_compile.compile(…, invalidation_mode=py_compile.PycInvalidationMode.UNCHECKED_HASH)

Since I'm guessing that PIP devs don't want to do this for all installations – there is no good reason IMHO, just assuming – another command-line flag will be required that makes the installation as reproducible as possible by enabling this invalidation mode.
(For full reproducibility installed shared libraries would require their timestamps to be zeroed as well, but I don't see how PIP can be any help in this currently.)
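
A minimal sketch of the effect, assuming Python 3.7+: the same source is compiled twice with different modification times, once per invalidation mode. The helper names (`compile_with_mtime`, `demo`) and the sample mtimes are made up for illustration and are not part of pip.

```python
import os
import py_compile
import tempfile
from pathlib import Path

Mode = py_compile.PycInvalidationMode


def compile_with_mtime(source, mtime, mode):
    """Set the source file's mtime, compile it, and return the .pyc bytes."""
    os.utime(source, (mtime, mtime))
    cfile = source.with_suffix(".pyc")
    py_compile.compile(str(source), cfile=str(cfile), doraise=True,
                       invalidation_mode=mode)
    return cfile.read_bytes()


def demo():
    """Return (hash_pycs_equal, timestamp_pycs_equal) for two different mtimes."""
    mtimes = (1_000_000_000, 1_500_000_000)  # arbitrary illustrative values
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "example.py"
        source.write_text("ANSWER = 42\n")
        hashed = [compile_with_mtime(source, t, Mode.UNCHECKED_HASH) for t in mtimes]
        stamped = [compile_with_mtime(source, t, Mode.TIMESTAMP) for t in mtimes]
    return hashed[0] == hashed[1], stamped[0] == stamped[1]


if __name__ == "__main__":
    print(demo())
```

With hash-based invalidation the two outputs should be byte-identical, while the timestamp-based ones differ in their header.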

If the PIP team is willing to enable this by default, then it should only be enabled for non-editable system installs. Otherwise people will be surprised that their Python source changes are ignored by the interpreter.

My main (personal) use-case currently is docker-image-rebuilder: It runs a full docker build procedure, then hashes the resulting filesystem and publishes the new version if there were any changes. PEP-552 also mentions build systems like Bazel and just about any Linux-distro as its use-cases. Most of these likely don't use PIP for gathering packages though.

Alternative Solutions
Since there are other non-reproducible files generated as well, I resort to filepath filtering rules for skipping this problem like "**/__pycache__/*.pyc" right now.

Additional context
Reproducible Builds (and, by extension, installs as well) are the future! 🙂

@pradyunsg added the "type: feature request" label Jul 24, 2018
@mhsmith
Contributor

mhsmith commented Jul 27, 2018

Alternatively, pip could just preserve the timestamps from extracted wheel files. The zipfile module can't do this itself, but it could be accomplished by adding a single line of code to unzip_file in https://github.com/pypa/pip/blob/master/src/pip/_internal/utils/misc.py:

os.utime(fn, (time.time(), calendar.timegm(info.date_time)))

This will only help if you're installing from wheel files, but if you care about reproducibility then you should probably be doing that anyway.
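
A rough, standalone sketch of that idea, outside of pip's actual `unzip_file`: each archive member is extracted and its recorded timestamp is applied as the file's mtime. The function name `extract_preserving_mtime`, the archive name and the sample timestamp are hypothetical; zip timestamps carry no timezone, so treating them as UTC is an assumption here.

```python
import calendar
import os
import tempfile
import time
import zipfile


def extract_preserving_mtime(archive_path, dest_dir):
    """Extract all members and restore the mtime recorded in the archive."""
    extracted = []
    with zipfile.ZipFile(archive_path) as zf:
        for info in zf.infolist():
            target = zf.extract(info, dest_dir)
            if info.is_dir():
                continue
            # date_time is (year, month, day, hour, minute, second).
            mtime = calendar.timegm(info.date_time)
            os.utime(target, (time.time(), mtime))
            extracted.append(target)
    return extracted


def demo():
    """Return (timestamp stored in the archive, mtime after extraction)."""
    stamp = (2018, 7, 27, 12, 0, 0)  # arbitrary illustrative timestamp
    with tempfile.TemporaryDirectory() as tmp:
        archive = os.path.join(tmp, "example.whl")
        with zipfile.ZipFile(archive, "w") as zf:
            zf.writestr(zipfile.ZipInfo("pkg/mod.py", date_time=stamp), "x = 1\n")
        (target,) = extract_preserving_mtime(archive, os.path.join(tmp, "out"))
        return calendar.timegm(stamp), int(os.stat(target).st_mtime)


if __name__ == "__main__":
    print(demo())
```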

@ntninja
Author

ntninja commented Jul 27, 2018

@mhsmith: True that! For wheel files this is probably the way to go. As far as I can tell the untar_file function already does this in the line tar.utime(member, path). So I guess what you're proposing should be a rather uncontroversial patch. 🙂

On the other hand, source installs are still important though: in my particular case many images are based on Alpine Linux, which uses musl-libc instead of glibc, so I cannot reuse -manylinux packages for binaries unfortunately. Also compiling from source IMHO makes sense here, since it's a compile-once/reuse-many kind of build, where reducing size is more important than build times.

@ntninja
Author

ntninja commented Jul 30, 2018

I created a PR regarding the low-level plumbing in distlib: https://bitbucket.org/pypa/distlib/pull-requests/38
Any comments/feedback on this would be appreciated!

@jdemeyer

jdemeyer commented Aug 4, 2018

Alternatively, pip could just preserve the timestamps from extracted wheel files.

Do you mean when installing a package from a wheel? How is that relevant to this issue? I'm not completely following.

@mhsmith
Contributor

mhsmith commented Aug 4, 2018

The issue was that pip-generated .pyc files are not currently reproducible, because they contain embedded copies of the filesystem timestamps of the .py files they were built from.

@jdemeyer

jdemeyer commented Aug 4, 2018

So you're saying that the actual bytes in a .pyc file depend on the timestamp of the corresponding .py file? That's strange...

@mhsmith
Contributor

mhsmith commented Aug 4, 2018

It's so the interpreter can decide whether the file needs to be recompiled. Obviously this can give both false positives and false negatives, so PEP 552 specifies a replacement scheme which uses a content hash. This was implemented in Python 3.7, but is still not enabled by default.
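
To make the two schemes concrete, here is a small sketch that decodes the 16-byte .pyc header as laid out by PEP 552 (Python 3.7+). The function name `parse_pyc_header` and the dictionary keys are made up for illustration; the sample bytes are the first 16 bytes of the hexdump posted further down in this thread.

```python
import struct


def parse_pyc_header(data):
    """Decode the 16-byte header of a Python 3.7+ .pyc file."""
    magic, flags = struct.unpack("<4sI", data[:8])
    if flags & 0b01:
        # Hash-based pyc: the next 8 bytes are a hash of the source file.
        return {
            "magic": magic,
            "hash_based": True,
            "check_source": bool(flags & 0b10),
            "source_hash": data[8:16],
        }
    # Timestamp-based pyc: source mtime and source size, 4 bytes each.
    mtime, size = struct.unpack("<II", data[8:16])
    return {
        "magic": magic,
        "hash_based": False,
        "source_mtime": mtime,
        "source_size": size,
    }


if __name__ == "__main__":
    sample = bytes.fromhex("550d0d0a 00000000 bd4d5c61 7e000000")
    print(parse_pyc_header(sample))
```

For that sample the flags word is zero, so the header is timestamp-based and carries the source mtime (1633439165) and size (126 bytes) directly.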

@ntninja
Author

ntninja commented Aug 5, 2018

@jdemeyer: I think I should point out that @mhsmith and I are talking about two separate but very closely related sub-issues with regards to reproducible installs:

  1. I was talking about the .pyc files generated by PIP not being reproducible because they embed the modification timestamps of their respective .py files at all.
  2. @mhsmith was talking about the .pyc files generated by PIP not being reproducible because they embed the modification timestamps of their respective .py files with those timestamps being different on each install.

My immediate use-case is sdist installs, where the modification time of the installed file can indeed vary with each recompilation.
@mhsmith's immediate use-case is wheel installs, where the modification times of the files in the wheel are currently not always retained upon installation (but have defined values that could/should be retained).

Enabling installation with content-hashes (what this issue is mainly about) fixes the reproducibility issue in all cases, but is only available in Python 3.7+.
Properly extracting the mtimes from wheels (@mhsmith's original comment), only fixes the issue in that context; it will however work on all interpreter versions. (+ Not ignoring mtimes that are already present is the right thing to do in next to all cases anyways.)

@jdemeyer

jdemeyer commented Aug 6, 2018

I wrote a proposal to distutils-sig for changing the timestamps of installed files. It's not directly related to this issue, but since it's about timestamps, you may find it relevant.

@ntninja
Author

ntninja commented Aug 10, 2018

@jdemeyer: In general it feels like timestamps should be retained as much as possible (it should only change if somebody, or something, actually changed the file's contents).

Also the comparison with autotools is not really fair in the wheel case since autotools is a build-system, but wheels (by definition) are not built – they are only installed – and so no modification of files, and therefore no timestamp changes, should take place. The only use-case I can find in your proposal for changing the timestamps regardless is to have a “when was this installed?” time of reference for recompilation tracking. Did you consider adding an extra …/installed marker file and touching/stating that as appropriate instead?

Regarding source installs I agree that preserving the timestamps is pretty problematic however (arbitrary files may be modified in unexpected ways). So restamping all files post-build seems like a reasonable idea; special-casing the .py files feels sub-optimal however. I'm not sure if there is any other solution however – you probably know that better than me.

^ Just my thoughts after reading your proposal. Maybe it's useful to you. 🙂

@jdemeyer

Also the comparison with autotools is not really fair in the wheel case since autotools is a build-system, but wheels (by definition) are not built – they are only installed

My proposal concerns precisely the installation part, not the build part. So I don't see why wheels should be different from a from-source build.

@jdemeyer

Did you consider adding an extra …/installed marker file and touching/stating that as appropriate instead?

No because that's completely incompatible with other build systems. The point here is to make Python-installed packages more similar to other-build-system installed packages.

@ntninja
Author

ntninja commented Aug 11, 2018

@jdemeyer: Because installation is fundamentally a non-mutating operation? („Take the files from there and put them here.“)
I also don't get why you say „I don't see why wheels should be different from a from-source build“ because there really is no reason. 🙂 When you take the source and build it into “binary” form (creating a wheel iirc) that obviously is a mutating operation, so retaining the previous timestamps doesn't really make sense (unless you really just copy-paste the files, maybe). But when you then install the generated wheel no files are actually changed anymore so why would you insist on marking the files as changed, by changing their timestamps, anyways?

Also IMHO a better comparison for the installation phase, in the PIP context, would be the Debian package manager: When you select a package to be installed it will simply assemble the list of packages, download them to the system, and then extract their contents to the filesystem – pretty much exactly what PIP does when it has wheels. (The fact that it will also run maintainer scripts is beside the point here.) The comparison with autotools doesn't even make sense unless you're talking about sdists. APT also retains the timestamps from packages it downloads.

@jdemeyer

jdemeyer commented Aug 12, 2018

The comparison with autotools doesn't even make sense unless you're talking about sdists.

Again, I'm only looking at the installation part, not the build part. In that respect:

  • Autotools installs files
  • pip installs files

Seems pretty much the same...

@jdemeyer

so why would you insist on marking the files as changed, by changing their timestamps, anyways?

As I explained in my proposal, for dependency checking. This is a good practical reason.

Note that neither shutil.copy nor the Unix cp tool keeps the timestamps of copied files. So your argument that, because it's a copy, the timestamps should be preserved is clearly not true.

@cjerdonek
Member

Note that neither shutil.copy nor the Unix cp tool keeps the timestamps of copied files.

Since you mention shutil.copy(), FWIW, shutil.copytree() uses shutil.copy2() to copy each individual file, thus preserving the timestamps of individual files. Since installation involves copying a directory of files, copytree() might be the appropriate comparison.

Note that I'm not expressing a preference for one way or the other -- I'm just noting the shutil.copytree() behavior.
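
For reference, a small sketch of the difference between the three functions on a file with an old mtime. The `demo` helper, the file names and the sample mtime are made up for illustration.

```python
import os
import shutil
import tempfile


def demo():
    """Return the mtimes produced by copy(), copy2() and copytree()."""
    old = 1_000_000_000  # arbitrary mtime far in the past
    with tempfile.TemporaryDirectory() as tmp:
        tree = os.path.join(tmp, "tree")
        os.mkdir(tree)
        src = os.path.join(tree, "mod.py")
        with open(src, "w") as fh:
            fh.write("x = 1\n")
        os.utime(src, (old, old))

        # copy() copies data and permission bits, but not timestamps.
        plain = shutil.copy(src, os.path.join(tmp, "copy.py"))
        # copy2() additionally tries to preserve metadata such as mtime.
        meta = shutil.copy2(src, os.path.join(tmp, "copy2.py"))
        # copytree() uses copy2() for each file by default.
        copied_tree = shutil.copytree(tree, os.path.join(tmp, "tree_copy"))
        in_tree = os.path.join(copied_tree, "mod.py")

        return tuple(int(os.stat(p).st_mtime) for p in (plain, meta, in_tree))


if __name__ == "__main__":
    print(demo())
```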

@toddrme2178

This is important for openSUSE when building from wheels. We are trying to make the builds for the whole project reproducible, and we are almost there. One of the very few areas we have left is Python packages installed from wheels. So it would be really helpful for us if there was some way to make packages relying on wheels reproducible.

@nanonyme

Let's please follow standard library on this: set CHECKED_HASH only if SOURCE_DATE_EPOCH is defined.
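
A minimal sketch of that selection rule. The helper name `choose_invalidation_mode` is hypothetical; it only illustrates how a caller might mirror the standard library's default and is not pip code.

```python
import os
import py_compile


def choose_invalidation_mode(environ=None):
    """Pick CHECKED_HASH when SOURCE_DATE_EPOCH is set, else TIMESTAMP."""
    if environ is None:
        environ = os.environ
    if environ.get("SOURCE_DATE_EPOCH"):
        return py_compile.PycInvalidationMode.CHECKED_HASH
    return py_compile.PycInvalidationMode.TIMESTAMP


if __name__ == "__main__":
    print(choose_invalidation_mode())
```

Passing the environment as a parameter keeps the rule easy to test without mutating `os.environ`.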

@dawagner

dawagner commented Nov 3, 2020

What is the consensus? I feel like the initial proposed solution (add a command line option for setting the invalidation mode passed to py_compile.compile, compileall.compile_dir, and compileall.compile_file) is the most obvious one and in line with what PEP-552 is trying to achieve.

I also have a use-case for this: I'd like to make reproducible docker layers (never mind the timestamp of installed files after the .pyc files have been generated because they can be modified after the fact, whereas the content of the generated .pyc files can not be so easily).
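
What such an option would end up doing can be sketched with `compileall.compile_dir` directly, assuming Python 3.7+. The helper names and the sample package layout are hypothetical, and the flag itself does not exist in pip as far as this thread shows.

```python
import compileall
import importlib.util
import py_compile
import tempfile
from pathlib import Path


def compile_tree_with_hashes(root):
    """Byte-compile every module under root using unchecked hash-based pycs."""
    return compileall.compile_dir(
        str(root),
        quiet=1,
        force=True,
        invalidation_mode=py_compile.PycInvalidationMode.UNCHECKED_HASH,
    )


def demo():
    """Compile a throwaway package and return (success, pyc header flags)."""
    with tempfile.TemporaryDirectory() as tmp:
        module = Path(tmp) / "pkg" / "mod.py"
        module.parent.mkdir()
        module.write_text("x = 1\n")
        ok = compile_tree_with_hashes(tmp)
        pyc = Path(importlib.util.cache_from_source(str(module), optimization=""))
        # Bytes 4..8 of the header hold the PEP 552 flags word.
        flags = int.from_bytes(pyc.read_bytes()[4:8], "little")
    return bool(ok), flags


if __name__ == "__main__":
    print(demo())
```

A flags value of 1 means "hash-based, source not checked", which is what UNCHECKED_HASH requests.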

@nanonyme

nanonyme commented Nov 3, 2020

That seems fine to me. Note SOURCE_DATE_EPOCH is the industry-standard way to say "make this reproducible".

@ntninja
Author

ntninja commented Nov 3, 2020

@dawagner: You might want to look at my docker-image-rebuilder which uses reproducibility in generated OCI/Docker images to only push image updates when the build actually resulted in a significant change. (You can also set an ignore list for files beyond reproducibility control.)

@nanonyme

nanonyme commented Nov 3, 2020

Note bytecode with pip should already be as reproducible as it gets. Which is not fully reproducible. Python itself cannot generate fully reproducible bytecode yet.

@nanonyme

nanonyme commented Dec 8, 2020

I'm now a bit worried: https://github.com/pypa/pip/blob/8ce5d5abbcc5bbcb8b3bf1f364ac6540a2b60b22/src/pip/_vendor/distlib/wheel.py does not in fact seem to enable hash-based validation. So possibly this is still a problem for wheel installs. (pip automatically creates wheels nowadays for everything before installing)

@nanonyme

nanonyme commented Dec 8, 2020

It seems this is not the right repo to fix the problem though. The file is vendored from https://bitbucket.org/pypa/distlib/src/207d4599a330913d51a4c5865745a49e11e851cb/distlib/wheel.py#lines-489

@uranusjr
Member

uranusjr commented Dec 9, 2020

IIRC distlib.wheel is not used anywhere in pip’s wheel installation implementation.

@facundobatista

Another point where reproducible installs are broken is when compiling the .pyc files while installing from a wheel, as the wheel unpacking is done in a temporary directory, whose name ends up in the compiled bytecode:

$ python3 -m venv testenv
$ hexdump -C testenv/lib/python3.8/site-packages/__pycache__/easy_install.cpython-38.pyc 
00000000  55 0d 0d 0a 00 00 00 00  bd 4d 5c 61 7e 00 00 00  |U........M\a~...|
00000010  e3 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000020  00 02 00 00 00 40 00 00  00 73 22 00 00 00 64 00  |.....@...s"...d.|
00000030  5a 00 65 01 64 01 6b 02  72 1e 64 02 64 03 6c 02  |Z.e.d.k.r.d.d.l.|
00000040  6d 03 5a 03 01 00 65 03  83 00 01 00 64 04 53 00  |m.Z...e.....d.S.|
00000050  29 05 7a 1b 52 75 6e 20  74 68 65 20 45 61 73 79  |).z.Run the Easy|
00000060  49 6e 73 74 61 6c 6c 20  63 6f 6d 6d 61 6e 64 da  |Install command.|
00000070  08 5f 5f 6d 61 69 6e 5f  5f e9 00 00 00 00 29 01  |.__main__.....).|
00000080  da 04 6d 61 69 6e 4e 29  04 da 07 5f 5f 64 6f 63  |..mainN)...__doc|
00000090  5f 5f da 08 5f 5f 6e 61  6d 65 5f 5f 5a 1f 73 65  |__..__name__Z.se|
000000a0  74 75 70 74 6f 6f 6c 73  2e 63 6f 6d 6d 61 6e 64  |tuptools.command|
000000b0  2e 65 61 73 79 5f 69 6e  73 74 61 6c 6c 72 03 00  |.easy_installr..|
000000c0  00 00 a9 00 72 06 00 00  00 72 06 00 00 00 fa 30  |....r....r.....0|
000000d0  2f 74 6d 70 2f 70 69 70  2d 75 6e 70 61 63 6b 65  |/tmp/pip-unpacke|
000000e0  64 2d 77 68 65 65 6c 2d  31 76 31 74 5f 35 37 35  |d-wheel-1v1t_575|
000000f0  2f 65 61 73 79 5f 69 6e  73 74 61 6c 6c 2e 70 79  |/easy_install.py|
00000100  da 08 3c 6d 6f 64 75 6c  65 3e 01 00 00 00 73 06  |..<module>....s.|
00000110  00 00 00 04 02 08 01 0c  01                       |.........|
00000119

(see /tmp/pip-unpacked-wheel-1v1t_575/easy_install.py inside there)
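
The embedded path is the code object's filename. A minimal sketch, assuming Python 3.7+, of how the `dfile` argument of `py_compile.compile` can pin that path to a stable value; the helper names and the pinned path are hypothetical, and this is not necessarily how pip handles it.

```python
import py_compile
import tempfile
from pathlib import Path


def compile_as(source, cfile, display_path=None):
    """Compile source to cfile, optionally recording display_path as its filename."""
    py_compile.compile(str(source), cfile=str(cfile), dfile=display_path,
                       doraise=True)
    return Path(cfile).read_bytes()


def demo():
    """Return whether the scratch path leaks into each of the two pycs."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "easy_install.py"
        source.write_text('"""Run the EasyInstall command"""\n')
        scratch = str(source).encode()
        # Default: the path the file was compiled from is embedded.
        default = compile_as(source, Path(tmp) / "default.pyc")
        # With dfile: a stable, caller-chosen path is embedded instead.
        pinned = compile_as(source, Path(tmp) / "pinned.pyc",
                            "site-packages/easy_install.py")
    return scratch in default, scratch in pinned


if __name__ == "__main__":
    print(demo())
```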

@nanonyme

nanonyme commented Oct 5, 2021

Have you tried with new --use-feature=in-tree-build?

@facundobatista

Nop, don't have that in pip 20.1.1 :/. Thanks for the tip, though!

10 participants