Improve startup time. #32

This issue was moved to a discussion. You can continue the conversation there.

Closed
markshannon opened this issue Mar 25, 2021 · 48 comments

@markshannon
Member

markshannon commented Mar 25, 2021

Before we attempt this, we need to know where the time is spent.
@ericsnowcurrently Do you have any profiling information from running python -S -c "print('Hi')", or a similar script?

It takes about 10 ms on my machine for 3.9 which seems a long time to print "Hi".
For comparison, 2.7 takes about 7ms.

I suspect there is a lot of incidental stuff going on that we can eliminate.

The tasks that have to be performed to run python -S -c "print('Hi')" are:

  • Load executable from disk (it should be in memory cache, but it still needs work by the O/S).
  • Build config object from command line and environment variables
  • Create a new interpreter
  • Create a new thread
  • Compile (including parsing) print('Hi')
  • Execute print('Hi')
  • Dispose of thread
  • Dispose of interpreter

None of those tasks should take long, so what is slow?
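For reference, here is one rough way to reproduce those wall-clock numbers from Python itself; the interpreter names are assumptions, and a dedicated tool such as hyperfine or perf stat will give more careful numbers:

import subprocess
import time

def time_startup(argv, repeat=50):
    # Run the command repeatedly and keep the best wall-clock time,
    # which approximates the cold-start cost with less scheduling noise.
    best = float("inf")
    for _ in range(repeat):
        t0 = time.perf_counter()
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - t0)
    return best

print("3.9:", time_startup(["python3.9", "-S", "-c", "print('Hi')"]))
print("2.7:", time_startup(["python2.7", "-S", "-c", "print('Hi')"]))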

@gvanrossum
Collaborator

We also need to initialize some extension modules, sys, and builtins.

@markshannon
Member Author

markshannon commented Mar 25, 2021

If creating and/or destroying the new interpreter is what takes the time, then one possible way to speed it up is to freeze the object graph of a newly created interpreter.

That would work roughly as follows:

  1. Modify CPython to walk the object graph, immediately after interpreter creation, dumping it to a text file.
  2. Convert that text file into a declarative spec of the initial object graph.
  3. Check in that spec; we will want to modify it later, and it will probably need fixing up by hand.
  4. Discard the modified CPython, we shouldn't need it anymore.
  5. Make a tool that generates two pieces of C code:
    • A static structure containing the whole object graph, with offsets not pointers
    • A table containing offsets of those offsets

We can now create the entire object graph for the interpreter by:

  1. Copying the static structure into the newly allocated memory for the interpreter.
  2. Traversing the offsets of offsets, converting them into pointers.

The above assumes that interpreters are fully independent.
Since they are not, at least not yet, we need to break the object graph into the part belonging to the runtime and the part belonging to each interpreter. For the runtime (static objects) there is no need for offsets; just create the graph with pointers. For the interpreter, the data structure needs to mark offsets so we can tell which are pointers into the runtime, and which are pointers into the interpreter.
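As a toy Python model of steps 1 and 2 (the real thing would be C structs and real pointers; the 8-byte little-endian slot layout here is invented purely for illustration):

import struct

def hydrate(image, fixups, base):
    # Step 1: copy the frozen image into freshly "allocated" memory.
    buf = bytearray(image)
    # Step 2: walk the table of offsets-of-offsets, turning each stored
    # offset into an absolute address by adding the copy's base address.
    for off in fixups:
        (rel,) = struct.unpack_from("<Q", buf, off)
        struct.pack_into("<Q", buf, off, base + rel)
    return buf

# Tiny image: an 8-byte "pointer" slot at offset 0 that should point at
# the data stored at offset 8 within the same image.
image = struct.pack("<Q", 8) + b"spam&egg"
print(hydrate(image, fixups=[0], base=0x1000)[:8].hex())  # 0810000000000000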

@markshannon
Member Author

I've just remembered python3 -v.

Running python3 -v -S -c "print('Hi')" prints

import _frozen_importlib # frozen
import _imp # builtin
import '_thread' # <class '_frozen_importlib.BuiltinImporter'>
import '_warnings' # <class '_frozen_importlib.BuiltinImporter'>
import '_weakref' # <class '_frozen_importlib.BuiltinImporter'>
import '_io' # <class '_frozen_importlib.BuiltinImporter'>
import 'marshal' # <class '_frozen_importlib.BuiltinImporter'>
import 'posix' # <class '_frozen_importlib.BuiltinImporter'>
import '_frozen_importlib_external' # <class '_frozen_importlib.FrozenImporter'>
# installing zipimport hook
import 'time' # <class '_frozen_importlib.BuiltinImporter'>
import 'zipimport' # <class '_frozen_importlib.FrozenImporter'>
# installed zipimport hook
# /home/mark/repos/cpython/Lib/encodings/__pycache__/__init__.cpython-310.pyc matches /home/mark/repos/cpython/Lib/encodings/__init__.py
# code object from '/home/mark/repos/cpython/Lib/encodings/__pycache__/__init__.cpython-310.pyc'
# /home/mark/repos/cpython/Lib/__pycache__/codecs.cpython-310.pyc matches /home/mark/repos/cpython/Lib/codecs.py
# code object from '/home/mark/repos/cpython/Lib/__pycache__/codecs.cpython-310.pyc'
import '_codecs' # <class '_frozen_importlib.BuiltinImporter'>
import 'codecs' # <_frozen_importlib_external.SourceFileLoader object at 0x7f6fbcf2d4f0>
# /home/mark/repos/cpython/Lib/encodings/__pycache__/aliases.cpython-310.pyc matches /home/mark/repos/cpython/Lib/encodings/aliases.py
# code object from '/home/mark/repos/cpython/Lib/encodings/__pycache__/aliases.cpython-310.pyc'
import 'encodings.aliases' # <_frozen_importlib_external.SourceFileLoader object at 0x7f6fbcec6d00>
import 'encodings' # <_frozen_importlib_external.SourceFileLoader object at 0x7f6fbcf2d340>
# /home/mark/repos/cpython/Lib/encodings/__pycache__/utf_8.cpython-310.pyc matches /home/mark/repos/cpython/Lib/encodings/utf_8.py
# code object from '/home/mark/repos/cpython/Lib/encodings/__pycache__/utf_8.cpython-310.pyc'
import 'encodings.utf_8' # <_frozen_importlib_external.SourceFileLoader object at 0x7f6fbcf2d370>
import '_signal' # <class '_frozen_importlib.BuiltinImporter'>
# /home/mark/repos/cpython/Lib/__pycache__/io.cpython-310.pyc matches /home/mark/repos/cpython/Lib/io.py
# code object from '/home/mark/repos/cpython/Lib/__pycache__/io.cpython-310.pyc'
# /home/mark/repos/cpython/Lib/__pycache__/abc.cpython-310.pyc matches /home/mark/repos/cpython/Lib/abc.py
# code object from '/home/mark/repos/cpython/Lib/__pycache__/abc.cpython-310.pyc'
import '_abc' # <class '_frozen_importlib.BuiltinImporter'>
import 'abc' # <_frozen_importlib_external.SourceFileLoader object at 0x7f6fbcec6f70>
import 'io' # <_frozen_importlib_external.SourceFileLoader object at 0x7f6fbcec6e50>

@ericsnowcurrently
Collaborator

Before we attempt this, we need to know where the time is spent.
@ericsnowcurrently Do you have any profiling information from running python -S -c "print('Hi')", or a similar script?

I don't have any such info presently. I know there were investigations in the past that involved such profiling but I expect any resulting profiling data is outdated at this point.

Regardless, it shouldn't take much effort to get at least some basic data. Furthermore, we can already get at least some insight by running ./python -v -S -c "print('Hi')".

@gvanrossum
Collaborator

Regarding Mark's strategy, can you explain what we would have to do when we need to change the object graph? IIUR currently we change some of the modules that are frozen (_frozen_importlib and friends), and then regenerate the frozen bytecode (for which we have tools). I am worried that your item (3) makes this process more complicated.

Otherwise, it sounds like a decent strategy (many data structures are already static webs of pointers, the new thing is that we allow static webs of PyObject pointers).

And yes, we need to work on profiling startup.

@ericsnowcurrently
Collaborator

FTR, Facebook/Instagram has been using a related strategy, which they described at the language summit a couple of years back.

@methane

methane commented May 11, 2021

I've just remembered python3 -v.

Please don't forget -X importtime (or PYTHONPROFILEIMPORTTIME=1 python) too.

@gvanrossum
Collaborator

Mark has a nascent proposal

@fweimer

fweimer commented May 23, 2021

Mark has a nascent proposal

This is only beneficial if the size of mapped files exceeds a certain threshold, which is fairly large (probably larger than 100 KiB by now). Below that, copying is faster. With small files, I expect a run-time penalty even after loading because the less dense packing of information (distributed across a larger number of pages) results in increased TLB misses. And by default, Linux will only allow 65,530 separate mappings per process.

@gvanrossum
Collaborator

gvanrossum commented May 23, 2021

My intuition is that it will be hard to show that mmap is faster than copying here, so I would personally be happy if the first version just read (i.e., copied) the file into memory.

@markshannon
Member Author

The point of the proposal is to avoid the overhead of unmarshalling; whether the file is copied or mapped isn't so important.

The important thing is that the pyc format is position independent (no pointers), is immutable, and requires minimal decoding before being usable.
This is important, because it means we can merge the pyc files of common modules into a single file, or embed them in the interpreter.

@gvanrossum
Collaborator

This is important, because it means we can merge the pyc files of common modules into a single file, or embed them in the interpreter.

Oh, that's really neat!

@gvanrossum
Collaborator

I just discovered -X importtime. Running 3.10 with this, -S, and '-c pass' on my Windows laptop, shows:

import time: self [us] | cumulative | imported package
import time:       169 |        169 |   _io
import time:        38 |         38 |   marshal
import time:       135 |        135 |   nt
import time:        66 |         66 |   winreg
import time:       708 |       1114 | _frozen_importlib_external
import time:       593 |        593 |   time
import time:       294 |        886 | zipimport
import time:       124 |        124 |     _codecs
import time:      1108 |       1231 |   codecs
import time:      1105 |       1105 |   encodings.aliases
import time:      2101 |       4436 | encodings
import time:       702 |        702 | encodings.utf_8
import time:      1028 |       1028 | encodings.cp1252
import time:       106 |        106 | _signal
import time:        52 |         52 |     _abc
import time:      1504 |       1555 |   abc
import time:      1311 |       2866 | io

The biggest cumulative time (4.4 ms) is the encodings package, plus another 1.7 ms for two specific encodings. Maybe we can somehow avoid importing the full encodings package, and only import the encoding we really need?

The next biggest cost is io (2.8 ms), which is largely due to importing abc. Again, a possible target for specific hacks. (In both cases I suspect @vstinner has already thought about this.)

Of course, without -S, we spend a ton of time (25-30 ms!) importing site, which pulls in the kitchen sink. A possible hack targeting this could be to write a configuration file somewhere that lists the outcome of all the work done by site, plus a list of directories and their mtimes that should be statted so that the full thing only has to run if any of those directories have changed. Or something like that. (Maybe a hash of certain directories and files would be needed -- I imagine even that is faster than importing os and contextlib.)

Here's the raw data for that:

import time: self [us] | cumulative | imported package
import time:       223 |        223 |   _io
import time:        62 |         62 |   marshal
import time:       162 |        162 |   nt
import time:        73 |         73 |   winreg
import time:       960 |       1477 | _frozen_importlib_external
import time:       636 |        636 |   time
import time:       299 |        935 | zipimport
import time:       141 |        141 |     _codecs
import time:      1381 |       1522 |   codecs
import time:      1226 |       1226 |   encodings.aliases
import time:      2273 |       5020 | encodings
import time:       655 |        655 | encodings.utf_8
import time:       603 |        603 | encodings.cp1252
import time:        50 |         50 | _signal
import time:        45 |         45 |     _abc
import time:       762 |        806 |   abc
import time:       750 |       1556 | io
import time:        69 |         69 |       _stat
import time:       684 |        753 |     stat
import time:      1259 |       1259 |     _collections_abc
import time:       893 |        893 |       genericpath
import time:      1332 |       2225 |     ntpath
import time:      1284 |       5519 |   os
import time:       947 |        947 |   _sitebuiltins
import time:      1381 |       1381 |   types
import time:       695 |        695 |       warnings
import time:       852 |       1546 |     importlib
import time:       558 |        558 |     importlib._abc
import time:       140 |        140 |         itertools
import time:       585 |        585 |         keyword
import time:        74 |         74 |           _operator
import time:       754 |        827 |         operator
import time:       643 |        643 |         reprlib
import time:       540 |        540 |         _collections
import time:      1422 |       4154 |       collections
import time:        69 |         69 |         _functools
import time:      1073 |       1142 |       functools
import time:      1040 |       6335 |     contextlib
import time:       972 |       9410 |   importlib.util
import time:       755 |        755 |   importlib.machinery
import time:       715 |        715 |   sitecustomize
import time:       383 |        383 |   usercustomize
import time:      6707 |      25813 | site
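Coming back to the idea of caching the work done by site: a minimal sketch, assuming a hypothetical cache file format and location (a real version would also have to watch .pth files and user site directories):

import json
import os

CACHE_FILE = os.path.expanduser("~/.cache/cpython-site-cache.json")  # hypothetical location

def load_site_cache():
    # Reuse the previously computed sys.path additions only if none of the
    # watched directories changed since the cache was written.
    try:
        with open(CACHE_FILE) as f:
            cache = json.load(f)
    except (OSError, ValueError):
        return None
    for directory, mtime in cache["mtimes"].items():
        try:
            if os.stat(directory).st_mtime_ns != mtime:
                return None
        except OSError:
            return None
    return cache["paths"]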

@nascheme

Some comments about this based on work I did at previous core sprints. Mark's ideas seem good to me and are along the same lines I was thinking. The unmarshal step is fairly expensive and I think we could reduce that with a change to the code object layout. Currently, the VM expects code objects to be (mostly) composed of PyObjects. It doesn't need to be that way and we could create a code object memory layout that's much cheaper to load from disk. I suggested something similar in spirit to Cap’n Proto. We probably want to keep the on-disk files machine independent.

The importlib overhead has been optimized over the years so there is no big low-hanging fruit there. However, it seems wasteful how much work is done by importlib just to start up. I think we could dump all needed startup modules into a kind of optimized bundle (e.g. better optimized than a zipped package) and eliminate the importlib overhead. E.g. on startup, call a C function that unmarshals all the code in the bundle and then executes some bytecode to finish the startup. It doesn't need to be linked with the executable (as frozen modules are) but could be an external file, e.g. in <prefix>/lib/python-<xyz>/_startup.pyb. That reduces bootstrap problems. A further refinement is to let application developers create their own bundles, for fast startup. E.g. a tool like Mercurial might want to do that. Have a command-line option or env var that instructs Python to load additional bundles after the startup bundle.
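A pure-Python sketch of what building and loading such a bundle could look like (the module list and the .pyb layout are illustrative; a real implementation would live in C and handle packages, frozen modules, and cache invalidation):

import importlib.util
import marshal

STARTUP_MODULES = ["codecs", "io", "abc", "os", "stat", "posixpath"]  # illustrative list

def build_bundle(path):
    # Compile each module's source and marshal everything into one file, so
    # startup needs a single open/read/unmarshal instead of one per module.
    bundle = {}
    for name in STARTUP_MODULES:
        spec = importlib.util.find_spec(name)
        source = spec.loader.get_source(name)
        if source is None:  # frozen or extension module; skip in this sketch
            continue
        bundle[name] = compile(source, spec.origin or name, "exec")
    with open(path, "wb") as f:
        marshal.dump(bundle, f)

def load_bundle(path):
    # One read and one unmarshal; exec'ing each code object into a fresh
    # module namespace is the part import would normally interleave with this.
    with open(path, "rb") as f:
        return marshal.load(f)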

When Python starts, the interleaving of __import__ with the unmarshal step and top-level code execution of modules makes it difficult to profile where time is being spent. If you look at a profile trace or output from -X importtime, those costs are intermixed. I wonder if there is some performance advantage to do similar operations together. E.g. 1. load all the code for startup from disk, 2. unmarshal all the code, 3. execute top-level code for all modules. It could help making sense of the profile and focusing efforts.

Another idea: try to reduce the amount of work done when executing top-level module code. Quite a bit of work has been done over the years to try to reduce startup time and therefore what's done at the top-level of modules imported at startup. E.g. do it the first time a function is called, rather than at import time. However, maybe we can do better. My lazy top-level code execution idea was one approach. The major killer of that was that metaclasses can basically execute anything. My AST walker couldn't find too much that could be lazily executed. I think Cinder has done something like this but I don't know details (StrictModule annotation?). Another issue: doing something lazily doesn't help performance if you end up doing it anyhow. I think I was running into such an issue.

A twist on the lazy code execution is to just defer the loading/unmarshal of code objects of functions, until the function is actually called. It seems likely that quite a few functions in modules loaded at startup are never actually executed. So, does it help just to never load the code for them? I had a prototype of this idea but it didn't show much of a win. Might be worth revisiting. The original idea was inspired by something Larry did with PHP years ago.
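In pure-Python terms the deferral amounts to something like this toy wrapper (the real prototype hooked the equivalent into function creation in C):

import marshal

class LazyCode:
    # Keep the marshalled bytes and only pay for marshal.loads() the first
    # time the code object is actually needed (i.e. when the function is called).
    def __init__(self, blob):
        self._blob = blob
        self._code = None

    def get(self):
        if self._code is None:
            self._code = marshal.loads(self._blob)
            self._blob = None
        return self._code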

Another idea: allow some code to be executed at module compile time. I.e. Python code that gets called by the compiler to compute objects that end up inside the .pyc. A simple example: assume my module needs to compute a dict that will become a constant global. It could be possible to do that work at compile time (rather than at import time) and then store the result of that computation in the .pyc. Something like:

if __compiletime__:
   GLOBAL_DICT = _compute_it_at_compile_time()

Obviously there would be restrictions on what you can do at compile time. This would be an advanced feature, to be used carefully. Another place it could be used is to pre-compile regular expressions to something that can be quickly loaded at import time and still have good runtime performance. You can do something a bit like this today by having code that generates a .py file and then compiling that. However, having it as a feature of the Python compiler would be nice (aside from the language complexity increase). Perhaps things like dataclasses could use the feature to do more work at compile time.
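The generate-a-.py-file workaround mentioned above looks roughly like this (a toy sketch; compute_table and the module name are made up):

def compute_table():
    # Stand-in for an expensive computation we would like to pay for only once.
    return {n: n * n for n in range(10)}

# Build step: write the computed constant into a module, so that at runtime
# importing generated_constants just unmarshals a constant dict from its .pyc.
with open("generated_constants.py", "w") as f:
    f.write(f"GLOBAL_DICT = {compute_table()!r}\n")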

@JelleZijlstra
Contributor

JelleZijlstra commented May 28, 2021

The if __compiletime__ idea would also be interesting for enums. In profiling I've seen a lot of import time being spent creating enum classes, and at least in theory that work could all be done at compile time.

@gvanrossum
Collaborator

We probably want to keep the on-disk files machine independent.

Cap'n Proto's solution for this seems to be to declare that most modern CPUs are little-endian, so it uses that (plus fixed integer sizes of course). If that's what you meant, fine. (Can anyone assure me that Apple Silicon is little-endian? Is it really just a variant of ARM?)

[...]

The importlib overhead has been optimized over the years so there is no big low hanging fruit there.

What caused the original importlib overhead?

[...]

It doesn't need to be linked with the executable (as frozen modules)

If that's true, frozen modules wouldn't strictly need to be part of the executable either, right? (But either way this wouldn't make things faster, it would just add more flexibility to the build process.)

[...]

E.g. 1. load all the code for startup from disk, 2. unmarshal all the code, 3. execute top-level code for all modules.

I'm guessing you're saying this approach would be faster because each stage blows away the I-cache of the previous stage (the D-cache is blown anyways since the data is different each time, I assume?), and if we alternate per module we spend a lot of time refreshing the I-cache from memory? But I find it hard to believe that the per-module time is so small that it would make much of a difference.

[...]

Another issue: doing something lazily doesn't help performance if you end up doing it anyhow. I think I was running into such an issue.

Depends on the application code, right? If the application imports io, there's not much of an advantage to delaying its import, even if it were to speed up getting to the >>> prompt. Ditto for encodings. OTOH even a large webapp (I'm thinking Dropbox here, but you could think Instagram or any Django or Flask app) probably doesn't call a lot of the code that it imports, so delaying construction of code objects until they are actually called should almost surely help. Maybe when you tried this you didn't try it with a large-enough app? Or maybe your approach didn't save enough work compared to just constructing a code object?

A premise of all these approaches is that unmarshalling code objects is slow. That should be easy enough to measure. E.g. compile everything in the stdlib, marshal.dumps() all the toplevel code objects (this will pull in the nested code objects too), and then time calling marshal.loads() on those strings a bunch. I've looked at this a tiny bit, and I'm thinking that either marshal.loads() isn't actually that slow, or it can be sped up by inlining a few key functions (e.g. r_long()).
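A quick sketch of that measurement (top-level stdlib modules only; timings will vary by machine and Python version):

import marshal
import pathlib
import sysconfig
import time

def bench_marshal_loads(repeat=20):
    # Compile the top-level stdlib modules, marshal.dumps() the code objects,
    # then time how long marshal.loads() takes over all of them.
    stdlib = pathlib.Path(sysconfig.get_paths()["stdlib"])
    blobs = []
    for path in stdlib.glob("*.py"):
        try:
            code = compile(path.read_text(encoding="utf-8"), str(path), "exec")
        except (SyntaxError, UnicodeDecodeError):
            continue
        blobs.append(marshal.dumps(code))
    start = time.perf_counter()
    for _ in range(repeat):
        for blob in blobs:
            marshal.loads(blob)
    per_pass = (time.perf_counter() - start) / repeat
    print(f"{len(blobs)} modules, {per_pass * 1000:.1f} ms per loads() pass")

bench_marshal_loads()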

My developing intuition is that a lot of other things happening at import time are actually slower, for example constructing classes (type objects are bulky) and various "init-time" initializations like initializing static tables, creating locks, and the like. Oh, and don't forget calling decorators and metaclasses! And random things like creating loggers. (Also I think someone just mentioned that enums are expensive to create?)

I also just realized that there's a pattern that I like that's actually pretty inefficient -- having a package __init__.py that imports all submodules of that package so it can re-export various classes. E.g. asyncio does this on a massive scale. Though to fix it we'd also have to start using nested imports at a massive scale, which has its own issues (even a cached import is much slower than LOAD_GLOBAL).

@methane

methane commented May 29, 2021

Of course, without -S, we spend a ton of time (25-30 ms!) importing site, which pulls in the kitchen sink.

On my machine:

$ PYTHONPROFILEIMPORTTIME=x local/python-dev/bin/python3 -c pass
import time: self [us] | cumulative | imported package
import time:       145 |        145 |   _io
import time:        31 |         31 |   marshal
import time:       258 |        258 |   posix
import time:       516 |        948 | _frozen_importlib_external
import time:        81 |         81 |   time
import time:       157 |        238 | zipimport
import time:        46 |         46 |     _codecs
import time:       445 |        490 |   codecs
import time:       361 |        361 |   encodings.aliases
import time:       450 |       1300 | encodings
import time:       178 |        178 | encodings.utf_8
import time:        96 |         96 | _signal
import time:        48 |         48 |     _abc
import time:       205 |        253 |   abc
import time:       227 |        480 | io
import time:        49 |         49 |       _stat
import time:       229 |        278 |     stat
import time:       854 |        854 |     _collections_abc
import time:       132 |        132 |       genericpath
import time:       184 |        316 |     posixpath
import time:       555 |       2001 |   os
import time:       155 |        155 |   _sitebuiltins
import time:       165 |        165 |   sitecustomize
import time:        49 |         49 |   usercustomize
import time:       493 |       2860 | site

Maybe slow .pth files in your site-packages?

@nascheme

We probably want to keep the on-disk files machine independent.

Cap'n Proto's solution for this seems to be to declare that most modern CPUs are little-endian, so it uses that (plus fixed integer sizes of course). If that's what you meant, fine.

What I was thinking is that at one extreme, a startup bundle could be totally machine dependent. Something like writing out PyObject structures from memory to disk. Obviously you can't reload those without fixing up internal pointers, etc. Some dynamic languages used to work like that, e.g. Emacs, Smalltalk: do a sort of core dump and then use some ugly code to reload it and revive it as a running process. Fast startup but non-portable and ugly. I don't recommend we try that approach.

The other extreme is complete portability, like we have now with the pyc format. Would it be okay to make startup bundles machine dependent if we get some speed boost? The startup bundle would be closely associated with the Python executable, so it's not like you are putting them on PyPI or something. An example might be storing strings already in the in-memory representation that will be used on that OS/machine. I think it is not worth giving up machine independence right now.

[...]

The importlib overhead has been optimized over the years so there is no big low hanging fruit there.

What caused the original importlib overhead?

When first developed, I think it was a pure Python implementation. Later, some hotspots were re-written as C code to reduce the overhead (e.g. import_find_and_load and similar in import.c).

[...]

It doesn't need to be linked with the executable (as frozen modules)

If that's true, frozen modules wouldn't strictly need to be part of the executable either, right? (But either way this wouldn't make things faster, it would just add more flexibility to the build process.)

Yes, the main reason would be for better flexibility. A downside is that if the startup bundle is missing, the executable can't do much. That doesn't seem too terrible to me. You already need the stuff in prefix to do much.

[...]

E.g. 1. load all the code for startup from disk, 2. unmarshal all the code, 3. execute top-level code for all modules.

I'm guessing you're saying this approach would be faster because each stage blows away the I-cache of the previous stage (the D-cache is blown anyways since the data is different each time, I assume?), and if we alternate per module we spend a lot of time refreshing the I-cache from memory? But I find it hard to believe that the per-module time is so small that it would make much of a difference.

Yes that was my thought and I think you are right that it is unlikely to help in a significant way. The more useful profiling report would be the main reason to do it.

[...]

Another issue: doing something lazily doesn't help performance if you end up doing it anyhow. I think I was running into such an issue.

Depends on the application code, right? If the application imports io, there's not much of an advantage to delaying its import, even if it were to speed up getting to the >>> prompt. Ditto for encodings. OTOH even a large webapp (I'm thinking Dropbox here, but you could think Instagram or any Django or Flask app) probably doesn't call a lot of the code that it imports, so delaying construction of code objects until they are actually called should almost surely help. Maybe when you tried this you didn't try it with a large-enough app? Or maybe your approach didn't save enough work compared to just constructing a code object?

My experiment was quick and dirty. I still loaded the code object from disk into memory as a bytestring. I just deferred the unmarshal call on it. Not even reading the data would be faster (although you have a problem if the .pyc file goes away) but I didn't get to testing that. The Python startup modules are probably already refactored such that you don't load a bunch of functions you don't actually need.

A premise of all these approaches is that unmarshalling code objects is slow. That should be easy enough to measure. E.g. compile everything in the stdlib, marshal.dumps() all the toplevel code objects (this will pull in the nested code objects too), and then time calling marshal.loads() on those strings a bunch. I've looked at this a tiny bit, and I'm thinking that either marshal.loads() isn't actually that slow, or it can be sped up by inlining a few key functions (e.g. r_long()).

I did some profiling with Linux "perf". I'm not sure I did it correctly though. I wrote a small C program that repeatedly fork/execs ./python. The Python script just calls os._exit(0), in order to avoid the shutdown cost. Some hotspots:

Percent total   Function
21.9            r_object
14.6            PyType_Ready
13.8            marshal_loads
13.1            PyObject_Malloc
 9.2            update_one_slot
 8.2            PyMarshal_ReadObjectFromString
 4.2            os_listdir

Note that "Percent total" includes time spent in called functions. So, it seems unmarshal is not the dominant startup cost, but it is probably worth some effort to optimize more. Putting things in a startup bundle could maybe eliminate the os_listdir calls, which seems like a pretty easy win.

@nascheme

In case someone is interested in trying to reproduce the "perf" results, here are some instructions. You need to compile with -fno-omit-frame-pointer -ggdb in OPT in order to get better results. Run perf like this:

perf record -F9999 --call-graph lbr ./a.out

or

perf record -F9999 --call-graph dwarf ./a.out

Use the former if you have an Intel CPU that supports it, since the recordings are a lot smaller (like 10X smaller). Once done recording, you can generate a report like this:

perf report -g graph,2 -n

Below is my quick and dirty C program to run Python multiple times; because perf is a statistical profiler, your code must be running for a fair amount of time.

#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main() {
    int pid;
    /* Fork/exec ./python repeatedly so perf can collect enough samples. */
    for (int i = 0; i < 4000; i++) {
        pid = fork();
        if (pid == 0) {
            /* Child: run the interpreter on a script that exits immediately. */
            execl("./python", "./python", "exit.py", (char *)NULL);
            return 0;
        }
        /* Parent: wait for the child before starting the next one. */
        int status;
        wait(&status);
    }
    return 0;
}

exit.py is just

import os
os._exit(0)

The annotated source code feature is really neat. When on a C function, press the 'a' key to see annotated code. I like to push 't' until the left column shows Samples.

@nascheme

Here's another report from Linux "perf". I hacked up my copy of Python to load startup module code from a single file. The unmarshal for all startup modules is done up-front as a batch, within the PyImport_InitStartupModules function. Then PyImport_ImportFrozenModuleObject and import_find_and_load will use those code objects if they exist in memory. This makes the profile easier to read since the unmarshal work is not interleaved with bytecode execution, etc.

The set of startup modules are:

_frozen_importlib
zipimport
codecs
encodings.aliases
encodings.utf_8
io
abc
site
os
stat
_collections_abc
posixpath
genericpath
_sitebuiltins

Profile report is below. I don't know why the total time is not 100% or why _PyEval_EvalFrameDefault is reported as taking more time than _start. That doesn't make sense to me. Something to do with recursive calls confusing the perf accounting?

Child    Self    Symbol
48.27%   3.53%   _PyEval_EvalFrameDefault
47.78%   0.33%   _PyEval_Vector
41.71%   0.00%   pymain_init
41.51%   0.00%   pymain_main
41.50%   0.00%   Py_BytesMain
40.71%   0.00%   Py_InitializeFromConfig
39.96%   0.00%   __libc_start_main
39.52%   0.00%   _start
38.99%   0.00%   pyinit_core.constprop.0
37.82%   0.00%   pycore_interp_init
28.16%   0.05%   PyImport_ImportModuleLevelObject
26.14%   3.77%   r_object
23.21%   0.00%   PyImport_InitStartupModules
23.02%   0.00%   PyMarshal_ReadObjectFromFile
20.42%   0.03%   cfunction_vectorcall_FASTCALL_KEYWORDS
19.41%   0.01%   PyEval_EvalCode
19.23%   0.07%   _PyObject_MakeTpCall
17.55%   0.00%   exec_code_in_module
16.70%   0.07%   type_call
15.43%   0.04%   builtin___build_class__
15.05%   0.02%   object_vacall
15.01%   0.01%   _PyObject_CallMethodIdObjArgs
12.97%   0.02%   cfunction_vectorcall_O
11.40%   3.55%   PyObject_Malloc
11.37%   1.81%   PyType_Ready
11.19%   0.19%   type_new
 9.63%   9.25%   _Py_dict_lookup
 9.55%   0.00%   0xffffffffa5000ade
 9.24%   0.01%   cfunction_call
 9.10%   0.04%   method_vectorcall
 8.70%   0.03%   _PyObject_Call_Prepend
 7.90%   1.09%   PyDict_SetDefault

If the profiling can be trusted, PyMarshal_ReadObjectFromFile and PyEval_EvalCode are taking about equal amounts of time. So, optimizing the unmarshal part of startup should provide some benefit.

@nascheme

nascheme commented May 31, 2021

If we want to reduce the unmarshal cost, the idea of storing code objects as static C structures seems worth investigating again. I dusted off Larry's not_another_use_of_the_word_frozen branch. It seems to be mostly working after a re-base on Python 3.8.10. I haven't done much for profiling it yet as I'm still trying to clean it up and fix bugs:

https://github.com/nascheme/cpython/tree/static_frozen

@vstinner

My notes on Python startup time: https://pythondev.readthedocs.io/startup_time.html

Time to time, I reduce the number of imports done at startup. Python 3.10: https://docs.python.org/dev/whatsnew/3.10.html#optimizations

"The runpy module now imports fewer modules. The python3 -m module-name command startup time is 1.4x faster in average. On Linux, python3 -I -m module-name imports 69 modules on Python 3.9, whereas it only imports 51 modules (-18) on Python 3.10. (Contributed by Victor Stinner in bpo-41006 and bpo-41718.)"

pth files are really bad for startup time performance; I hope that one day we will be able to remove them. https://www.python.org/dev/peps/pep-0648/ may be a good replacement.

@gvanrossum
Collaborator

Thanks for your notes, @vstinner. You have already done so much work on so many of the problems I'm hoping to tackle!

One interesting realization I just had is that improving PYC file load times will help all kinds of startup times -- not just how fast we get to the '>>>' prompt or how fast we can run a small script, but also how fast a large webapp that loads 100s or 1000s of modules becomes ready to handle its first request. For web backend developers that is a pretty important thing: edit your code, restart server, refresh browser, rinse and repeat...

@carljm

carljm commented Jun 2, 2021

also how fast a large webapp starts

Yes, this is a significant developer-experience issue for us at Instagram.

I think Cinder has done something like this but I don't know details (StrictModule annotation?).

StrictModules is a static analyzer (plus a bit of runtime support) that can validate conservatively that executing a module has no side effects outside that module. That means it doesn't trigger code execution from any non-strict module (via decorator, metaclass, __init_subclass__, or just normal code execution at module level), and if it triggers code execution from another strict module, that code in turn is validated as not having side effects. Knowing that module execution doesn't have external side effects is a useful property to enable other optimizations, including lazy imports and lazy code execution.

One idea is that strict modules could enable a more efficient form of "pyc" file that's more like a java class file instead of a bytecode cache; instead of persisting bytecode to be executed to construct the module, it could persist metadata about top-level classes and functions and variables in the module (including bytecode for functions and methods of course), potentially allowing for a more efficient "rehydration" of the module contents than can be achieved by executing bytecode. This is an idea we haven't really had time to validate or follow up on, though.

Ultimately though strict modules is a significant semantic change; it makes sense for large codebases but I think many Python developers would find it uncomfortable. The analyzer is sophisticated enough that it still allows almost all forms of dynamism at module top-level, as long as the effects are contained within the module, but many common patterns (particularly "registry" type uses of decorators/metaclasses/init_subclass) rely on external side effects of module import :/ So I doubt it's a productive path for generalized optimization of Python startup.

@gvanrossum
Collaborator

Strict Modules are a fascinating subject, but I agree that the semantic changes make it hard to swallow -- especially in the context of "Faster CPython" where we're trying to bend backwards to avoid semantic changes, in order to be able to have the entire community benefit from the changes.

It's a shame, I have often wished for (and in the core dev team we have occasionally mused about) changes that would allow the bytecode compiler to generate e.g. a dedicated bytecode for len(x). (Not that that's a particularly compelling example. The "make module loading faster" use case seems much more important.)

@DinoV

DinoV commented Jun 3, 2021

On the mmap front, @markshannon and I had a short discussion here at PyCon, and I had explored a mmap'd data format which would combine multiple pyc files into a single file (an "icepack"). The focus there was more on trying to reduce the memory overhead of the bytecode between copy-on-write processes (see also https://bugs.python.org/issue36839). But I went ahead and pushed the code behind it to: https://github.com/facebookincubator/cinder/tree/cinder/3.8/Experiments/icepack - there's both a pure Python version as well as a C accelerator.

It's definitely oriented at having a pre-compilation step so it'd be more focused on improving the behavior of whole-app loading, and my focus wasn't really startup time but getting x-proc sharing. But I wanted to share it here in case anyone wanted to explore it further. There was also one additional possibility of reducing de-marshaling which is turning all of the strings into "legacy" strings where the string memory isn't allocated in-line, but I'm not sure if those have disappeared yet.

@gvanrossum
Collaborator

Interesting. I recall that Greg Stein in the early 2000s had the same idea of supporting the buffer protocol (or an earlier, similar protocol) for code objects, though his motivation was to have the code preloaded with the application (presumably in some read-only data segment). I actually thought that we once supported that, but it seems we chickened out at that time as well.

Incidentally the design of code objects as immutable also relates to this, but we never managed to get the entire code object in read-only memory because of the reference counts. (Immortal objects would have solved that, at a price.)

I like Mark's latest design for faster PYC loading (linked earlier) enough that I am trying to prototype it (doing the PYC-writing step in Python, skipping the separate metadata segment for now, and embedding the format in marshal so that I don't have to update the frozen importlib code). It's all premised on the hope that in a large app module, many functions are never called, so the code object never needs to be materialized ("rehydrated" seems to be the term). Neil's profiling numbers give me hope that reducing PYC load time can make a dent in startup time -- we just have to see how much.

(PS The idea of doing prototypes for this kind of stuff in Python comes from Cinder.)

@methane

methane commented Jun 6, 2021

Ah, good memory! History bears out that this existed before 3.7. I guess bad stuff would happen if the contents of the buffer was overwritten during execution.

There are some points we should consider if we really want to allow a buffer object for co_code:

  • Overhead to get bytes from objects
  • Code object is not GC-tracked for now.
  • Code object is hashable/immutable for now.

Additionally, the pros (startup time, or CoW memory) had not been proven. That's the main reason I was against it.
The Python eval loop doesn't incref/decref co_code. The number of bytes in PYC files is not equal to the number of bytes we can save with CoW.

@gvanrossum
Collaborator

Well, I have my first timings with my approximation of @markshannon's proposal, but it means more work is needed before we can prove it's faster. :-)

First, the raw numbers:

Classic load: 0.439
Classic exec: 0.110
New PYC load: 0.002
New PYC exec: 1.158

What this measures is the following:

  • A file with 100 functions
  • Each function has 100 nonsense lines (a, b = b, a)
  • For both classic and new modes:
    • Construct a PYC file for this file
    • Load it 1000 times
    • Exec the loaded code 1000 times

So I am successful in that the "load" phase is almost for free. (It just creates one dehydrated code object with a reference to the raw bytes of the PYC file.)

But the "hydration" phase is slow. The toplevel code looks roughly like this:

LAZY_LOAD_CONSTANT <n> (this is the code object)
LAZY_LOAD_CONSTANT <n> (this is the name of the function)
MAKE_FUNCTION 0
STORE_NAME <n>

And this is repeated 100 times. Executing this calls LAZY_LOAD_CONSTANT 200 times, for 200 different lazy objects -- half of them the 100 different code objects, half of them strings giving the 100 different function names. The 100 STORE_NAME instructions are lucky and find that the string representing the name has already been loaded by the preceding LAZY_LOAD_CONSTANT instruction.

I suspect that the real cost is in executing the bytecode for LAZY_LOAD_CONSTANT. Because we're missing the necessary infrastructure to just go execute some bytecode, I have to create a temporary code object and call it. That's likely too expensive, especially since each of those just executes these instructions:

MAKE_CODE_OBJECT <n>
RETURN_CONSTANT <n>

@gvanrossum
Collaborator

gvanrossum commented Jun 23, 2021

Okay, good news. After talking to @markshannon I got rid of the temporary code and frame objects. Instead, I am now using a lightweight "activation record" that saves and restores the following:

  • stack bottom, stack pointer, stack top
  • first instr, next instr

For allocation I use _PyThreadState_PushLocals which is supposedly very fast (why do we have so many "fast" allocators? :-), and the value stack is an array that is allocated contiguously with that. All other interpreter state (frame, code object, "specials") is unchanged when one of these "mini-subroutines" runs. The instruction array is not copied -- first_instr points directly into the PYC data array.

Exception handling is not implemented -- the only error lazy constant evaluation might encounter is a memory error. A robust implementation should pop all activation records when an exception happens.

For the micro-benchmark I mentioned above, the new approach is now roughly twice as fast as the classic (marshal-based) version. (code)

(UPDATE: The 2x number is for Windows with MSVC; on macOS with clang it's only ~30% faster.)

(UPDATE 2: Some time later it's only 30% faster with MSVC. Not sure what I saw before.)

(FINAL UPDATE: Mystery solved. Alas, the speedup I was measuring was due to the cost of an audit hook added by the test package -- marshal emits an audit event for every code object it reads. See comment below.)

@gvanrossum
Collaborator

The biggest remaining design issue is the following. In my prototype, which I based on Mark's design, there is a single array of lazily-loadable constants, and (separately) a single array of names. These correspond to co_consts and co_names, but the arrays (tuples, really) are shared between all code objects loaded from the same PYC file.

This doesn't scale so well -- in a large module you can easily have 1000s of constants or names, and whenever the size of either array exceeds 255, the corresponding instructions require prefixing with EXTENDED_ARG -- this makes the bytecode bulky and slow, and also it means we have to recompute all jump targets (in my prototype PYC writer I just give up when this happens, so I can only serialize toy modules).

A secondary problem with this is that code objects are now contained in their own co_consts array, making them part of reference cycles (and throwing dis.py for a loop when it tries to recursively find all code objects nested in a module's code object).

The simplest solution I can think of is to give each serialized code object two extra arrays of indexes, which map the per-code co_consts and co_names arrays to per-file arrays. At runtime there would then have to be separate per-file arrays and per-code-object arrays. This is somewhat inefficient space-wise, but the implementation is straightforward.

(If we were to skip the index arrays, we'd end up with the situation where if a name or constant is shared between many code objects, it would be hydrated many times, once per code object, and the resulting objects would not be shared unless interned.)

@iritkatriel
Collaborator

The simplest solution I can think of is to give each serialized code object two extra arrays of indexes, which map the per-code co_consts and co_names arrays to per-file arrays. At runtime there would then have to be separate per-file arrays and per-code-object arrays. This is somewhat inefficient space-wise, but the implementation is straightforward.

Maybe this can reduce redirection only to the case of shared objects:

Each code object has a dedicated sub-section of the module-level array (and knows its first index). Builder.add returns (index, redirected), where redirected = 0 if the index entry has the actual const/name, and 1 if that entry has the index of the actual data. So if Builder.add added the item at a new index, redirected = 0. If it reused an old index, it needs to check whether that index is within this code object's section or not (so maybe it needs first_index passed in). If it found an index from a previous code object, it adds that index to a new slot and returns this new slot's index with redirected = 1.

Then there needs to be an opcode that resolves the redirection and feeds the MAKE_* with the correct index. Or something along those lines.

Whether it's worth it depends how often there won't be a need for redirection.

@gvanrossum
Collaborator

Let's see, take this example:

eggs = 0
ham = 1
spam = 2
def egg_sandwich():
    return eggs, spam
def ham_sandwich():
    return ham + spam + spam

There are three names here, "eggs", "spam" and "ham", and only "spam" is shared. The disassembly for the functions would be something like this (constant indices are relative here):

# Disassembly of egg_sandwich()
LOAD_NAME 0 (eggs)
LOAD_NAME 1 (spam)
MAKE_TUPLE 2
RETURN_VALUE
# Disassembly of ham_sandwich()
LOAD_NAME 0 (ham)
LOAD_NAME 1 (spam)
BINARY_ADD
LOAD_NAME 1 (spam)
BINARY_ADD
RETURN_VALUE

The strings section has four entries:

# Section for egg_sandwich() starts here:
0: "eggs"
1: "spam"
# Section for ham_sandwich() starts here:
2: "ham"
3: redirect(1)

String entries 0, 1 and 2 have offsets directly into the binary data, where they find the length and encoded data for the strings. String entry 3 has a redirect instead of an offset into the binary data.

How to represent redirection? We can encode this by setting the lowest bit of the offset value, if we ensure that offsets into binary data are always even. This wastes the occasional byte but that seems minor. (Alternatively, the offset could be shifted left by one.)

Code to find the real offset would be something like this (compare to _PyHydra_UnicodeFromIndex()):

PyObject *
_PyHydra_UnicodeFromIndex(struct lazy_pyc *pyc, int index)
{
    if (0 <= index && index < pyc->n_strings) {
        uint32_t offset = pyc->string_offsets[index];
        if (offset & 1) {
            /* Low bit set: this entry redirects to another index. */
            index = offset >> 1;
            assert(0 <= index && index < pyc->n_strings);
            offset = pyc->string_offsets[index];
        }
        return _PyHydra_UnicodeFromOffset(pyc, offset);
    }
    PyErr_Format(PyExc_SystemError, "String index %d out of range", index);
    return NULL;
}

We then change the opcode for LOAD_NAME as follows:

        case TARGET(LOAD_NAME): {
            PyObject *name = GETITEM(names, oparg);
            if (name == NULL) {
                name = _PyHydra_LoadName(co->co_pyc, co->co_strings_start + oparg);  // **Changed**
                if (name == NULL) {
                    goto error;
                }
                Py_INCREF(name);  // **New**
                PyTuple_SET_ITEM(names, oparg, name);  // **New**
            }
            ...  // Rest is unchanged

Where co_strings_start is a new field in the code object that contains the start of the code object's strings section. In the example, for egg_sandwich() this would be 0, while for ham_sandwich() it would be 2.

Moreover we would need to update _PyHydra_LoadName() to first check if the string with the given index already exists in pyc->names:

name = PyTuple_GET_ITEM(pyc->names, index);
if (name != NULL) {
    Py_INCREF(name);
    return name;
}

So as to avoid constructing multiple copies of the same string ("spam" in our example).

Also, of course, when a code object is hydrated we have to set its co_names member to a freshly allocated tuple containing all NULL items, instead of the current code that sets it to a reference to pyc->names.

@iritkatriel
Copy link
Collaborator

How to represent redirection? We can encode this by setting the lowest bit of the offset value, if we ensure that offsets into binary data are always even. This wastes the occasional byte but that seems minor. (Alternatively, the offset could be shifted left by one.)

Either that, or (is this possible?) add an opcode "RESOLVE_REDIRECT index" which puts the real index on the stack, and then the oparg to the next opcode is something like -1, which means "pop it from the stack".

@gvanrossum
Copy link
Collaborator

We could do that for the MAKE_STRING opcode, which is only used in the "mini-subroutines" called by LAZY_LOAD_CONSTANT, but for LOAD_NAME and friends there is an issue: inserting extra opcodes requires recomputing all jump targets, and also the line table and the exception table, which I'd rather not do (especially not in this prototype).

We could define the MAKE_STRING opcode as always using the "absolute" index -- in fact, it has to be defined that way, since these mini-subroutines don't belong to any particular code object: if a constant is shared the mini-subroutine could be invoked from any of the code objects that reference it, or even from another mini-subroutine.

Since I generate the mini-subroutines from scratch and they don't have line or exception tables, having to use EXTENDED_ARG in these is not a problem.

The changes I suggested for co_names also need to be used for co_consts. We can use the same trick of requiring real offsets to be even. In fact, for code objects the offset has to be a multiple of 4, since we interpret the offset as a pointer to an array of uint32_t values, and the format claims that n-bytes values are aligned on multiples of n bytes. (I think I don't have any code in pyro.py to ensure this, so maybe I've been lucky, or maybe the alignment requirement is not hard on Intel CPUs, and it just slows things down a bit.)

(Somewhat unrelated, the encoding for strings is currently a varint giving the size of the encoded bytes, followed by that many bytes comprising the UTF-8-encoded value. This favors space over time. But maybe we ought to favor time over space here? We could store the size in code units as a uint32_t shifted left by 2, encoding the character width in the bottom 2 bits, followed by raw data bytes, either 1, 2 or 4 bytes per code unit. We might even have a spare bit to distinguish between ASCII and Latin-1 in the case where it's one byte per character. E.g.

  • 0 => ASCII
  • 1 => Latin-1
  • 2 => 2 bytes per code unit
  • 3 => 4 bytes per code unit

It's worth experimenting a bit here to see if this is actually faster, and we could scan the 100 or 4000 most popular PyPI packages to see what the total space wasted is compared to the current encoding. For a short ASCII string the wastage would be 3 bytes per string.)
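A toy Python model of that proposed layout (the header format and kind values are assumptions, just to make the idea concrete):

import struct

ASCII, LATIN1, UCS2, UCS4 = 0, 1, 2, 3  # assumed kind values in the low 2 bits

def encode_str(s):
    # Header: (length in code units) << 2 | kind, then the raw code units.
    if s.isascii():
        kind, data = ASCII, s.encode("ascii")
    elif max(map(ord, s)) < 0x100:
        kind, data = LATIN1, s.encode("latin-1")
    elif max(map(ord, s)) < 0x10000:
        kind, data = UCS2, s.encode("utf-16-le")
    else:
        kind, data = UCS4, s.encode("utf-32-le")
    return struct.pack("<I", (len(s) << 2) | kind) + data

def decode_str(b):
    (header,) = struct.unpack_from("<I", b)
    n, kind = header >> 2, header & 3
    width = (1, 1, 2, 4)[kind]
    codec = ("ascii", "latin-1", "utf-16-le", "utf-32-le")[kind]
    return b[4:4 + n * width].decode(codec)

assert decode_str(encode_str("héllo")) == "héllo"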

@iritkatriel
Collaborator

# Section for egg_sandwich() starts here:
0: "eggs"
1: "spam"
# Section for ham_sandwich() starts here:
2: "ham"
3: redirect(1)

String entries 0, 1 and 2 have offsets directly into the binary data, where they find the length and encoded data for the strings. String entry 3 has a redirect instead of an offsset into the binary data.

Wait -- if the 0,1,2 entries are just offsets into the binary data, then 3 might as well be that too, and just have the same offset as entry 1. No?

@gvanrossum
Collaborator

Wait -- if the 0,1,2 entries are just offsets into the binary data, then 3 might as well be that too, and just have the same offset as entry 1. No?

If you do it that way, egg_sandwich() and ham_sandwich() each end up with their own copy of "spam". Now, we should really intern these strings too, which would take care of that, so possibly this is okay. The marshal format shares interned strings. Any string used in LOAD_NAME etc. is worth hashing (these are identifiers that are almost certainly used as dict keys, either globals or builtins, or instance/class attributes).

The sharing may be more important for MAKE_STRING, which is also used for "data" strings that aren't interned (things that don't look like identifiers) but are still kept shared both by the compiler and by marshal. (Marshal has a complex system for avoiding writing the same constant multiple times; see w_ref() and r_ref() and friends.) However, MAKE_STRING is only used from mini-subroutines, so we can (and must) set oparg to the absolute index. (Mini-subroutines don't have their own co_names and co_consts arrays.)

So if there was a third function that had a string constant "spam", it would have a mini-subroutine invoked by LAZY_LOAD_CONSTANT, which would look like this:

MAKE_STRING 1 ("spam")
RETURN_CONSTANT n

We need to support redirects for LAZY_LOAD_CONSTANT too, and it can follow the same general principle (a per-code-object "relative" index and an "absolute" index).

But we'll need two versions of that one! It's used both from regular code objects (where the bytecode rewriter substitutes it for LOAD_CONST, and the index needs to be relative to the code object) and from mini-subroutines, where we need to use the absolute index.

@gvanrossum
Collaborator

gvanrossum commented Jun 28, 2021

Okay, so I am seeing something weird. When I run the "test_speed" subtest using the "test" package framework, it consistently reports a ~30% speedup, while when I run it using the unittest-based main(), it reports no speedup. Examples:

  • Running using the test framework:
PS C:\Users\gvanrossum\pyco> .\PCbuild\amd64\python -m test test_new_pyc -m test_speed
0:00:00 Run tests sequentially
0:00:00 [1/1] test_new_pyc
Starting speed test
Classic load: 0.380
Classic exec: 0.133
       Classic total: 0.514
New PYC load: 0.001
New PYC exec: 0.375
       New PYC total: 0.377
Classic-to-new ratio: 1.36 (new is 36% faster)

== Tests result: SUCCESS ==

1 test OK.

Total duration: 1.5 sec
Tests result: SUCCESS
  • Running using the unittest main():
PS C:\Users\gvanrossum\pyco> .\PCbuild\amd64\python -m test.test_new_pyc TestNewPyc.test_speed
Starting speed test
Classic load: 0.234
Classic exec: 0.110
       Classic total: 0.344
New PYC load: 0.001
New PYC exec: 0.354
       New PYC total: 0.355
Classic-to-new ratio: 0.97 (new is -3% faster)
.
----------------------------------------------------------------------
Ran 1 test in 0.925s

OK

I see similar differences on macOS.

At this point it's important to remind ourselves how the test is run. The code is here. (Note that it seems that it always uses marshal.loads(). This is correct: the code that recognizes the new format lives at the top level in that function.)

The big difference seems to be that the individual "classic" times reported are much lower when running using unittest.main() than when running using the test package. What could cause this? Is the test package maybe messing with GC configuration or some other system parameter?

@gvanrossum
Collaborator

Well, hm... In test/libregrtest/setup.py there's code that sets a dummy audit hook. If I comment that out, the classic running time using the test framework goes down to roughly what it is when using unittest.main(), and the "advantage" of the new code pretty much disappears. I consider this particular mystery solved.

gvanrossum added a commit to gvanrossum/cpython that referenced this issue Jun 29, 2021
It was causing an artificial slowdown in marshal.
See faster-cpython/ideas#32 (comment)
@gvanrossum
Collaborator

gvanrossum commented Jul 12, 2021

I discussed the status of my experiment so far with Mark this morning, and we agreed to pause development and try some other experiments before we go further down this road.

  • Experiment A: the "pyco" branch in my repo

  • Experiment B: "streamline and tweak"

    • See Faster startup -- Experiment B -- streamline and tweak #65
    • try to speed up unmarshalling of code objects by low-level optimizations
    • streamline code objects (fewer redundant, computed fields; fewer small objects)
    • opcodes for immediate values (MAKE_INT for 0-255, LOAD_COMMON_CONSTANT -- the latter extended with some more cheap-to-construct immutable constants, like "", (), maybe even -1..-5)
  • Experiment C: "lazy unmarshal"

    • See Faster startup -- Experiment C -- lazy unmarshalling #66
    • borrows ideas from Experiment A: delay hydrating code objects until called, keep PYC file in memory
    • but dehydration just calls back into marshal
    • this might mean fewer code changes compared to Experiment A, and doesn't need as many new opcodes

Probably it would be better to start writing up Experiments B and C in more detail in new issues here. I will get started with that.

@gvanrossum
Collaborator

We now also have

@indygreg

Fun fact: PyOxidizer has support for importing .py/.pyc content from a memory-mapped file using 0-copy, and the run-time code implementing that functionality is available on PyPI (https://pypi.org/project/oxidized-importer) and doesn't require the use of PyOxidizer. This extension module exposes an API (https://pyoxidizer.readthedocs.io/en/latest/oxidized_importer_api_reference.html). And it is even possible to create your own packed resources files from your own content (https://pyoxidizer.readthedocs.io/en/latest/oxidized_importer_freezing_applications.html). Conceptually it is similar to the zip importer, except a bit faster.

So if you wanted a quick way to eliminate the stdlib importlib path importer and its I/O overhead to see its effect on performance, oxidized_importer should enable you to do that.

I've also isolated unmarshaling as the perf hotspot when oxidized_importer is in use and have been interested in Facebook's solution of an alternate serialization/unmarshaling mechanism but haven't had time to explore that. From my Mercurial developer days, I can also say that the module-level exec() on import can also be problematic. I thought Facebook's solution had a mechanism to serialize some of those evaluation results to avoid that overhead at import time. But this would presumably need proper bytecode operations and possibly new language features to properly support.

@gvanrossum
Collaborator

gvanrossum commented Aug 28, 2021

Oh, here I thought PyOxidizer had solved my problem here already. But it's really closer to https://bugs.python.org/issue45020 -- it loads the marshal data in memory using 0-copy (not sure if that really doesn't copy anything ever, I am always wary of mmap, but I suspect it pays off if many processes share the same segment read-only).

But this issue is about eliminating marshal altogether, at least for those modules that are (nearly) always imported at startup. This would have helped Mercurial a bit. And yes, the next step would be eliminating the execution of the code, going directly to the collection of objects in the module dict after execution. But that's a much harder problem, since the results may depend on lots of context (e.g. os.environ, sys.argv, hash randomization, network, threads, other loaded modules, you name it). See Facebook's "Strict Modules" (https://github.com/facebookincubator/cinder#strict-modules) and/or "Static Python" (same link, next section).

@faster-cpython faster-cpython locked and limited conversation to collaborators Dec 2, 2021
@gramster gramster moved this to Todo in Fancy CPython Board Jan 10, 2022
@gramster gramster moved this from Todo to Other in Fancy CPython Board Jan 10, 2022
