Improve startup time. #32
Comments
We also need to initialize some extension modules, sys, and builtins. |
If creating and/or destroying the new interpreter is what takes the time, then one possible way to speed it up is to freeze the object graph of a newly created interpreter. That would work roughly as follows:
We can now create the entire object graph for the interpreter by:
The above assumes that interpreters are fully independent. |
I've just remembered Running
|
I don't have any such info presently. I know there were investigations in the past that involved such profiling but I expect any resulting profiling data is outdated at this point. Regardless, it shouldn't take much effort to get at least some basic data. Furthermore, we can already get at least some insight by running |
Regarding Mark's strategy, can you explain what we would have to do when we need to change the object graph? IIUC, currently we change some of the modules that are frozen (_frozen_importlib and friends), and then regenerate the frozen bytecode (for which we have tools). I am worried that your item (3) makes this process more complicated. Otherwise, it sounds like a decent strategy (many data structures are already static webs of pointers; the new thing is that we allow static webs of …). And yes, we need to work on profiling startup. |
FTR, facebook/instagram has been using a related strategy, which they described at the language summit a couple years back. |
Please don't forget |
Mark has a nascent proposal |
This is only beneficial if the size of mapped files exceeds a certain threshold, which is fairly large (probably larger than 100 KiB by now). Below that, copying is faster. With small files, I expect a run-time penalty even after loading because the less dense packing of information (distributed across a larger number of pages) results in increased TLB misses. And by default, Linux will only allow 65,530 separate mappings per process. |
My intuition is that it will be hard to show that mmap is faster than copying here, so I would personally be happy if the first version just read (i.e., copied) the file into memory. |
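A quick way to sanity-check the copy-vs-mmap question for .pyc-sized files is a micro-benchmark like the sketch below; the file path and iteration count are made up, and this only measures the read/map step, not later access patterns or TLB effects.

```python
# Rough copy-vs-mmap micro-benchmark for a small file (e.g. a .pyc).
# PATH and the iteration count are placeholders.
import mmap
import time

PATH = "some_module.pyc"   # hypothetical file

def timed(label, fn, n=10_000):
    t0 = time.perf_counter()
    for _ in range(n):
        fn()
    print(f"{label}: {time.perf_counter() - t0:.3f}s for {n} iterations")

def read_copy():
    with open(PATH, "rb") as f:
        return f.read()

def map_file():
    with open(PATH, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[:16]   # touch a few bytes so the mapping is actually used

timed("read", read_copy)
timed("mmap", map_file)
```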
The point of the proposal is to avoid the overhead of unmarshalling, whether the file is copied or mapped isn't so important. The important thing is that the pyc format is position independent (no pointers), is immutable, and requires minimal decoding before being usable. |
Oh, that's really neat! |
I just discovered
The biggest cumulative time (4.4 ms) is the The next biggest cost is Of course, without Here's the raw data for that:
|
Some comments about this, based on work I did at previous core sprints. Mark's ideas seem good to me and are along the same lines I was thinking.

The unmarshal step is fairly expensive, and I think we could reduce that with a change to the code object layout. Currently, the VM expects code objects to be (mostly) composed of PyObjects. It doesn't need to be that way, and we could create a code object memory layout that's much cheaper to load from disk. I suggested something similar in spirit to Cap'n Proto. We probably want to keep the on-disk files machine independent.

The importlib overhead has been optimized over the years, so there is no big low-hanging fruit there. However, it seems wasteful how much work is done by importlib just to start up. I think we could dump all needed startup modules into a kind of optimized bundle (e.g. better optimized than a zipped package) and eliminate the importlib overhead: on startup, call a C function that unmarshals all the code in the bundle and then executes some bytecode to finish the startup. It doesn't need to be linked with the executable (as frozen modules are) but could be an external file, e.g. in … When Python starts, the interleaving of unmarshalling and executing code for each module may itself have a cost.

Another idea: try to reduce the amount of work done when executing top-level module code. Quite a bit of work has been done over the years to reduce startup time and therefore what's done at the top level of modules imported at startup, e.g. doing work the first time a function is called rather than at import time. However, maybe we can do better. My lazy top-level code execution idea was one approach; the major killer of that was that metaclasses can basically execute anything, and my AST walker couldn't find too much that could be lazily executed. I think Cinder has done something like this, but I don't know details (StrictModule annotation?). Another issue: doing something lazily doesn't help performance if you end up doing it anyhow; I think I was running into such an issue.

A twist on the lazy code execution is to just defer the loading/unmarshal of code objects of functions until the function is actually called. It seems likely that quite a few functions in modules loaded at startup are never actually executed. So, does it help just to never load the code for them? I had a prototype of this idea, but it didn't show much of a win. Might be worth revisiting. The original idea was inspired by something Larry did with PHP years ago.

Another idea: allow some code to be executed at module compile time, i.e. Python code that gets called by the compiler to compute objects that end up inside the .pyc. A simple example: assume my module needs to compute a dict that will become a constant global. It could be possible to do that work at compile time (rather than at import time) and then store the result of that computation in the .pyc.
Obviously there would be restrictions on what you can do at compile time. This would be an advanced feature and used carefully. Another place it could be used is to pre-compile regular expressions into something that can be quickly loaded at import time and still have good runtime performance. You can do something a bit like this today by having code that generates a .py file and then compiling that (a rough sketch of that workaround follows below). However, having it as a feature of the Python compiler would be nicer (aside from the language-complexity increase). Perhaps things like dataclasses could use the feature to do more work at compile time. |
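To make that existing workaround concrete, here is a minimal, hypothetical sketch: an expensive table is computed once by a build step and written into a generated module, so importing the resulting module only rebuilds a literal instead of redoing the computation. File names and the table itself are made up.

```python
# build_tables.py -- hypothetical build-time script (run once, before shipping).
# Importing the generated module then only evaluates a dict literal.
import pprint

def compute_table():
    # Stand-in for an expensive import-time computation.
    return {i: i * i for i in range(256)}

with open("_generated_tables.py", "w", encoding="utf-8") as f:
    f.write("# Auto-generated by build_tables.py; do not edit.\n")
    f.write(f"TABLE = {pprint.pformat(compute_table())}\n")
```

At import time the application just does `from _generated_tables import TABLE`; the compile-time feature discussed above would achieve the same effect without the separate code-generation step.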
Cap'n Proto's solution for this seems to be to declare that most modern CPUs are little-endian, so it uses that (plus fixed integer sizes of course). If that's what you meant, fine. (Can anyone assure me that Apple Silicon is little-endian? Is it really just a variant of ARM?) [...]
What caused the original importlib overhead? [...]
If that's true, frozen modules wouldn't strictly need to be part of the executable either, right? (But either way this wouldn't make things faster, it would just add more flexibility to the build process.) [...]
I'm guessing you're saying this approach would be faster because each stage blows away the I-cache of the previous stage (the D-cache is blown anyways since the data is different each time, I assume?), and if we alternate per module we spend a lot of time refreshing the I-cache from memory? But I find it hard to believe that the per-module time is so small that it would make much of a difference. [...]
Depends on the application code, right? If the application imports …

A premise of all these approaches is that unmarshalling code objects is slow. That should be easy enough to measure: e.g. compile everything in the stdlib, marshal.dumps() all the toplevel code objects (this will pull in the nested code objects too), and then time calling marshal.loads() on those strings a bunch (a rough sketch follows below). I've looked at this a tiny bit, and I'm thinking that either marshal.loads() isn't actually that slow, or it can be sped up by inlining a few key functions (e.g. r_long()).

My developing intuition is that a lot of other things happening at import time are actually slower, for example constructing classes (type objects are bulky) and various "init-time" initializations like initializing static tables, creating locks, and the like. Oh, and don't forget calling decorators and metaclasses! And random things like creating loggers. (Also I think someone just mentioned that enums are expensive to create?) I also just realized that there's a pattern that I like that's actually pretty inefficient -- having a package … |
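A minimal sketch of that measurement might look like the following; the module selection, repeat count, and output format are arbitrary choices, not the exact experiment described above.

```python
# Time marshal.loads() on the top-level code objects of stdlib modules.
import marshal
import pathlib
import sysconfig
import time

stdlib = pathlib.Path(sysconfig.get_path("stdlib"))
blobs = []
for path in sorted(stdlib.glob("*.py")):
    try:
        code = compile(path.read_text(encoding="utf-8"), str(path), "exec")
    except (SyntaxError, UnicodeDecodeError, ValueError):
        continue
    blobs.append(marshal.dumps(code))  # nested code objects are included

t0 = time.perf_counter()
for _ in range(10):
    for blob in blobs:
        marshal.loads(blob)
elapsed = time.perf_counter() - t0
print(f"unmarshalled {len(blobs)} modules x 10 in {elapsed:.3f}s")
```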
On my machine:
Maybe, slow pth files in your site-packages? |
What I was thinking is that at one extreme, a startup bundle could be totally machine dependent: something like writing out PyObject structures from memory to disk. Obviously you can't reload those without fixing up internal pointers, etc. Some dynamic languages used to work like that, e.g. Emacs and Smalltalk: do a sort of core dump and then use some ugly code to reload it and revive it as a running process. Fast startup, but non-portable and ugly. I don't recommend we try that approach. The other extreme is complete portability, like we have now with the pyc format. Would it be okay to make startup bundles machine dependent if we get some speed boost? The startup bundle would be closely associated with the Python executable, so it's not like you are putting them on PyPI or something. An example might be storing strings already in the in-memory representation that will be used on that OS/machine. I think it is not worth giving up machine independence right now.
When first developed, I think it was a pure Python implementation. Later, some hotspots were re-written as C code to reduce the overhead (e.g.
Yes, the main reason would be for better flexibility. A downside is that if the startup bundle is missing, the executable can't do much. That doesn't seem too terrible to me. You already need the stuff in
Yes that was my thought and I think you are right that it is unlikely to help in a significant way. The more useful profiling report would be the main reason to do it.
My experiment was quick and dirty. I still loaded the code object from disk into memory as a bytestring. I just deferred the unmarshal call on it. Not even reading the data would be faster (although you have a problem if the .pyc file goes away) but I didn't get to testing that. The Python startup modules are probably already refactored such that you don't load a bunch of functions you don't actually need.
I did some profiling with Linux "perf". I'm not sure I did it correctly though. I wrote a small C program that repeatedly fork/execs
Note that "Percent total" includes time spent in called functions. So, it seems unmarshal is not the dominate startup cost but probably worth some effort to optimize more. Putting things in a startup bundle could maybe eliminate the |
In case someone is interested in trying to reproduce the "perf" results, here are some instructions. You need to compile with …
or
Use the former if you have an Intel CPU that supports it, since the recordings are a lot smaller (like 10X smaller). Once done recording, you can generate a report like this:
My quick and dirty C program to run Python multiple times. Because
exit.py is just
The annotated source code feature is really neat. When on a C function, press the 'a' key to see annotated code. I like to push 't' until the left column shows |
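For a rough timing-only stand-in for the repeated-run driver described above (the original was a small C program that fork/execs Python repeatedly), something like the following Python sketch works; the flags and iteration count are arbitrary.

```python
# Launch a fresh interpreter many times and report the mean startup cost.
import subprocess
import sys
import time

N = 200
t0 = time.perf_counter()
for _ in range(N):
    subprocess.run([sys.executable, "-S", "-c", "pass"], check=True)
elapsed = time.perf_counter() - t0
print(f"{N} runs, {elapsed / N * 1000:.2f} ms per startup")
```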
Here's another report from Linux "perf". I hacked up my copy of Python to load startup module code from a single file; the unmarshal for all startup modules is done up-front as a batch, within … The set of startup modules is: …
Profile report is below. I don't know why the total time is not 100% or why
If the profiling can be trusted, |
If we want to reduce the unmarshal cost, the idea of storing code objects as static C structures seems worth investigating again. I dusted off Larry's earlier work on this; see https://github.com/nascheme/cpython/tree/static_frozen |
My notes on Python startup time: https://pythondev.readthedocs.io/startup_time.html

From time to time, I reduce the number of imports done at startup. Python 3.10: https://docs.python.org/dev/whatsnew/3.10.html#optimizations "The runpy module now imports fewer modules. The python3 -m module-name command startup time is 1.4x faster in average. On Linux, python3 -I -m module-name imports 69 modules on Python 3.9, whereas it only imports 51 modules (-18) on Python 3.10. (Contributed by Victor Stinner in bpo-41006 and bpo-41718.)"

pth files are really bad for startup time performance; I hope that one day we will be able to remove them. https://www.python.org/dev/peps/pep-0648/ may be a good replacement. |
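A trivial way to check the module count on your own build (combine with -I, -S, etc. as desired) is a sketch like this:

```python
# Count the modules already imported by the time user code starts running.
import sys
print(len(sys.modules), "modules imported at startup")
print(sorted(sys.modules))
```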
Thanks for your notes, @vstinner. You have already done so much work on so many of the problems I'm hoping to tackle! One interesting realization I just had is that improving PYC file load times will help all kinds of startup times -- not just of how fast we get to the '>>>' prompt or how fast we can get to run a small script, but also how fast a large webapp starts that loads 100s or 1000s of modules before being ready to handle its first request. For web backend developers that is a pretty important thing: edit your code, restart server, refresh browser, rinse and repeat... |
Yes, this is a significant developer-experience issue for us at Instagram.
StrictModules is a static analyzer (plus a bit of runtime support) that can validate conservatively that executing a module has no side effects outside that module. That means it doesn't trigger code execution from any non-strict module (via decorator, metaclass, …).

One idea is that strict modules could enable a more efficient form of "pyc" file that's more like a Java class file instead of a bytecode cache; instead of persisting bytecode to be executed to construct the module, it could persist metadata about top-level classes and functions and variables in the module (including bytecode for functions and methods, of course), potentially allowing for a more efficient "rehydration" of the module contents than can be achieved by executing bytecode. This is an idea we haven't really had time to validate or follow up on, though.

Ultimately, though, strict modules is a significant semantic change; it makes sense for large codebases, but I think many Python developers would find it uncomfortable. The analyzer is sophisticated enough that it still allows almost all forms of dynamism at module top level, as long as the effects are contained within the module, but many common patterns (particularly "registry" type uses of decorators/metaclasses/__init_subclass__) rely on external side effects of module import :/ So I doubt it's a productive path for generalized optimization of Python startup. |
Strict Modules are a fascinating subject, but I agree that the semantic changes make them hard to swallow -- especially in the context of "Faster CPython", where we're trying to bend over backwards to avoid semantic changes so that the entire community can benefit from the changes. It's a shame; I have often wished for (and in the core dev team we have occasionally mused about) changes that would allow the bytecode compiler to generate e.g. a dedicated bytecode for … |
On the … It's definitely oriented at having a pre-compilation step, so it'd be more focused on improving the behavior of whole-app loading, and my focus wasn't really startup time but getting cross-process sharing. But I wanted to share it here in case anyone wanted to explore it further. There was also one additional possibility for reducing de-marshaling, which is turning all of the strings into "legacy" strings where the string memory isn't allocated in-line, but I'm not sure if those have disappeared yet. |
Interesting. I recall that Greg Stein in the early 2000s had the same idea of supporting the buffer protocol (or an earlier, similar protocol) for code objects, though his motivation was to have the code preloaded with the application (presumably in some read-only data segment). I actually thought that we once supported that, but it seems we chickened out at that time as well. Incidentally the design of code objects as immutable also relates to this, but we never managed to get the entire code object in read-only memory because of the reference counts. (Immortal objects would have solved that, at a price.) I like Mark's latest design for faster PYC loading (linked earlier) enough that I am trying to prototype it (doing the PYC-writing step in Python, skipping the separate metadata segment for now, and embedding the format in marshal so that I don't have to update the frozen importlib code). It's all premised on the hope that in a large app module, many functions are never called, so the code object never needs to be materialized ("rehydrated" seems to be the term). Neil's profiling numbers give me hope that reducing PYC load time can make a dent in startup time -- we just have to see how much. (PS The idea of doing prototypes for this kind of stuff in Python comes from Cinder.) |
There are some points we should consider if we really want to allow buffer objects for co_code: …
Additionally, the pros (startup time, or CoW memory) had not been proven. That's the main reason I was against it. |
Well, I have my first timings with my approximation of @markshannon's proposal, but it means more work is needed before we can prove it's faster. :-) First, the raw numbers:
What this measures is the following:
So I am successful in that the "load" phase is almost for free. (It just creates one dehydrated code object with a reference to the raw bytes of the PYC file.) But the "hydration" phase is slow. The toplevel code looks roughly like this:
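The original snippet is not reproduced here; the following is a hedged reconstruction of its shape only: a module consisting of many small function definitions, so the toplevel bytecode is essentially a run of MAKE_FUNCTION/STORE_NAME pairs, each needing one code-object constant and one name.

```python
# Hedged reconstruction of the benchmark module's shape (not the actual file).
def f_0(a, b):
    return a + b
```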
And this repeated 100 times. Executing this calls LAZY_LOAD_CONSTANT 200 times, for 200 different lazy objects -- half of them the 100 different code objects, half of them strings giving the 100 different function names. The 100 STORE_NAME instructions are lucky and find that the string representing the name has already been loaded by the preceding LAZY_LOAD_CONSTANT instruction. I suspect that the real cost is in executing the bytecode for LAZY_LOAD_CONSTANT. Because we're missing the necessary infrastructure to just go execute some bytecode, I have to create a temporary code object and call it. That's likely too expensive, especially since each of those executes only a handful of instructions.
|
Okay, good news. After talking to @markshannon I got rid of the temporary code and frame objects. Instead, I am now using a lightweight "activation record" that saves and restores the following:
For allocation I use …

Exception handling is not implemented -- the only error lazy constant evaluation might encounter is a memory error. A robust implementation should pop all activation records when an exception happens.

For the micro-benchmark I mentioned above, the new approach is now roughly twice as fast as the classic (marshal-based) version. (code)

(UPDATE: The 2x number is for Windows with MSVC; on macOS with clang it's only ~30% faster.)

(UPDATE 2: Some time later it's only 30% faster with MSVC. Not sure what I saw before.)

(FINAL UPDATE: Mystery solved. Alas, the speedup I was measuring was due to the cost of an audit hook added by the test package -- marshal emits an audit event for every code object it reads. See comment below.) |
The biggest remaining design issue is the following. In my prototype, which I based on Mark's design, there is a single array of lazily-loadable constants and (separately) a single array of names. These correspond to co_consts and co_names, but the arrays (tuples, really) are shared between all code objects loaded from the same PYC file.

This doesn't scale so well -- in a large module you can easily have 1000s of constants or names, and whenever the size of either array exceeds 255, the corresponding instructions require prefixing with EXTENDED_ARG -- this makes the bytecode bulky and slow, and it also means we have to recompute all jump targets (in my prototype PYC writer I just give up when this happens, so I can only serialize toy modules). A secondary problem is that code objects are now contained in their own co_consts array, making them GC cycles (and throwing dis.py for a loop when it tries to recursively find all code objects nested in a module's code object).

The simplest solution I can think of is to give each serialized code object two extra arrays of indexes, which map the per-code co_consts and co_names arrays to per-file arrays. At runtime there would then have to be separate per-file arrays and per-code-object arrays. This is somewhat inefficient space-wise, but the implementation is straightforward. (If we were to skip the index arrays, we'd end up with the situation where if a name or constant is shared between many code objects, it would be hydrated many times, once per code object, and the resulting objects would not be shared unless interned.) |
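A toy illustration of the two-extra-index-arrays idea, borrowing the sandwich example that appears a couple of comments below (the array names here are invented): each code object keeps a small array that translates its own relative indices into indices of the shared per-file array, so shared entries are hydrated once and reused.

```python
# Per-file array shared by all code objects from one PYC file.
per_file_names = ["eggs", "spam", "ham"]

# Per-code-object mappings: relative index -> per-file index.
egg_sandwich_map = [0, 1]   # "eggs", "spam"
ham_sandwich_map = [2, 1]   # "ham", "spam" (shared entry reused)

def load_name(name_map, relative_index):
    return per_file_names[name_map[relative_index]]

assert load_name(ham_sandwich_map, 1) is load_name(egg_sandwich_map, 1)
```

Whether that indirection is worth its extra space is exactly the trade-off discussed in the following comments.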
Maybe this can reduce redirection only to the case of shared objects: Each code object has a dedicated sub-section of the module level array (and knows its first-index). Builder.add returns (index, redirected), where redirected= 0 if the index entry has the actual const/name, and it is 1 if that entry has the index of the actual data. So if Builder.add added the item in a new index, redirected=0. If it reused an old index, it needs to check whether the index is within this code object's section or not (so maybe it needs the first_index passed in). If it found an index from a previous code object, it adds that index to a new slot and returns this new slot's index with redirected=1. Then there needs to be an opcode that resolves the redirection and feeds the MAKE_* with the correct index. Or something along those lines. Whether it's worth it depends how often there won't be a need for redirection. |
Let's see, take this example:

```python
eggs = 0
ham = 1
spam = 2

def egg_sandwich():
    return eggs, spam

def ham_sandwich():
    return ham + spam + spam
```

There are three names here, "eggs", "spam" and "ham", and only "spam" is shared. In the disassembly, the functions use per-code-object (relative) constant and name indices.
The strings section has four entries: 0 ("eggs"), 1 ("spam"), 2 ("ham"), and 3 (a second reference to "spam"). String entries 0, 1 and 2 have offsets directly into the binary data, where they find the length and encoded data for the strings. String entry 3 has a redirect instead of an offset into the binary data.

How to represent redirection? We can encode this by setting the lowest bit of the offset value, if we ensure that offsets into binary data are always even. This wastes the occasional byte, but that seems minor. (Alternatively, the offset could be shifted left by one.) Code to find the real offset would be something like this (compare to _PyHydra_UnicodeFromIndex()):

```c
PyObject *
_PyHydra_UnicodeFromIndex(struct lazy_pyc *pyc, int index)
{
    if (0 <= index && index < pyc->n_strings) {
        uint32_t offset = pyc->string_offsets[index];
        if (offset & 1) {
            /* Redirect: the payload is the index of the real entry. */
            index = offset >> 1;
            assert(0 <= index && index < pyc->n_strings);
            offset = pyc->string_offsets[index];
        }
        return _PyHydra_UnicodeFromOffset(pyc, offset);
    }
    PyErr_Format(PyExc_SystemError, "String index %d out of range", index);
    return NULL;
}
```
We then change the opcode for LOAD_NAME as follows:

```c
case TARGET(LOAD_NAME): {
    PyObject *name = GETITEM(names, oparg);
    if (name == NULL) {
        name = _PyHydrate_LoadName(co->co_pyc, co->co_strings_start + oparg); // **Changed**
        if (name == NULL) {
            goto error;
        }
        Py_INCREF(name);                       // **New**
        PyTuple_SET_ITEM(names, oparg, name);  // **New**
    }
    ... // Rest is unchanged
```

where co_strings_start would be the index in the per-file string table at which this code object's entries begin. Moreover, we would need to update _PyHydra_LoadName() to first check whether the string with the given index already exists in pyc->names:

```c
name = PyTuple_GET_ITEM(pyc->names, index);
if (name != NULL) {
    Py_INCREF(name);
    return name;
}
```

so as to avoid constructing multiple copies of the same string ("spam" in our example). Also, of course, when a code object is hydrated we have to set its … |
Either that, or (is this possible?) add an opcode "RESOLVE_REDIRECT index" which puts the real index on the stack, and then the oparg to the next opcode is something like -1, which means "pop it from the stack". |
We could do that for the MAKE_STRING opcode, which is only used in the "mini-subroutines" called by LAZY_LOAD_CONSTANT, but for LOAD_NAME and friends there is an issue: inserting extra opcodes requires recomputing all jump targets, and also the line table and the exception table, which I'd rather not do (especially not in this prototype).

We could define the MAKE_STRING opcode as always using the "absolute" index -- in fact, it has to be defined that way, since these mini-subroutines don't belong to any particular code object: if a constant is shared, the mini-subroutine could be invoked from any of the code objects that reference it, or even from another mini-subroutine. Since I generate the mini-subroutines from scratch and they don't have line or exception tables, having to use EXTENDED_ARG in these is not a problem.

The changes I suggested for co_names also need to be used for co_consts. We can use the same trick of requiring real offsets to be even. In fact, for code objects the offset has to be a multiple of 4, since we interpret the offset as a pointer to an array of …

(Somewhat unrelated: the encoding for strings is currently a varint giving the size of the encoded bytes, followed by that many bytes comprising the UTF-8-encoded value. This favors space over time. But maybe we ought to favor time over space here? We could store the size in code units as a uint32_t shifted left by 2, encoding the character width in the bottom 2 bits, followed by raw data bytes, either 1, 2 or 4 bytes per code unit. We might even have a spare bit to distinguish between ASCII and Latin-1 in the case where it's one byte per character.)
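As a rough illustration of that header layout (the helper names and exact bit assignments here are made up, not taken from the prototype):

```python
# Sketch: a fixed-width string header -- a little-endian uint32 whose low 2 bits
# encode the character width (1, 2 or 4 bytes per code unit) and whose remaining
# bits give the length in code units; the raw code units follow.
import struct

_WIDTH_BITS = {1: 0, 2: 1, 4: 2}   # one bit pattern left spare (e.g. ASCII vs Latin-1)

def encode_string(s: str) -> bytes:
    maxchar = max(map(ord, s), default=0)
    width = 1 if maxchar < 0x100 else 2 if maxchar < 0x10000 else 4
    header = struct.pack("<I", (len(s) << 2) | _WIDTH_BITS[width])
    data = b"".join(ord(c).to_bytes(width, "little") for c in s)
    return header + data

print(encode_string("spam").hex())   # header 0x00000010 (length 4, width 1), then b'spam'
```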
It's worth experimenting a bit here to see if this is actually faster, and we could scan the 100 or 4000 most popular PyPI packages to see what the total space wasted is compared to the current encoding. For a short ASCII string the wastage would be 3 bytes per string. |
Wait -- if the 0,1,2 entries are just offsets into the binary data, then 3 might as well be that too, and just have the same offset as entry 1. No? |
If you do it that way, egg_sandwich() and ham_sandwich() each end up with their own copy of "spam". Now, we should really intern these strings too, which would take care of that, so possibly this is okay. The marshal format shares interned strings. Any string used in LOAD_NAME etc. is worth hashing (these are identifiers that are almost certainly used as dict keys, either globals or builtins, or instance/class attributes).

The sharing may be more important for MAKE_STRING, which is also used for "data" strings that aren't interned (things that don't look like identifiers) but are still kept shared both by the compiler and by marshal. (Marshal has a complex system for avoiding writing the same constant multiple times; see w_ref() and r_ref() and friends.) However, MAKE_STRING is only used from mini-subroutines, so we can (and must) set oparg to the absolute index. (Mini-subroutines don't have their own co_names and co_consts arrays.) So if there were a third function that had a string constant "spam", it would have a mini-subroutine, invoked by LAZY_LOAD_CONSTANT, that simply does a MAKE_STRING with the absolute index of "spam" and returns the result.
We need to support redirects for LAZY_LOAD_CONSTANT too, and it can follow the same general principle (a per-code-object "relative" index and an "absolute" index). But we'll need two versions of that one! It's used both from regular code objects (where the bytecode rewriter substitutes it for LOAD_CONST, and the index needs to be relative to the code object) and from mini-subroutines, where we need to use the absolute index. |
Okay, so I am seeing something weird. When I run the "test_speed" subtest using the "test" package framework, it consistently reports a ~30% speedup, while when I run it using the unittest-based main(), it reports no speedup. Examples:
I see similar differences on macOS. At this point it's important to remind us how the test is run. The code is here. (Note that it seems that it always uses marshal.loads(). This is correct: the code that recognizes the new format lives at the top level in that function.) The big difference seems to be that the individual "classic" times reported are much lower when running using unittest.main() than when running using the test package. What could cause this? Is the test package maybe messing with GC configuration or some other system parameter? |
Well, hm... In test/libregrtest/setup.py there's code that sets a dummy audit hook. If I comment that out, the classic running time using the test framework goes down to roughly what it is when using unittest.main(), and the "advantage" of the new code pretty much disappears. I consider this particular mystery solved. |
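The effect is easy to demonstrate in isolation; the snippet below is an illustrative sketch (the repeat counts and test source are arbitrary), not the actual test_speed benchmark.

```python
# Show that even a no-op audit hook slows down marshal.loads(), since
# unmarshalling raises an audit event for every code object it creates.
import marshal
import sys
import time

source = "\n".join(f"def f_{i}(x):\n    return x + {i}" for i in range(50))
blob = marshal.dumps(compile(source, "<bench>", "exec"))

def bench(n=2000):
    t0 = time.perf_counter()
    for _ in range(n):
        marshal.loads(blob)
    return time.perf_counter() - t0

print(f"without audit hook: {bench():.3f}s")
sys.addaudithook(lambda event, args: None)   # note: audit hooks cannot be removed
print(f"with a dummy audit hook: {bench():.3f}s")
```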
It was causing an artificial slowdown in marshal. See faster-cpython/ideas#32 (comment)
I discussed the status of my experiment so far with Mark this morning, and we agreed to pause development and try some other experiments before we go further down this road.
Probably it would be better to start writing up Experiments B and C in more detail in new issues here. I will get started with that. |
We now also have
|
Fun fact: PyOxidizer has support for importing .py/.pyc content from a memory-mapped file using 0-copy, and the run-time code implementing that functionality is available on PyPI (https://pypi.org/project/oxidized-importer) and doesn't require the use of PyOxidizer. This extension module exposes an API (https://pyoxidizer.readthedocs.io/en/latest/oxidized_importer_api_reference.html). And it is even possible to create your own packed resources files from your own content (https://pyoxidizer.readthedocs.io/en/latest/oxidized_importer_freezing_applications.html). Conceptually it is similar to the zip importer, except a bit faster. So if you wanted a quick way to eliminate the stdlib …

I've also isolated unmarshaling as the perf hotspot when … |
Oh, here I thought PyOxidizer had already solved my problem here. But it's really closer to https://bugs.python.org/issue45020 -- it loads the marshal data in memory using 0-copy (not sure if that really doesn't copy anything ever; I am always wary of mmap, but I suspect it pays off if many processes share the same segment read-only). This issue, though, is about eliminating marshal altogether, at least for those modules that are (nearly) always imported at startup. This would have helped Mercurial a bit.

And yes, the next step would be eliminating the execution of the code, going directly to the collection of objects in the module dict after execution. But that's a much harder problem, since the results may depend on lots of context (e.g. os.environ, sys.argv, hash randomization, network, threads, other loaded modules, you name it). Compare Facebook's "Strict Modules" (https://github.com/facebookincubator/cinder#strict-modules) and/or "Static Python" (same link, next section). |
The original issue description:

Before we attempt this, we need to know where the time is spent.

@ericsnowcurrently Do you have any profiling information from running python -S -c "print('Hi')", or a similar script? It takes about 10 ms on my machine for 3.9, which seems a long time to print "Hi". For comparison, 2.7 takes about 7 ms. I suspect there is a lot of incidental stuff going on that we can eliminate.

The tasks that have to be performed to run python -S -c "print('Hi')" are:
- compile print('Hi')
- execute print('Hi')

None of those tasks should take long, so what is slow?

This issue was moved to a discussion. You can continue the conversation there.