gh-114863: What's new in Python 3.13: JIT compiler #114862

FreeBSD and Solaris. See the ``subprocess`` section above for details.
(Contributed by Jakub Kulik in :gh:`113117`.)

.. _whatsnew313-jit-compiler:

Experimental JIT Compiler
=========================

:Editor: Guido van Rossum, Ken Jin

When CPython is configured using the ``--enable-experimental-jit`` build-time
option, a just-in-time compiler is added which can speed up some Python
programs. The internal architecture is roughly as follows.


Intermediate Representation
---------------------------

We start with specialized *Tier 1 bytecode*.
See :ref:`What's new in 3.11 <whatsnew311-pep659>` for details.
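
As a rough illustration, the specialized Tier 1 bytecode of a function can
be inspected with :func:`dis.dis` once the function has warmed up (the
exact warm-up threshold is an internal detail and may change)::

   import dis

   def add(a, b):
       return a + b

   # Run the function enough times for the specializing (Tier 1)
   # interpreter to replace generic instructions with specialized ones.
   for _ in range(1000):
       add(1, 2)

   # adaptive=True shows the specialized bytecode instead of the
   # original instructions.
   dis.dis(add, adaptive=True)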

When the Tier 1 bytecode gets hot enough, the interpreter creates
straight-line sequences of bytecode known as "traces", and translates them
to a new, purely internal *Tier 2 IR*, a.k.a. micro-ops ("uops").
These straight-line sequences can cross function call boundaries,
allowing more effective optimizations, listed in the next section.
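
For example, in a hot loop like the following (an illustrative snippet,
not taken from the CPython sources), the trace can continue through the
call to ``inc``, so the loop body and the callee are optimized as one
straight-line sequence::

   def inc(x):
       return x + 1

   def count_up(n):
       total = 0
       for _ in range(n):
           # The projected trace can cross this call boundary.
           total = inc(total)
       return total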

The Tier 2 IR uses the same stack-based VM as Tier 1, but the
instruction format is better suited to translation to machine code.

(Tier 2 IR contributed by Mark Shannon and Guido van Rossum.)


Optimizations
-------------

We have several optimization and analysis passes for Tier 2 IR, which
are applied before Tier 2 IR is interpreted or translated to machine code.
These optimizations take unoptimized Tier 2 IR and produce optimized Tier 2
IR:

* Type propagation -- through forward data-flow analysis, we infer
  and deduce information about types. This allows us to eliminate
  much of the overhead associated with dynamic typing in the future.

* Constant propagation -- through forward data-flow analysis, we can reduce
  expressions like ::

     a = 1
     b = 2
     c = a + b

  to ::

     a = 1
     b = 2
     c = 3

* Guard elimination -- through a combination of constant and type
  information, we can eliminate type checks and other guards associated
  with operations. These guards validate specialized operations, but add
  a slight bit of overhead. For example, integer addition needs a type
  check that verifies both operands are integers (see the sketch after
  this list). As a proof of concept, we managed to eliminate over 70% of
  integer type checks in our own benchmarks.

* Loop splitting -- after the first iteration, we gain a lot more type
  information. Thus, we peel the first iteration of loops to produce
  an optimized body that exploits this additional type information.
  This also achieves a similar effect to an optimization called
  loop-invariant code motion, but only for guards.

* Globals to constants promotion -- global value loads become constant
  loads, speeding them up and also allowing for more constant propagation.

* This section is non-exhaustive and will be updated with further
  optimizations, up until CPython 3.13's release.
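
To make guard elimination concrete, here is a rough sketch in Python
(the interpreter itself is written in C; the helper names here are
purely illustrative) of the check that type information lets us drop::

   class Deoptimize(Exception):
       """Signals a fall back to the unspecialized bytecode."""

   def guarded_int_add(left, right):
       # The guard: validate the specialization's assumptions.
       if type(left) is not int or type(right) is not int:
           raise Deoptimize
       return int.__add__(left, right)

   def unguarded_int_add(left, right):
       # When type propagation proves both operands are ints,
       # the guard above can be eliminated entirely.
       return int.__add__(left, right)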

(Tier 2 optimizer contributed by Ken Jin, with implementation help
by Guido van Rossum, Mark Shannon, and Jules Poon. Special thanks
to Manuel Rigger and Martin Henz.)


Execution Engine
----------------

There are two execution engines for Tier 2 IR.

The first is the Tier 2 interpreter, which is mostly intended for debugging
the earlier stages of the optimization pipeline. If the JIT is not
enabled, the Tier 2 interpreter can be invoked by passing Python the
``-X uops`` option or by setting the ``PYTHON_UOPS`` environment
variable to ``1``.
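
For example, a script can launch a child interpreter with the Tier 2
interpreter enabled (a sketch; the child's output is simply whatever
the child script prints)::

   import os
   import subprocess
   import sys

   # Equivalent to running: PYTHON_UOPS=1 python -c "..."
   env = dict(os.environ, PYTHON_UOPS="1")
   subprocess.run(
       [sys.executable, "-c", "print('running under the Tier 2 interpreter')"],
       env=env,
       check=True,
   )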

The second is the JIT compiler. When the ``--enable-experimental-jit``
build-time option is used, the optimized Tier 2 IR is translated to machine
code, which is then executed. This does not require additional
runtime options.

The machine code translation process uses a technique called
*copy-and-patch*. It has no runtime dependencies, but there is a new
build-time dependency on LLVM. The main benefit of this technique is
fast compilation, reported as orders of magnitude faster than
traditional compilation techniques in the paper linked below. The code
produced is slightly less optimized, but suitable for a baseline JIT
compiler.
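
The core idea can be sketched in a few lines of Python (a toy model, not
how the real JIT is implemented): pre-built "stencils" of machine code
contain holes that are patched with runtime values, so compiling a trace
is mostly copying and patching rather than code generation::

   HOLE = object()

   # Each stencil is a pre-built template with holes for operands.
   STENCILS = {
       "LOAD_CONST": ("push", HOLE),
       "ADD": ("add",),
   }

   def compile_trace(uops):
       """Copy each uop's stencil, patching holes with its operand."""
       code = []
       for name, operand in uops:
           for word in STENCILS[name]:
               code.append(operand if word is HOLE else word)
       return code

   print(compile_trace([("LOAD_CONST", 1), ("LOAD_CONST", 2), ("ADD", None)]))
   # ['push', 1, 'push', 2, 'add']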

(Copy-and-patch JIT compiler contributed by Brandt Bucher,
directly inspired by the paper
`Copy-and-Patch Compilation <https://fredrikbk.com/publications/copy-and-patch.pdf>`_
by Haoran Xu and Fredrik Kjolstad. For more information,
`a talk <https://youtu.be/HxSHIpEQRjs?si=RwC78FcXrThIgFmY>`_ is available.)


Results and Future Work
-----------------------

The final performance results will be updated before CPython 3.13's release.

The JIT compiler is rather unoptimized, and serves as the foundation
for significant optimizations in future releases. As such, we do not
expect the first iteration of the JIT compiler to produce a significant
speedup.


About
-----

This work was done by the Faster CPython team and many other external
contributors. The team consists of engineers from Microsoft, Meta,
Quansight, and Bloomberg, who are either paid in part to do this work
or volunteer in their free time.

Deprecated