FreeBSD and Solaris. See the ``subprocess`` section above for details.
(Contributed by Jakub Kulik in :gh:`113117`.)



.. _whatsnew313-jit-compiler:


Experimental JIT Compiler
=========================

:Editor: Guido van Rossum, Ken Jin

When CPython is configured using the ``--enable-experimental-jit`` build-time
option, a just-in-time compiler is added which can speed up some Python
programs. The internal architecture is roughly as follows.
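
On Unix-like systems, building such a configuration looks roughly like
this (the flag is the one named above; the other steps are the standard
CPython build steps)::

    ./configure --enable-experimental-jit
    make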


Intermediate Representation
---------------------------

We start with specialized *Tier 1 bytecode*.
See :ref:`What's new in 3.11 <whatsnew311-pep659>` for details.

When the Tier 1 bytecode gets hot enough, the interpreter creates
straight-line sequences of bytecode known as "traces", and translates
them to a new, purely internal *Tier 2 IR*, a.k.a. micro-ops ("uops").
These straight-line sequences can cross function call boundaries,
allowing more effective optimizations, listed in the next section.
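
For instance, in code like the following (a purely illustrative example),
a trace recorded for the hot loop can continue into the helper function
instead of stopping at the call::

    def decode(item):
        return item.decode("utf-8")

    def decode_all(items):
        out = []
        for item in items:              # a loop hot enough to be traced
            out.append(decode(item))    # the trace may continue into decode()
        return out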

The Tier 2 IR uses the same stack-based VM as Tier 1, but the
instruction format is better suited to translation to machine code.
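
For example, a single specialized Tier 1 instruction typically expands
into several micro-ops that separate the guard from the action (an
illustrative pairing; micro-op names are internal and subject to change)::

    # Tier 1 (specialized bytecode)        Tier 2 IR (micro-ops)
    BINARY_OP_ADD_INT              -->     _GUARD_BOTH_INT
                                           _BINARY_OP_ADD_INT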

(Tier 2 IR contributed by Mark Shannon and Guido van Rossum.)


Optimizations
-------------

We have several optimization and analysis passes for Tier 2 IR, which
are applied before the IR is interpreted or translated to machine code.
These passes take unoptimized Tier 2 IR and produce optimized Tier 2 IR:

* Type propagation -- through forward data-flow analysis, we infer
  and deduce information about types. In the future, this will allow us
  to eliminate much of the overhead associated with dynamic typing.

* Constant propagation -- through forward data-flow analysis, we can reduce
  expressions like ::

    a = 1
    b = 2
    c = a + b

  to ::

    a = 1
    b = 2
    c = 3

* Guard elimination -- through a combination of constant and type
  information, we can eliminate type checks and other guards associated
  with operations. These guards validate specialized operations but add
  a small amount of overhead. For example, integer addition needs a
  guard that checks that both operands are integers. As a proof of
  concept, we managed to eliminate over 70% of integer type checks in
  our own benchmarks. (A sketch of this optimization follows this list.)

* Loop splitting -- after the first iteration, we gain a lot more type
  information. Thus, we peel the first iteration of loops to produce
  an optimized body that exploits this additional type information.
  This also achieves a similar effect to an optimization called
  loop-invariant code motion, but only for guards. (Also sketched
  after this list.)

* Globals to constant promotion -- global value loads become constant
  loads, speeding them up and also allowing for more constant propagation.

* This section is non-exhaustive and will be updated with further
  optimizations, up until CPython 3.13's release.
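
To make the guard elimination example above concrete, consider the
following sketch (illustrative Python; the comments describe what the
optimizer can deduce, not actual compiler output)::

    def total(numbers):
        t = 0                       # t is known to be an int from here on
        for n in numbers:
            # Specialized int addition guards that both operands are ints.
            # Since t is already known to be an int, only n needs checking;
            # int + int keeps t an int on every later iteration.
            t = t + n
        return t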

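Loop splitting, likewise, can be pictured as a source-level transformation
(purely conceptual; the optimizer works on Tier 2 IR, not on Python
source)::

    def run(body, n):
        # Before peeling: every iteration executes with full guards.
        for i in range(n):
            body(i)

    def run_peeled(body, n):
        # After peeling (conceptual): the first iteration collects type
        # information; the remaining iterations run an optimized body.
        if n > 0:
            body(0)              # first iteration: guards run, types observed
            for i in range(1, n):
                body(i)          # later iterations: many guards eliminated
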
(Tier 2 optimizer contributed by Ken Jin, with implementation help
by Guido van Rossum, Mark Shannon, and Jules Poon. Special thanks
to Manuel Rigger and Martin Henz.)


Execution Engine
----------------

There are two execution engines for Tier 2 IR.

The first is the Tier 2 interpreter, which is mostly intended for debugging
the earlier stages of the optimization pipeline. If the JIT is not
enabled, the Tier 2 interpreter can be invoked by passing Python the
``-X uops`` option or by setting the ``PYTHON_UOPS`` environment
variable to ``1``.
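
For example, to run a script under the Tier 2 interpreter on a build
without the JIT::

    PYTHON_UOPS=1 python my_script.py

or, equivalently::

    python -X uops my_script.py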

The second is the JIT compiler. When the ``--enable-experimental-jit``
build-time option is used, the optimized Tier 2 IR is translated to machine
code, which is then executed. This does not require additional
runtime options.

The machine code translation process uses a technique called
*copy-and-patch*. It has no runtime dependencies, but there is a new
build-time dependency on LLVM. The main benefit of this technique is
fast compilation: the paper linked below reports it as orders of
magnitude faster than traditional compilation techniques. The code
produced is slightly less optimized, but suitable for a baseline JIT
compiler.
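
To give a feel for the idea, here is a deliberately simplified sketch in
pure Python (this is *not* CPython's implementation; the stencil table
and micro-op name are invented for illustration). Each micro-op gets a
pre-generated machine-code "stencil" containing holes, and compiling a
trace amounts to copying stencils and patching their holes with run-time
values::

    # Toy copy-and-patch: a stencil is template bytes plus a hole to patch.
    STENCILS = {
        # Invented name -> (template, hole offset). The template encodes
        # x86-64 "movabs rax, <imm64>"; its 8-byte immediate starts at byte 2.
        "_LOAD_CONST_TOY": (b"\x48\xb8" + b"\x00" * 8, 2),
    }

    def compile_trace(trace):
        code = bytearray()
        for uop, operand in trace:
            template, hole = STENCILS[uop]
            start = len(code)
            code += template                          # copy the stencil
            patched = operand.to_bytes(8, "little")   # patch the hole
            code[start + hole : start + hole + 8] = patched
        return bytes(code)

    machine_code = compile_trace([("_LOAD_CONST_TOY", 42)])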

(Copy-and-patch JIT compiler contributed by Brandt Bucher,
directly inspired by the paper
`Copy-and-Patch Compilation <https://fredrikbk.com/publications/copy-and-patch.pdf>`_
by Haoran Xu and Fredrik Kjolstad. For more information,
`a talk <https://youtu.be/HxSHIpEQRjs?si=RwC78FcXrThIgFmY>`_ is available.)


Results and Future Work
-----------------------

This section will be updated with final performance results before
CPython 3.13's release.

The JIT compiler is rather unoptimized, and serves as the foundation
for significant optimizations in future releases. As such, we do not
expect the first iteration of the JIT compiler to produce a significant
speedup.


About
-----

This work was done by the Faster CPython team and many external
contributors. The team consists of engineers from Microsoft, Meta,
Quansight, and Bloomberg, who are either paid in part to do this work
or who volunteer in their free time.


Deprecated