diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index aa116678b19..5c010e8027f 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -622,6 +622,7 @@ peps/pep-0740.rst @dstufft peps/pep-0741.rst @vstinner peps/pep-0742.rst @JelleZijlstra peps/pep-0743.rst @vstinner +peps/pep-0744.rst @brandtbucher # ... # peps/pep-0754.rst # ... diff --git a/peps/pep-0744.rst b/peps/pep-0744.rst new file mode 100644 index 00000000000..3ea5ffb308d --- /dev/null +++ b/peps/pep-0744.rst @@ -0,0 +1,568 @@ +PEP: 744 +Title: JIT Compilation +Author: Brandt Bucher +Status: Draft +Type: Informational +Created: 11-Apr-2024 +Python-Version: 3.13 + +Abstract +======== + +Earlier this year, an `experimental "just-in-time" compiler +`_ was merged into CPython's +``main`` development branch. While recent CPython releases have included other +substantial internal changes, this addition represents a particularly +significant departure from the way CPython has traditionally executed Python +code. As such, it deserves wider discussion. + +This PEP aims to summarize the design decisions behind this addition, the +current state of the implementation, and future plans for making the JIT a +permanent, non-experimental part of CPython. It does *not* seek to provide a +comprehensive overview of *how* the JIT works, instead focusing on the +particular advantages and disadvantages of the chosen approach, as well as +answering many questions that have been asked about the JIT since its +introduction. + +Readers interested in learning more about the new JIT are encouraged to consult +the following resources: + +- The `presentation `_ which first introduced the + JIT at the 2023 CPython Core Developer Sprint. It includes relevant + background, a light technical introduction to the "copy-and-patch" technique + used, and an open discussion of its design amongst the core developers + present. + +- The `open access paper `_ originally + describing copy-and-patch. 
+
+- The `blog post `_ by the
+  paper's author detailing the implementation of a copy-and-patch JIT compiler
+  for Lua. While this is a great low-level explanation of the approach, note
+  that it also incorporates other techniques and makes implementation decisions
+  that are not particularly relevant to CPython's JIT.
+
+- The `implementation <#reference-implementation>`_ itself.
+
+Motivation
+==========
+
+Until this point, CPython has always executed Python code by compiling it to
+bytecode, which is interpreted at runtime. This bytecode is a more-or-less
+direct translation of the source code: it is untyped, and largely unoptimized.
+
+Since the Python 3.11 release, CPython has used a "specializing adaptive
+interpreter" (:pep:`659`), which `rewrites these bytecode instructions in-place
+`_ with type-specialized versions as they run.
+This new interpreter delivers significant performance improvements, despite the
+fact that its optimization potential is limited by the boundaries of individual
+bytecode instructions. It also collects a wealth of new profiling information:
+the types flowing through a program, the memory layout of particular objects,
+and what paths through the program are being executed the most. In other words,
+*what* to optimize, and *how* to optimize it.
+
+Since the Python 3.12 release, CPython has generated this interpreter from a
+`C-like domain-specific language
+`_ (DSL). In
+addition to taming some of the complexity of the new adaptive interpreter, the
+DSL also allows CPython's maintainers to avoid hand-writing tedious boilerplate
+code in many parts of the interpreter, compiler, and standard library that must
+be kept in sync with the instruction definitions. This ability to generate large
+amounts of runtime infrastructure from a single source of truth is not only
+convenient for maintenance; it also unlocks many possibilities for expanding
+CPython's execution in new ways. For instance, it makes it feasible to
+automatically generate tables for translating a sequence of instructions into an
+equivalent sequence of smaller "micro-ops", generate an optimizer for sequences
+of these micro-ops, and even generate an entire second interpreter for executing
+them.
+
+In fact, since early in the Python 3.13 release cycle, all CPython builds have
+included this exact micro-op translation, optimization, and execution machinery.
+However, it is disabled by default; the overhead of interpreting even optimized
+traces of micro-ops is just too large for most code. Heavier optimization
+probably won't improve the situation much either, since any efficiency gains
+made by new optimizations will likely be offset by the interpretive overhead of
+even smaller, more complex micro-ops.
+
+The most obvious strategy to overcome this new bottleneck is to statically
+compile these optimized traces. This presents opportunities to avoid several
+sources of indirection and overhead introduced by interpretation. In particular,
+it allows the removal of dispatch overhead between micro-ops (by replacing a
+generic interpreter with a straight-line sequence of hot code), instruction
+decoding overhead for individual micro-ops (by "burning" the values or addresses
+of arguments, constants, and cached values directly into machine instructions),
+and memory traffic (by moving data off of heap-allocated Python frames and into
+physical hardware registers).
+
+Since much of this data varies even between identical runs of a program and the
+existing optimization pipeline makes heavy use of runtime profiling information,
+it doesn't make much sense to compile these traces ahead of time. As has been
+demonstrated for many other dynamic languages (`and even Python itself
+`_), the most promising approach is to compile the
+optimized micro-ops "just in time" for execution.
+
+Rationale
+=========
+
+Despite their reputation, JIT compilers are not magic "go faster" machines.
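The "burning" of runtime values directly into pre-compiled machine code, as described above, can be pictured with a small, purely illustrative Python sketch of the copy-and-patch idea: a pre-built template (a "stencil") containing a placeholder "hole" is copied, and the hole is overwritten with a concrete value. All names and byte sequences below are hypothetical; the real stencils are machine code emitted by compiling a C template with Clang at build time.

```python
import struct

# A hypothetical pre-compiled "stencil": opcode bytes surrounding an
# 8-byte placeholder (the "hole") where a runtime value is patched in.
HOLE = b"\xde\xad\xbe\xef\xde\xad\xbe\xef"  # 8-byte sentinel

def make_stencil(prefix: bytes, suffix: bytes) -> bytes:
    """Build a fake stencil: some instruction bytes around one hole."""
    return prefix + HOLE + suffix

def copy_and_patch(stencil: bytes, value: int) -> bytes:
    """Copy the stencil and patch its hole with a concrete immediate."""
    assert stencil.count(HOLE) == 1, "expected exactly one hole"
    patched = struct.pack("<Q", value)  # little-endian 64-bit value
    return stencil.replace(HOLE, patched)

# "Compile" one micro-op: patch a (made-up) movabs/jmp template so that
# it loads the constant 0x1000 instead of the sentinel bytes.
stencil = make_stencil(b"\x48\xb8", b"\xff\xe0")
code = copy_and_patch(stencil, 0x1000)
print(code.hex())  # -> 48b80010000000000000ffe0
```

A real trace is compiled by concatenating patched copies of one stencil per micro-op, which is why no compiler needs to run at JIT time.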
+Developing and maintaining any sort of optimizing compiler for even a single +platform, let alone all of CPython's most popular supported platforms, is an +incredibly complicated, expensive task. Using an existing compiler framework +like LLVM can make this task simpler, but only at the cost of introducing heavy +runtime dependencies and significantly higher JIT compilation overhead. + +It's clear that successfully compiling Python code at runtime requires not only +high-quality Python-specific optimizations for the code being run, *but also* +quick generation of efficient machine code for the optimized program. The Python +core development team has the necessary skills and experience for the former (a +middle-end tightly coupled to the interpreter), and copy-and-patch compilation +provides an attractive solution for the latter. + +In a nutshell, copy-and-patch allows a high-quality template JIT compiler to be +generated from the same DSL used to generate the rest of the interpreter. For a +widely-used, volunteer-driven project like CPython, this benefit cannot be +overstated: CPython's maintainers, by merely editing the bytecode definitions, +will also get the JIT backend updated "for free", for *all* JIT-supported +platforms, at once. This is equally true whether instructions are being added, +modified, or removed. + +Like the rest of the interpreter, the JIT compiler is generated at build time, +and has no runtime dependencies. It supports a wide range of platforms (see the +`Support`_ section below), and has comparatively low maintenance burden. In all, +the current implementation is made up of about 900 lines of build-time Python +code and 500 lines of runtime C code. + +Specification +============= + +The JIT will become non-experimental once all of the following conditions are +met: + +#. It provides a meaningful performance improvement for at least one popular + platform (realistically, on the order of 5%). + +#. 
It can be built, distributed, and deployed with minimal disruption. + +#. The Steering Council, upon request, has determined that it would provide more + value to the community if enabled than if disabled (considering tradeoffs + such as maintenance burden, memory usage, or the feasibility of alternate + designs). + +These criteria should be considered a starting point, and may be expanded over +time. For example, discussion of this PEP may reveal that additional +requirements (such as multiple committed maintainers, a security audit, +documentation in the devguide, support for out-of-process debugging, or a +runtime option to disable the JIT) should be added to this list. + +Until the JIT is non-experimental, it should *not* be used in production, and +may be broken or removed at any time without warning. + +Once the JIT is no longer experimental, it should be treated in much the same +way as other build options such as ``--enable-optimizations`` or ``--with-lto``. +It may be a recommended (or even default) option for some platforms, and release +managers *may* choose to enable it in official releases. + +Support +------- + +The JIT has been developed for all of :pep:`11`'s current tier one platforms, +most of its tier two platforms, and one of its tier three platforms. +Specifically, CPython's ``main`` branch has `CI +`_ +building and testing the JIT for both release and debug builds on: + +- ``aarch64-apple-darwin/clang`` + +- ``aarch64-pc-windows/msvc`` [#untested]_ + +- ``aarch64-unknown-linux-gnu/clang`` [#emulated]_ + +- ``aarch64-unknown-linux-gnu/gcc`` [#emulated]_ + +- ``i686-pc-windows-msvc/msvc`` + +- ``x86_64-apple-darwin/clang`` + +- ``x86_64-pc-windows-msvc/msvc`` + +- ``x86_64-unknown-linux-gnu/clang`` + +- ``x86_64-unknown-linux-gnu/gcc`` + +It's worth noting that some platforms, even future tier one platforms, may never +gain JIT support. 
This can be for a variety of reasons, including insufficient +LLVM support (``powerpc64le-unknown-linux-gnu/gcc``), inherent limitations of +the platform (``wasm32-unknown-wasi/clang``), or lack of developer interest +(``x86_64-unknown-freebsd/clang``). + +Once JIT support for a platform is added (meaning, the JIT builds successfully +without displaying warnings to the user), it should be treated in much the same +way as :pep:`11` prescribes: it should have reliable CI/buildbots, and JIT +failures on tier one and tier two platforms should block releases. Though it's +not necessary to update :pep:`11` to specify JIT support, it may be helpful to +do so anyway. Otherwise, a list of supported platforms should be maintained in +`the JIT's README +`_. + +Since it should always be possible to build CPython without the JIT, removing +JIT support for a platform should *not* be considered a backwards-incompatible +change. However, if it is reasonable to do so, the normal deprecation process +should be followed as outlined in :pep:`387`. + +The JIT's build-time dependencies may be changed between releases, within +reason. + +Backwards Compatibility +======================= + +Due to the fact that the current interpreter and the JIT backend are both +generated from the same specification, the behavior of Python code should be +completely unchanged. In practice, observable differences that have been found +and fixed during testing have tended to be bugs in the existing micro-op +translation and optimization stages, rather than bugs in the copy-and-patch +step. + +Debugging +--------- + +Tools that profile and debug Python code will continue to work fine. This +includes in-process tools that use Python-provided functionality (like +``sys.monitoring``, ``sys.settrace``, or ``sys.setprofile``), as well as +out-of-process tools that walk Python frames from the interpreter state. 
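As a minimal illustration of the in-process functionality mentioned above, a trace function registered via ``sys.settrace`` keeps receiving ``line`` events regardless of how the underlying code is executed. The snippet below is a generic example (not CPython test code) that counts the line events fired while a function runs:

```python
import sys

def count_lines(func):
    """Run func() under a trace function that counts 'line' events."""
    events = []

    def tracer(frame, event, arg):
        if event == "line":
            events.append(frame.f_lineno)
        return tracer  # keep tracing inside this frame

    sys.settrace(tracer)
    try:
        result = func()
    finally:
        sys.settrace(None)  # always uninstall the tracer
    return result, len(events)

def work():
    total = 0
    for i in range(3):
        total += i
    return total

result, line_events = count_lines(work)
print(result, line_events)  # 3, and a positive number of line events
```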
+
+However, it appears that profilers and debuggers *for C code* are currently
+unable to trace back through JIT frames. Working with leaf frames is possible
+(this is how the JIT itself is debugged), though it is of limited utility due to
+the absence of proper debugging information for JIT frames.
+
+Since the code templates emitted by the JIT are compiled by Clang, it *may* be
+possible to allow JIT frames to be traced through by simply modifying the
+compiler flags to use frame pointers more carefully. It may also be possible to
+harvest and emit the debugging information produced by Clang. Neither of these
+ideas has been explored very deeply.
+
+While this is an issue that *should* be fixed, fixing it is not a particularly
+high priority at this time. This is probably a problem best explored by somebody
+with more domain expertise in collaboration with those maintaining the JIT, who
+have little experience with the inner workings of these tools.
+
+Security Implications
+=====================
+
+This JIT, like any JIT, produces large amounts of executable data at runtime.
+This introduces a potential new attack surface to CPython, since a malicious
+actor capable of influencing the contents of this data is therefore capable of
+executing arbitrary code. This is a `well-known vulnerability
+`_ of JIT
+compilers.
+
+In order to mitigate this risk, the JIT has been written with best practices in
+mind. In particular, the data in question is not exposed by the JIT compiler to
+other parts of the program while it remains writable, and at *no* point is the
+data both |wx|_.
+
+.. Apparently this is how you hack together a formatted link:
+
+.. |wx| replace:: writable *and* executable
+.. _wx: https://en.wikipedia.org/wiki/W%5EX
+
+The nature of template-based JITs also seriously limits the kinds of code that
+can be generated, further reducing the likelihood of a successful exploit. As an
+additional precaution, the templates themselves are stored in static, read-only
+memory.
+
+However, it would be naive to assume that no possible vulnerabilities exist in
+the JIT, especially at this early stage. The author is not a security expert,
+but is available to join or work closely with the Python Security Response Team
+to triage and fix security issues as they arise.
+
+Apple Silicon
+--------------
+
+Though difficult to test without actually signing and packaging a macOS release,
+it *appears* that macOS releases should `enable the JIT Entitlement for the
+Hardened Runtime
+`_.
+
+This shouldn't make *installing* Python any harder, but may add additional steps
+for release managers to perform.
+
+How to Teach This
+=================
+
+Choose the sections that best describe you:
+
+- **If you are a Python programmer or end user...**
+
+  - ...nothing changes for you. Nobody should be distributing JIT-enabled
+    CPython interpreters to you while it is still an experimental feature. Once
+    it is non-experimental, you will probably notice slightly better performance
+    and slightly higher memory usage. You shouldn't be able to observe any other
+    changes.
+
+- **If you maintain third-party packages...**
+
+  - ...nothing changes for you. There are no API or ABI changes, and the JIT is
+    not exposed to third-party code. You shouldn't need to change your CI
+    matrix, and you shouldn't be able to observe differences in the way your
+    packages work when the JIT is enabled.
+
+- **If you profile or debug Python code...**
+
+  - ...nothing changes for you. All Python profiling and tracing functionality
+    remains.
+
+- **If you profile or debug C code...**
+
+  - ...currently, the ability to trace *through* JIT frames is limited. This may
+    cause issues if you need to observe the entire C call stack, rather than
+    just "leaf" frames. See the `Debugging`_ section above for more information.
+
+- **If you compile your own Python interpreter...**
+
+  - ...if you don't wish to build the JIT, you can simply ignore it. Otherwise,
+    you will need to `install a compatible version of LLVM
+    `_, and
+    pass the appropriate flag to the build scripts. Your build may take up to a
+    minute longer. Note that the JIT should *not* be distributed to end users or
+    used in production while it is still in the experimental phase.
+
+- **If you're a maintainer of CPython (or a fork of CPython)...**
+
+  - **...and you change the bytecode definitions or the main interpreter
+    loop...**
+
+    - ...in general, the JIT shouldn't be much of an inconvenience to you
+      (depending on what you're trying to do). The micro-op interpreter isn't
+      going anywhere, and still offers a debugging experience similar to what
+      the main bytecode interpreter provides today. There is a moderate
+      likelihood that larger changes to the interpreter (such as adding new
+      local variables, changing error handling and deoptimization logic, or
+      changing the micro-op format) will require changes to the C template used
+      to generate the JIT, which is meant to mimic the main interpreter loop.
+      You may also occasionally just get unlucky and break JIT code generation,
+      which will require you to either modify the Python build scripts yourself,
+      or solicit the help of somebody more familiar with them (see below).
+
+  - **...and you work on the JIT itself...**
+
+    - ...you hopefully already have a decent idea of what you're getting
+      yourself into. You will be regularly modifying the Python build scripts,
+      the C template used to generate the JIT, and the C code that actually
+      makes up the runtime portion of the JIT. You will also be dealing with
+      all sorts of crashes, stepping over machine code in a debugger, staring at
+      COFF/ELF/Mach-O dumps, developing on a wide range of platforms, and
+      generally being the point of contact for the people changing the bytecode
+      when CI starts failing on their PRs (see above). Ideally, you're at least
+      *familiar* with assembly, have taken a couple of courses with "compilers"
+      in their name, and have read a blog post or two about linkers.
+
+  - **...and you maintain other parts of CPython...**
+
+    - ...nothing changes for you. You shouldn't need to develop locally with JIT
+      builds. If you choose to do so (for example, to help reproduce and triage
+      JIT issues), your builds may take up to a minute longer each time the
+      relevant files are modified.
+
+Reference Implementation
+========================
+
+Key parts of the implementation include:
+
+- |readme|_: Instructions for how to build the JIT.
+
+- |jit|_: The entire runtime portion of the JIT compiler.
+
+- |jit_stencils|_: An example of the JIT's generated templates.
+
+- |template|_: The code which is compiled to produce the JIT's templates.
+
+- |targets|_: The code to compile and parse the templates at build time.
+
+.. |readme| replace:: ``Tools/jit/README.md``
+.. _readme: https://github.com/python/cpython/blob/main/Tools/jit/README.md
+
+.. |jit| replace:: ``Python/jit.c``
+.. _jit: https://github.com/python/cpython/blob/main/Python/jit.c
+
+.. |jit_stencils| replace:: ``jit_stencils.h``
+.. _jit_stencils: https://gist.github.com/brandtbucher/9d3cc396dcb15d13f7e971175e987f3a
+
+.. |template| replace:: ``Tools/jit/template.c``
+.. _template: https://github.com/python/cpython/blob/main/Tools/jit/template.c
+
+.. |targets| replace:: ``Tools/jit/_targets.py``
+.. _targets: https://github.com/python/cpython/blob/main/Tools/jit/_targets.py
+
+Rejected Ideas
+==============
+
+Maintain it outside of CPython
+------------------------------
+
+While it is *probably* possible to maintain the JIT outside of CPython, its
+implementation is tied tightly enough to the rest of the interpreter that
+keeping it up-to-date would probably be more difficult than actually developing
+the JIT itself. Additionally, contributors working on the existing micro-op
+definitions and optimizations would need to modify and build two separate
+projects to measure the effects of their changes under the JIT (whereas today,
+infrastructure exists to do this automatically for any proposed change).
+
+Releases of the separate "JIT" project would probably also need to correspond to
+specific CPython pre-releases and patch releases, depending on exactly what
+changes are present. Individual CPython commits between releases likely wouldn't
+have corresponding JIT releases at all, further complicating debugging efforts
+(such as bisection to find breaking changes upstream).
+
+Since the JIT is already quite stable, and the ultimate goal is for it to be a
+non-experimental part of CPython, keeping it in ``main`` seems to be the best
+path forward. With that said, the relevant code is organized in such a way that
+the JIT can be easily "deleted" if it does not end up meeting its goals.
+
+Turn it on by default
+---------------------
+
+On the other hand, some have suggested that the JIT should be enabled by default
+in its current form.
+
+Again, it is important to remember that a JIT is not a magic "go faster"
+machine; currently, the JIT is about as fast as the existing specializing
+interpreter. This may sound underwhelming, but it is actually a fairly
+significant achievement, and it's the main reason why this approach was
+considered viable enough to be merged into ``main`` for further development.
+ +While the JIT provides significant gains over the existing micro-op interpreter, +it isn't yet a clear win when always enabled (especially considering its +increased memory consumption and additional build-time dependencies). That's the +purpose of this PEP: to clarify expectations about the objective criteria that +should be met in order to "flip the switch". + +At least for now, having this in ``main``, but off by default, seems to be a +good compromise between always turning it on and not having it available at all. + +Support multiple compiler toolchains +------------------------------------ + +Clang is specifically needed because it's the only C compiler with support for +guaranteed tail calls (|musttail|_), which are required by CPython's +`continuation-passing-style +`_ approach +to JIT compilation. Without it, the tail-recursive calls between templates could +result in unbounded C stack growth (and eventual overflow). + +.. |musttail| replace:: ``musttail`` +.. _musttail: https://clang.llvm.org/docs/AttributeReference.html#musttail + +Since LLVM also includes other functionalities required by the JIT build process +(namely, utilities for object file parsing and disassembly), and additional +toolchains introduce additional testing and maintenance burden, it's convenient +to only support one major version of one toolchain at this time. + +Compile the base interpreter's bytecode +--------------------------------------- + +Most of the prior art for copy-and-patch uses it as a fast baseline JIT, whereas +CPython's JIT is using the technique to compile optimized micro-op traces. + +In practice, the new JIT currently sits somewhere between the "baseline" and +"optimizing" compiler tiers of other dynamic language runtimes. This is because +CPython uses its specializing adaptive interpreter to collect runtime profiling +information, which is used to detect and optimize "hot" paths through the code. 
+This step is carried out using self-modifying code, a technique which is much
+more difficult to implement with a JIT compiler.
+
+While it's *possible* to compile normal bytecode using copy-and-patch (in fact,
+early prototypes predated the micro-op interpreter and did exactly this), it
+just doesn't seem to provide as much optimization potential as the more granular
+micro-op format.
+
+Add GPU support
+---------------
+
+The JIT is currently CPU-only. It does not, for example, offload NumPy array
+computations to CUDA GPUs, as JITs like `Numba
+`_ do.
+
+There is already a rich ecosystem of tools for accelerating these sorts of
+specialized tasks, and CPython's JIT is not intended to replace them. Instead,
+it is meant to improve the performance of general-purpose Python code, which is
+less likely to benefit from deeper GPU integration.
+
+Open Issues
+===========
+
+Speed
+-----
+
+Currently, the JIT is `about as fast as the existing specializing interpreter
+`_
+on most platforms. Improving this is obviously a top priority at this point,
+since providing a significant performance gain is the entire motivation for
+having a JIT at all. A number of proposed improvements are already underway, and
+this ongoing work is being tracked in `GH-115802
+`_.
+
+Memory
+------
+
+Because it allocates additional memory for executable machine code, the JIT does
+use more memory than the existing interpreter at runtime. According to the
+official benchmarks, the JIT currently uses about `10-20% more memory than the
+base interpreter
+`_.
+The upper end of this range is due to ``aarch64-apple-darwin``, which has larger
+page sizes (and thus, a larger minimum allocation granularity).
+
+However, these numbers should be taken with a grain of salt, as the benchmarks
+themselves don't actually have a very high baseline of memory usage. Since they
+have a higher ratio of code to data, the JIT's memory overhead is more
+pronounced than it would be in a typical workload where memory pressure is more
+likely to be a real concern.
+
+Not much effort has been put into optimizing the JIT's memory usage yet, so
+these numbers likely represent a maximum that will be reduced over time.
+Improving this is a medium priority, and is being tracked in `GH-116017
+`_.
+
+Earlier versions of the JIT had a more complicated memory allocation scheme
+which imposed a number of fragile limitations on the size and layout of the
+emitted code, and significantly bloated the memory footprint of the Python
+executable. These issues are no longer present in the current design.
+
+Dependencies
+------------
+
+Building the JIT adds between 3 and 60 seconds to the build process, depending
+on platform. It is only rebuilt whenever the generated files become out-of-date,
+so only those who are actively developing the main interpreter loop will be
+rebuilding it with any frequency.
+
+Unlike many other generated files in CPython, the JIT's generated files are not
+tracked by Git. This is because they contain compiled binary code templates
+specific to not only the host platform, but also the current build configuration
+for that platform. As such, hosting them would require a significant engineering
+effort in order to build and host dozens of large binary files for each commit
+that changes the generated code. While perhaps feasible, this is not a priority,
+since installing the required tools is not prohibitively difficult for most
+people building CPython, and the build step is not particularly time-consuming.
+
+Since some still remain interested in this possibility, discussion is being
+tracked in `GH-115869 `_.
+
+Footnotes
+=========
+
+.. [#untested] Due to lack of available hardware, the JIT is built, but not
+   tested, for this platform.
+
+.. [#emulated] Due to lack of available hardware, the JIT is built using
+   cross-compilation and tested using hardware emulation for this platform. Some
+   tests are skipped because emulation causes them to fail. However, the JIT has
+   been successfully built and tested for this platform on non-emulated
+   hardware.
+
+Copyright
+=========
+
+This document is placed in the public domain or under the CC0-1.0-Universal
+license, whichever is more permissive.