Superinstructions for Copy & Patch JIT #647
Thanks for exploring this, @JeffersGlass! In the interest of getting everyone on the same page, I'll repeat one of my comments from Discord:
As you've noted here, handling this in the JIT introduces quite a bit of complexity, and I'm still sort of leaning towards making superinstructions into their own uops, since we already have the necessary machinery to generate an interpreter loop containing some concatenated uops (and it also benefits the tier two interpreter). We would just need to tweak the tier two instruction format a bit. Quoting you now:
Then let's not allow them to be any depth! Something like 4 should be more than enough to get started. As a quick experiment, I just prototyped something here: https://godbolt.org/z/7ccP5offd. No need to generate any new files. :)
I think this is a great next step. Once we have lists of common pairs, we can start evaluating sequences that are good candidates for combining.
Let's not get too hung up on lookup speed right now, and assume it's a solved problem. There are many potential options available to us (double-lookup, binary search, hash table, etc.).
Don't worry, we have dedicated benchmarking infrastructure to both collect stats and measure performance on a bunch of platforms.
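To make the hash-table option above concrete, here's a minimal Python sketch of the idea (illustrative only: the real lookup would live in C inside the optimizer/JIT, and the uop and superinstruction IDs below are made up):

```python
# Hypothetical uop and superinstruction IDs; real IDs come from the generated uop headers.
_LOAD_FAST, _GUARD_BOTH_INT, _BINARY_OP_ADD_INT = 10, 11, 12
_LOAD_FAST_PLUS_GUARD_BOTH_INT = 500

# Hash-table option: map an adjacent pair of uop IDs to a superinstruction ID.
SUPER_TABLE = {
    (_LOAD_FAST, _GUARD_BOTH_INT): _LOAD_FAST_PLUS_GUARD_BOTH_INT,
}

def select(trace: list[int], i: int) -> tuple[int, int]:
    """Return (opcode_to_emit, uops_consumed) for the uop at trace[i]."""
    if i + 1 < len(trace):
        fused = SUPER_TABLE.get((trace[i], trace[i + 1]))
        if fused is not None:
            return fused, 2   # emit the superinstruction's stencil
    return trace[i], 1        # fall back to the single-uop stencil

print(select([_LOAD_FAST, _GUARD_BOTH_INT, _BINARY_OP_ADD_INT], 0))  # -> (500, 2)
```

The double-lookup and binary-search options trade the dictionary for a per-first-opcode table or a sorted pair array, but the interface stays the same.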
My branch at JeffersGlass/pystats-uop-pairs now has functional tracking of adjacent UOp pairs in executors. The results are also output as part of the pystats results, in a collapsed table titled "Pair counts for top 100 uop pairs".
Currently, this branch works in the JIT by building a call to a new stats-counting function into the JIT template. That works in the JIT'd case, and I've built in what I think is necessary to run it sans-JIT, but I realize I don't know how to build/run the Tier 2 interpreter without the JIT. Related: is there a simple way to run pyperformance locally with pystats enabled?

If it seems like this is on the right track, I can work on turning it into a PR. I may need a little guidance on code organization; I splashed the new function and its declaration in a little haphazardly. That's in addition to figuring out how to make the stats call an "optional" part of the template.

After that, I think a script to "score" pairs/sequences of UOps (i.e. how much shorter they are when compiled together vs. separately) could be useful and interesting to build.

Thanks again for your support and time, I'm having a swell time getting to know the JIT internals better.
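As an aside for readers following along, the pair tracking described above boils down to a counter over adjacent opcodes in each executor's trace. A rough Python equivalent (not the actual pystats implementation; the uop names in the example traces are made up):

```python
from collections import Counter

def count_uop_pairs(executors: list[list[str]]) -> Counter:
    """Count adjacent (first, second) uop pairs across a set of executor traces."""
    pairs = Counter()
    for trace in executors:
        pairs.update(zip(trace, trace[1:]))
    return pairs

# Stand-in traces in place of real tier-two executors.
traces = [
    ["_CHECK_VALIDITY", "_LOAD_FAST", "_LOAD_FAST", "_BINARY_OP_ADD_INT"],
    ["_CHECK_VALIDITY", "_LOAD_FAST", "_LOAD_CONST", "_BINARY_OP_ADD_INT"],
]
for (first, second), n in count_uop_pairs(traces).most_common(3):
    print(f"{first:>20} -> {second:<20} {n}")
```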
As a brief experiment, I ran (local) pyperformance on both the current 3.13.0a3 build and my experimental jitted version above with the 96 opcode pairs listed in the previous post. In general, the jitted version is just a little slower, though that's to be expected given the caveats we've been talking about, and the fact that most of these superinstructions probably don't help much. A few specific benchmarks were faster, though.

Stats for … (collapsed)
Could you please rerun the command with table output?
Gladly! That is surely easier to read. Results are collapsed below. Looks like overall a ~5% slowdown, with some significant outliers. I would guess the faster ones involve some specific opcode pairs from the list above in a useful way, enough to overcome the overhead of the additional compilation steps/superinstruction lookup.

pyperf stats as table (collapsed; per-tag results for 'apps', 'asyncio', 'math', 'regex', 'serialize', 'startup', 'template', and all benchmarks. Benchmarks hidden because not significant (12): bench_thread_pool, xml_etree_process, pickle, scimark_lu, coroutines, telco, create_gc_cycles, typing_runtime_protocols, scimark_sor, async_tree_eager_io, async_tree_eager_io_tg, python_startup_no_site)
I took a preliminary pass at a tool for scoring sequences of UOps. Here are the results for the ~93 most-common valid pairs from above, comparing the length of each combined stencil to the sum of the lengths of its parts (collapsed table: "UOp sequence scores from above (top 93 valid pairs)").
This list uses "length of code_body" as a proxy for "speed", which isn't going to be exactly correct. I.e., I don't expect a condensed superinstruction that's 70% of the length of the sum of its parts to be 70% faster. But it's a simple heuristic to start with. This is all on x86_64 Linux, using my branch above.

It's fun to see which pair comes out on top there.

I will get that tool tidied up as well. Now that the main JIT branch has merged into main, that's a relatively straightforward PR, if having the tool in main would be generally useful. If not, I can publish it as a separate tool.

Also, I know I've gone on a bit of a tear here this week. If this is more spamming than useful, I'm happy to move these ideas/projects elsewhere.
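For concreteness, the scoring heuristic described here reduces to something like the following (the lengths are hypothetical, not real measurements from the tool):

```python
def score_pair(combined_len: int, first_len: int, second_len: int) -> tuple[float, int]:
    """Return (relative size, absolute saving) for a fused pair of stencils."""
    separate = first_len + second_len
    return combined_len / separate, separate - combined_len

# Hypothetical code_body lengths for a fused stencil and its two components.
ratio, saved = score_pair(combined_len=70, first_len=40, second_len=60)
print(f"fused stencil is {ratio:.0%} of the separate stencils, saving {saved} units")
```

The same two numbers give the "percentage difference" and "absolute difference" orderings used in the later scoring tables.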
Wait, I just realised something about what you were benchmarking against.

PS: main just got the JIT merged in, and it is currently faster without the JIT than with the JIT, due to micro-op overhead. So that explains your 5% slowdown. Actually, if it's only 5%, that's a huge improvement: it should be 7-10% slower. Which means your change might have caused a ~2% speedup over the current JIT!
I maintain the benchmarking infrastructure for the team, so I'm happy to answer any questions related to pyperformance etc.

@JeffersGlass wrote:

> Is there a simple way to run pyperformance locally with pystats enabled?

If you have a build with --enable-pystats, …

As for running the Tier 2 interpreter without the JIT: there's no configure flag in this case, but you need a non-JIT build. The easiest way to run this with pyperformance is with the …

Also, if you ever want to have a branch run on the official infrastructure, just ping a member of the Microsoft team on Discord -- it's very easy, but unfortunately we can't make it "self-serve on the open web" for security reasons.
Thank you @mdboom for the information, it's been very helpful. I will gladly take you and the team up on running some benchmarks on official infrastructure once things stabilize a bit; I'd be curious how much difference running with, say, the top ~1000 most-promising opcode sequences makes.

Apologies for not responding sooner. As I mentioned to Brandt, I'm moving house at the moment, and it's significantly cutting into my spare time to dig into this. That said, I am still working on getting pairs/triples/sequences of UOp counts into pystats. I have a branch (uop-sequence-count) that successfully achieves this (for the non-JIT Tier 2 interpreter only, for now). Only two things remain before I submit it as a PR: making the maximum sequence length adjustable by an environment variable, and re-adding functionality to …
Thanks for working on this. Having stats for tier 2 pairs would be really useful.

I don't think we want to implement superinstructions yet, as it will complicate register allocation and generating multiple stencils for instructions like … When we do want superinstructions, in the not-too-far future, having the stats will help a lot.
Thanks @markshannon, that PR is now live. It includes the ability to track pairs, triples, or sequences of any length, but it defaults to only counting pairs.

10-4 on holding off on implementing superinstructions for a bit. With the ability to collect stats, I'd like to continue to play around and see what results from adding longer chains of superinstructions, and what the consequences are for performance/size. I'll post results here as they come along.
Another thing I'd be interested in seeing (and that we may be able to incorporate sooner) is common superinstructions that can be formed without changing the existing instruction format. Meaning, there is at most one each of oparg, operand, and target used across the whole sequence. For example, …

(A related, but hairier, question would be identifying pairs with at most one unique value for each member. So …)
I've been doing a little work around this idea:
Here's what the output could look like, if I understand the requirements correctly (which I may not have). The results are in the collapsed table "Pair counts for top 100 uop pairs".
The conditions I think I understand for whether each UOp uses the three input kinds (oparg, operand, target) are: …
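A sketch of what that format-compatibility filter might look like, assuming we can get, for each uop, whether it reads oparg, operand, or target (the UopInfo fields below are placeholders, not the real cases-generator metadata):

```python
from dataclasses import dataclass

@dataclass
class UopInfo:
    name: str
    uses_oparg: bool = False     # placeholder flags; the real data would come
    uses_operand: bool = False   # from the uop metadata / cases generator
    uses_target: bool = False

def format_compatible(first: UopInfo, second: UopInfo) -> bool:
    """True if the pair could keep the existing instruction format:
    at most one user of each of oparg, operand and target across both uops."""
    return (first.uses_oparg + second.uses_oparg <= 1
            and first.uses_operand + second.uses_operand <= 1
            and first.uses_target + second.uses_target <= 1)

a = UopInfo("_LOAD_FAST", uses_oparg=True)
b = UopInfo("_GUARD_BOTH_INT", uses_target=True)   # illustrative flags only
print(format_compatible(a, b))  # True: no field is needed by both uops
```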
I propose that we automatically generate templates for all permutations of uops within a single macro as well. Here's an example: …
Some stuff can be eliminated by guard elimination. A simple heuristic would be: if …, then we should fuse them and make the superinstructions …

The second condition is crucial -- because we know this chain is not speculative, it is guaranteed to occur. From the table above, there is indeed a commonly occurring permutation: …
To find the longest chain at runtime, we can automatically generate another abstract interpreter from the macro definition that finds the longest matching chain on the first occurrence of an instruction it sees, and replaces it. I can work on this part if y'all are keen on the idea.

Reasoning:
Currently the optimizer doesn't optimize all instructions. Just as a figurative stat, only about 30% of the upper bound … The macro with the highest uop count is …
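To illustrate the replacement half of this proposal, the pass over a trace could be a greedy longest-match: at each position, try the longest known fused sequence first. A Python sketch with made-up uop and supernode names (a real version would walk _PyUOpInstruction arrays and respect operand constraints):

```python
# Known fused sequences mapped to a made-up supernode name.
SUPERNODES = {
    ("_GUARD_BOTH_INT", "_BINARY_OP_ADD_INT", "_STORE_FAST"): "_GUARD_ADD_STORE",
    ("_GUARD_BOTH_INT", "_BINARY_OP_ADD_INT"): "_GUARD_ADD",
}
MAX_LEN = max(len(seq) for seq in SUPERNODES)

def fuse(trace: list[str]) -> list[str]:
    """Greedily replace the longest matching chain at each position."""
    out, i = [], 0
    while i < len(trace):
        for length in range(min(MAX_LEN, len(trace) - i), 1, -1):
            node = SUPERNODES.get(tuple(trace[i:i + length]))
            if node is not None:
                out.append(node)
                i += length
                break
        else:
            out.append(trace[i])
            i += 1
    return out

print(fuse(["_LOAD_FAST", "_GUARD_BOTH_INT", "_BINARY_OP_ADD_INT", "_STORE_FAST"]))
# -> ['_LOAD_FAST', '_GUARD_ADD_STORE']
```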
Here are some updated pair counts after rebasing from main, plus some performance data exploring which subsets of superinstructions might be most valuable. For the moment, this incorporates both "format compatible" superinstructions that don't have overlapping oparg/operand/target, as well as incompatible ones that have overlap.

These are the top 500 UOp pairs when running pyperformance, as of where main was on Friday evening (collapsed table: "Pair counts for top 500 uop pairs"). Pairs of specialized operations that deoptimize and are then followed by …
To assess their potency, I've taken the 326 UOp pairs that account for more than 0.1% of all UOp pairs and run them through the scoring script, which compares the length of the superinstruction in machine instructions to the sum of the lengths of its components. Here are those scores, by both percentage difference and absolute difference:
Scoring for top 326 UOp Pairs (x86_64)
Of course, the specifics of exactly how much shorter a superinstruction is will vary with implementation details and by platform. So, how much shorter is "short enough to be worth it"? Should superinstructions be weighted by their quality and prevalence? Probably some experimentation is needed. And of course there are lots of other ways candidate pairs/sequences could be identified, like @Fidget-Spinner's method above. I tried a couple of arbitrary testing points locally:

Using the 83 pairs* that reduce the machine code count by 20 or more: ~2% faster (local). (Per-tag pyperf tables collapsed. Benchmarks hidden because not significant (24): pprint_safe_repr, fannkuch, nqueens, unpickle, gc_traversal, xml_etree_iterparse, python_startup, scimark_sparse_mat_mult, tornado_http, chaos, unpickle_list, unpickle_pure_python, async_tree_cpu_io_mixed, logging_silent, pprint_pformat, async_tree_eager, xml_etree_generate, pidigits, async_tree_none, html5lib, mdp, async_tree_eager_io_tg, xml_etree_parse, bench_thread_pool)

Using the 251 pairs* that reduce the machine code count by 13 or more: ~4% faster (local). (Per-tag pyperf tables collapsed. Benchmarks hidden because not significant (16): bench_mp_pool, async_tree_none, tornado_http, typing_runtime_protocols, chameleon, unpickle_list, json_loads, asyncio_tcp, scimark_sor, regex_effbot, async_tree_eager_io, async_tree_none_tg, mdp, html5lib, async_tree_eager_io_tg, bench_thread_pool)

*Not including pairs that include …
I thought I'd share an update on the superinstruction experiments. I've started from scratch in this branch by teaching the bytecode analyzer to understand superinstructions:

```
// supernodes.c
super() = _LOAD_FAST_1 + _GUARD_BOTH_INT
super() = _TIER2_RESUME_CHECK + _SET_IP
super() = _BINARY_OP_ADD_INT + _LOAD_CONST_INLINE_BORROW
```

This allows it to calculate and emit metadata and IDs the same way it handles bytecodes:

```
[_LOAD_FAST_1_PLUS__GUARD_BOTH_INT] = HAS_LOCAL_FLAG | HAS_EXIT_FLAG,
[_TIER2_RESUME_CHECK_PLUS__SET_IP] = HAS_DEOPT_FLAG | HAS_OPERAND_FLAG,
[_BINARY_OP_ADD_INT_PLUS__LOAD_CONST_INLINE_BORROW] = HAS_ERROR_FLAG | HAS_PURE_FLAG | HAS_OPERAND_FLAG,
...
```

```
#define _WITH_EXCEPT_START WITH_EXCEPT_START
#define _YIELD_VALUE YIELD_VALUE
#define MAX_VANILLA_UOP_ID 451
#define _LOAD_FAST_1_PLUS__GUARD_BOTH_INT 452
#define _TIER2_RESUME_CHECK_PLUS__SET_IP 453
#define _BINARY_OP_ADD_INT_PLUS__LOAD_CONST_INLINE_BORROW 454
#define MAX_UOP_ID 454
```

We generate a switch statement in a similar way to previous experiments (in jit_switch.c):

```
// jit_switch.c
SuperNode
_JIT_INDEX(const _PyUOpInstruction *uops, uint16_t start_index) {
switch (uops[start_index + 0].opcode) {
case _LOAD_FAST_1:
switch (uops[start_index + 1].opcode) {
case _GUARD_BOTH_INT:
return (SuperNode) {.index = _LOAD_FAST_1_PLUS__GUARD_BOTH_INT, .length = 2};
break;
...
```

The biggest new experiment is automatically iterating on sets of supernodes. Tools/scripts/supernode_analysis.py contains tools for analyzing a set of pystats and deriving a new set of supernodes from the given data. For instance:

```
# Run (up to) 5 generations of build/run-pystats/derive-new-supernodes, using 4 threads for builds, verbose=1, running only the docutils benchmark
$ python Tools/scripts/supernode_analysis.py iterate -v -i5 -j4 -b docutils
Beginning supernode generation process for 10 iterations max
Starting supernode generation 1 of 10
Generating statistics
Updating supernode metadata and building JIT
Generating supernodes from stats
Added 187 of 1224 possible supernodes that make up more than 0.1% of nodes and are viable
Updating supernode metadata and building JIT
Added 128 supernodes
Starting supernode generation 2 of 10
Generating statistics
Stat-ing python with 128 of 128 nodes
Updating supernode metadata and building JIT
...
```
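The derive-new-supernodes step is conceptually simple; a simplified sketch of what it might do (the pystats parsing and the real viability rules are stubbed out, and the 0.1% threshold mirrors the log output above):

```python
def derive_supernodes(pair_counts: dict[tuple[str, str], int],
                      threshold: float = 0.001) -> list[str]:
    """Turn pystats pair counts into supernodes.c-style definitions, keeping
    pairs that make up at least `threshold` of all counted pairs."""
    total = sum(pair_counts.values())
    lines = []
    for (first, second), count in sorted(pair_counts.items(), key=lambda kv: -kv[1]):
        if count / total < threshold:
            break                          # counts are sorted, so we can stop here
        if not is_viable(first, second):
            continue
        lines.append(f"super() = {first} + {second}")
    return lines

def is_viable(first: str, second: str) -> bool:
    # Placeholder rule only; the real check would reject nodes the JIT can't fuse.
    return "_JUMP_TO_TOP" not in (first, second)

print("\n".join(derive_supernodes({
    ("_LOAD_FAST_1", "_GUARD_BOTH_INT"): 900,
    ("_JUMP_TO_TOP", "_SET_IP"): 500,
    ("_RARE_A", "_RARE_B"): 1,
})))
```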
There are still some bugs floating around in superinstruction construction/usage (and possibly in Tier 2 itself?), so this script will, by default, detect errors during Python builds and during the pystats runs, bisect to find the troublesome superinstructions, and remove them from the run:

```
...
Stat-ing python with 28 of 28 nodes
Updating supernodes.c
Updating supernode metadata and building JIT
Stat FAILED, bisecting
Stat-ing python with 14 of 28 nodes
Updating supernodes.c
...
Identified bad node during stat: _START_EXECUTOR_PLUS__POP_TOP
Building Python with 27 nodes
...
```
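The bisection itself is ordinary delta debugging over the node list; roughly like this, with a stand-in build_and_stat callback and the assumption that a single node is to blame:

```python
def find_bad_node(nodes, build_and_stat):
    """Bisect to one supernode whose presence makes the build/pystats run fail.
    `build_and_stat(nodes)` is assumed to return True on success."""
    if build_and_stat(nodes):
        return None                     # everything passes; nothing to blame
    while len(nodes) > 1:
        half = len(nodes) // 2
        first, second = nodes[:half], nodes[half:]
        # Assumes a single bad node: recurse into whichever half still fails.
        nodes = first if not build_and_stat(first) else second
    return nodes[0]

# Toy stand-in: pretend any node set containing this one fails to build/stat.
bad = "_START_EXECUTOR_PLUS__POP_TOP"
print(find_bad_node(["_A", "_B", bad, "_C"], lambda ns: bad not in ns))
```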
There's much more to be done; I'm tracking granular todos in an issue on my fork, but some big areas of investigation: …
It was a pleasure to meet so many people from this thread at PyCon US in May; thanks for the conversations during the open spaces and sprints. I hope these experiments prove useful, and I will share more observations and results as they pop up.
FWIW, I gave another shot at superinstructions. With the newest JIT, it shows no speedup on my computer: https://github.com/Fidget-Spinner/cpython/pull/new/Fidget-Spinner:cpython:supernodes. Even on small microbenchmarks (e.g. iterative Fibonacci). This branch's superinstructions support up to 7 instructions (14 operands!). (Last working commit: d95dcdd55106f7f228ac8c84a979d52cfeb4578b)

I suspect most of the previous wins were from removing the zero-length jumps. Since the JIT is now emitting more efficient code, I don't think we need this anymore.
I tried a true "baseline" JIT: that is, turning tier 1 bytecode directly into JIT stencils where possible: https://github.com/Fidget-Spinner/cpython/pull/new/Fidget-Spinner:cpython:tier1_baseline. This should be a really strong case for superinstructions. However, there's almost no speedup there either on iterative Fibonacci.
Inspired by @brandtbucher's recorded talk from the CPython core sprint, and his work in the Copy & Patch JIT PR, I've worked up a version of that JIT that allows for "superinstructions": that is, pairs/triples/sequences of instructions that are pre-compiled into stencils, the same way single UOps are in the current PR.
The branch at JeffersGlass/cpython/tree/justin-supernodes shows this in action. If you build that branch with `./configure --enable-experimentaljit` and `make`, all of the opcode sequences listed in `Tools/jit/superinstructions.csv` will be built into stencils and made available to the optimizer to JIT-compile with.

I refer to the length of the longest sequence of UOps in a single superinstruction as the "depth" of the superinstruction set. Much of the complexity of this branch stems from the desire to allow the builder to input sequences of any depth, and simply accommodate them in the build process.
Key Changes

- `template.c` files that serve as the basis for creating the JIT stencils must be created. This is handled in `_template.py`, at build time.
- `jit.c` is constructed at build time by `_jic_c.py`, using `_jit_template.c` as a template.
- `jit.c` also includes a new function, `_JIT_INDEX`, which uses a nested switch statement generated from the superinstructions list to select the correct superinstruction (if any) from the upcoming UOps.
- A `jit_defines.h` file is also emitted at build time, with indices for the new superinstructions and some other utility data (`MAX_SUPERINST_ID`).
- The `_stencils.HoleValue` enum is created dynamically after the depth is known.

Places for Improvement
Better Superinstruction Choices
The version of `superinstructions.csv` is, at the moment, a smattering of short sequences that popped up during testing. It's not vetted, and most of those combinations may not even be significantly shorter than just JIT-ing their components individually.

Brandt suggested adding instrumentation to `--enable-pystats` builds to log adjacent op pairs for tier two, like is currently done for tier 1. That's a challenge I would be interested in taking on, time permitting.

Better Superinstruction Selection at JIT-Time
As you'll see in the built `jit.c:_JIT_INDEX()`, the way that the optimizer selects which op or superinstruction to emit is via a giant nested switch statement, which I'm counting on the compiler to "cleverly" turn into something more efficient and compact.

There's surely a better way to do that matching. The Xu/Kjolstad paper mentions a "tree-matching" technique, but I couldn't track it down quickly in either of their reference projects. Or perhaps a windowed lookup of some kind would work.
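For what it's worth, the nested switch is at least cheap to produce: a toy, depth-2-only version of such a generator might look like this (the SuperNode/_JIT_INDEX shape is borrowed from the jit_switch.c excerpt earlier in this thread, and the sequence data is made up):

```python
# Map (first, second) uop sequences to a made-up superinstruction index name.
SEQUENCES = {
    ("_LOAD_FAST_1", "_GUARD_BOTH_INT"): "_LOAD_FAST_1_PLUS__GUARD_BOTH_INT",
}

def emit_index_switch(sequences) -> str:
    """Emit the body of a nested switch that matches pairs of upcoming uops."""
    by_first: dict[str, list[tuple[str, str]]] = {}
    for (first, second), name in sequences.items():
        by_first.setdefault(first, []).append((second, name))
    lines = ["switch (uops[start_index + 0].opcode) {"]
    for first, rest in by_first.items():
        lines.append(f"    case {first}:")
        lines.append("        switch (uops[start_index + 1].opcode) {")
        for second, name in rest:
            lines.append(f"            case {second}:")
            lines.append(f"                return (SuperNode){{.index = {name}, .length = 2}};")
        lines.append("        }")
        lines.append("        break;")
    lines.append("}")
    return "\n".join(lines)

print(emit_index_switch(SEQUENCES))
```

A tree/trie matcher would replace the inner switches with a walk over a prefix tree of sequences, which is probably the cleaner route once depths grow past two or three.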
Benchmarking
I haven't done any benchmarking with the paltry 7 superinstructions currently used, since I don't actually expect it to be faster (yet?).
Cross-Compilation
This is only tested on x86_64 Linux, as that's what I have access to and have a build environment set up for. I'd be really curious whether it works elsewhere.
This was mostly an experiment for my own edification, and to become more familiar with the new JIT/UOp internals. I hope some of it is useful and interesting.
Thanks to Brandt for his welcoming energy in the Python discord, and for answering my questions.