
Superinstructions for Copy & Patch JIT #647

JeffersGlass opened this issue Jan 25, 2024 · 19 comments


@JeffersGlass
Contributor

JeffersGlass commented Jan 25, 2024

Inspired by @brandtbucher's recorded talk from the CPython core sprint, and his work in the Copy & Patch JIT PR, I've worked up a version of that JIT that allows for 'superinstructions'. That is, pairs/triples/sequences of instructions that are pre-compiled into stencils, the same way single UOps are in the current PR.

The branch at JeffersGlass/cpython/tree/justin-supernodes shows this in action. If you build that branch with ./configure --enable-experimentaljit followed by make, all of the opcode sequences listed in Tools/jit/superinstructions.csv will be built into stencils and made available to the optimizer to JIT-compile with.

I refer to the length of the longest sequence of UOps in a single superinstruction as the "depth" of the superinstruction set. Much of the complexity of this branch stems from the desire to allow the builder to input sequences of any depth, and simply accommodate them in the build process.

Key Changes

  • Multiple template.c files that serve as the basis for creating the JIT stencils must be created. This is handled in _template.py, at build-time.
  • jit.c is constructed at build-time by _jic_c.py, using _jit_template.c as a template.
    • jit.c also includes a new function, _JIT_INDEX, which uses a nested switch statement generated from the superinstructions list to select the correct superinstruction (if any) from the upcoming UOps
  • A new jit_defines.h file is also emitted at build time, with indices for the new superinstructions and some other utility data (MAX_SUPERINST_ID)
  • The _stencils.HoleValue Enum is created dynamically after the depth is known
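One way to picture the last bullet is Python's functional Enum API, which can build an enum from a list of names computed at build time. This is only a sketch under assumptions: the hole names and the per-position suffix scheme (OPARG_0, OPARG_1, ...) are hypothetical and not taken from the branch.

```python
import enum

# Hypothetical hole names; the real _stencils.HoleValue members may differ.
SHARED_HOLES = ["CODE", "DATA", "CONTINUE"]
PER_UOP_HOLES = ["OPARG", "OPERAND", "TARGET"]


def make_hole_value_enum(depth: int) -> type[enum.Enum]:
    """Build an Enum with one member per (hole, position) for positions < depth."""
    names = list(SHARED_HOLES)
    for position in range(depth):
        names.extend(f"{hole}_{position}" for hole in PER_UOP_HOLES)
    # Functional API: Enum(class_name, names) assigns values 1, 2, 3, ...
    return enum.Enum("HoleValue", names)


HoleValue = make_hole_value_enum(depth=3)
print([member.name for member in HoleValue])
```

The point of the sketch is only that the member list is a function of the depth, so the enum cannot be written out by hand before the superinstruction list has been read.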

Places for Improvement

Better Superinstruction Choices

The current version of superinstructions.csv is a smattering of short sequences that popped up during testing. It's not vetted, and most of those combinations may not even be significantly shorter than just JIT-ing their components individually.

Brandt suggested adding instrumentation to --enable-pystats to log adjacent op pairs for tier two, as is currently done for tier one. That's a challenge I would be interested in taking on, time permitting.

Better Superinstruction Selection at JIT-Time

As you'll see in the built jit.c:_JIT_INDEX(), the way that the optimizer selects which op or superinstruction to emit is via a giant nested switch statement, which I'm counting on the compiler to 'cleverly' turn into something more efficient and compact.

There's surely a better way to do that matching. The Xu/Kjolstad paper mentions a "tree-matching" technique, but I couldn't track it down quickly in either of their reference projects. Or perhaps a windowed lookup of some kind would work.
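As an illustration of what a tree-shaped match could look like, here is a minimal sketch of a prefix tree (trie) over UOp names that returns the longest known superinstruction starting at a given trace position. It is not the technique from the paper or the code in the branch; the function names and the superinstruction ids (600, 601) are hypothetical.

```python
from typing import Optional


def build_trie(sequences: dict[tuple[str, ...], int]) -> dict:
    """Nested dicts keyed by uop name; the key None stores a superinstruction id."""
    root: dict = {}
    for sequence, super_id in sequences.items():
        node = root
        for uop in sequence:
            node = node.setdefault(uop, {})
        node[None] = super_id
    return root


def longest_match(trie: dict, trace: list[str], start: int) -> tuple[Optional[int], int]:
    """Return (super_id, length) for the longest sequence beginning at `start`.

    Falls back to (None, 1), meaning "emit the single uop", when nothing matches.
    """
    node = trie
    best: tuple[Optional[int], int] = (None, 1)
    for length, uop in enumerate(trace[start:], start=1):
        node = node.get(uop)
        if node is None:
            break
        if None in node:
            best = (node[None], length)
    return best


# Hypothetical superinstruction ids for two sequences.
SUPERS = {
    ("_TO_BOOL_BOOL", "_GUARD_IS_TRUE_POP"): 600,
    ("_LOAD_FAST", "_TO_BOOL_BOOL", "_GUARD_IS_TRUE_POP"): 601,
}
trie = build_trie(SUPERS)
trace = ["_LOAD_FAST", "_TO_BOOL_BOOL", "_GUARD_IS_TRUE_POP", "_LOAD_FAST"]
print(longest_match(trie, trace, 0))  # (601, 3)
print(longest_match(trie, trace, 1))  # (600, 2)
print(longest_match(trie, trace, 3))  # (None, 1)
```

A generated nested switch encodes the same tree in control flow. The trie keeps it as data, which makes the matching cost proportional to the depth rather than to the number of superinstructions.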

Benchmarking

I haven't done any benchmarking with the paltry 7 superinstructions currently used, since I don't actually expect it to be faster (yet?).

Cross-Compilation

This is only tested on x86_64 Linux, as that's what I have access to and a build environment set up for. I'd be really curious if it works elsewhere.


This was mostly an experiment for my own edification, and to become more familiar with the new JIT/UOp internals. I hope some of it is useful and interesting.

Thanks to Brandt for his welcoming energy in the Python discord, and for answering my questions.

@brandtbucher
Member

brandtbucher commented Jan 26, 2024

Thanks for exploring this, @JeffersGlass! In the interest of getting everyone on the same page, I'll repeat one of my comments from Discord:

FWIW, the best pairs will probably be the ones that compile to code that is much better than just their concatenated parts. _GUARD_BOTH_INT and _BINARY_OP_ADD_INT are definitely a common pair (at least right now), but there's not too much "clever" stuff that can happen by letting LLVM see them together. A better example would be something like _LOAD_FAST + _TO_BOOL_BOOL + _GUARD_IS_TRUE_POP, where LLVM could in theory elide the pushes, pops, and refcounts entirely.

Also, I'm still on the fence about whether it makes more sense to do superinstructions in the JIT like you have (and I also prototyped a while back), or just make them their own uops that are created during the translation pass. The latter makes things a lot cleaner for the JIT, since they look like just any other uop... at the expense of needing to handle multiple opargs/operands/targets/etc. for "single" tier 2 instructions everywhere.

As you've noted here, handling this in the JIT introduces quite a bit of complexity, and I'm still sort of leaning towards making superinstructions into their own uops, since we already have the necessary machinery to generate an interpreter loop containing some concatenated uops (and it also benefits the tier two interpreter). We would just need to tweak the tier two instruction format a bit.

Quoting you now:

Much of the complexity of this branch stems from the desire to allow the builder to input sequences of any depth, and simply accommodate them in the build process.

Then let's not allow them to be any depth! Something like 4 should be more than enough to get started.

As a quick experiment, I just prototyped a template.c style file that handles superinstructions of any length <= 4. It looks like Clang is happy to unroll the loop for us (uncomment one of the _JIT_OPCODES defines at the top to see):

https://godbolt.org/z/7ccP5offd

No need to generate any new files. :)

Brandt suggested adding instrumentation to --enable-pystats to log adjacent op pairs for tier two, as is currently done for tier one. That's a challenge I would be interested in taking on, time permitting.

I think this is a great next step. Once we have lists of common pairs, we can start evaluating sequences that are good candidates for combining.

There's surely a better way to do that matching. The Xu/Kjolstad paper mentions a "tree-matching" technique, but I couldn't track it down quickly in either of their reference projects. Or perhaps a windowed lookup of some kind would work.

Let's not get too hung up on lookup speed right now, and assume it's a solved problem. There are many potential options available to us (double-lookup, binary search, hash table, etc.).
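For a sense of how small the hash-table option can be once the depth is capped, here is a sketch of a windowed lookup that tries the longest window first and then greedily covers a trace. The table contents, the id 600, and the function names are assumptions made for illustration; MAX_DEPTH mirrors the cap of 4 suggested earlier in this comment.

```python
from typing import Optional, Union

MAX_DEPTH = 4  # cap on superinstruction length, as suggested above


def select(table: dict[tuple[str, ...], int], trace: list[str],
           start: int) -> tuple[Optional[int], int]:
    """Try windows of length MAX_DEPTH down to 2; fall back to a single uop."""
    longest = min(MAX_DEPTH, len(trace) - start)
    for length in range(longest, 1, -1):
        window = tuple(trace[start:start + length])
        if window in table:
            return table[window], length
    return None, 1


def cover(table: dict[tuple[str, ...], int], trace: list[str]) -> list[Union[int, str]]:
    """Greedily replace matched windows with superinstruction ids."""
    out: list[Union[int, str]] = []
    i = 0
    while i < len(trace):
        super_id, length = select(table, trace, i)
        out.append(super_id if super_id is not None else trace[i])
        i += length
    return out


# Hypothetical table with a single superinstruction.
TABLE = {("_GUARD_BOTH_INT", "_BINARY_OP_ADD_INT"): 600}
TRACE = ["_LOAD_FAST", "_GUARD_BOTH_INT", "_BINARY_OP_ADD_INT", "_STORE_FAST"]
print(cover(TABLE, TRACE))  # ['_LOAD_FAST', 600, '_STORE_FAST']
```

With the depth fixed at 4, each position costs at most three dictionary probes, which supports treating lookup speed as a solved problem for now.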

I haven't done any benchmarking with the paltry 7 superinstructions currently used, since I don't actually expect it to be faster (yet?).

Don't worry, we have dedicated benchmarking infrastructure to both collect stats and measure performance on a bunch of platforms.

This is only tested on x86_64 Linux, as that's what I have access to and a build environment set up for. I'd be really curious if it works elsewhere.

My justin branch has CI in a file called jit.yml. That will run everything on 7 different platforms (basically everything except AArch64 macOS), which will at least give you the confidence that it passes all of the tests.

@JeffersGlass
Contributor Author

JeffersGlass commented Jan 28, 2024

My branch at JeffersGlass/pystats-uop-pairs now has functional tracking of adjacent UOp pairs in executors. The results are also output as part of summarize_stats.py. Here's some sample output:

Pair counts for top 100 uop pairs
Pair Count Self Cumulative
_LOAD_FAST _SET_IP 711,096 6.2% 6.2%
_LOAD_FAST _LOAD_FAST 361,358 3.2% 9.4%
_STORE_FAST _LOAD_FAST 308,227 2.7% 12.0%
_CHECK_VALIDITY _LOAD_FAST 274,202 2.4% 14.4%
_GUARD_GLOBALS_VERSION _GUARD_BUILTINS_VERSION 251,876 2.2% 16.6%
_GUARD_BUILTINS_VERSION _LOAD_GLOBAL_BUILTINS 251,747 2.2% 18.8%
_SET_IP _GUARD_TYPE_VERSION 239,154 2.1% 20.9%
_SET_IP _CHECK_VALIDITY 227,690 2.0% 22.9%
_GUARD_GLOBALS_VERSION _LOAD_GLOBAL_MODULE 197,805 1.7% 24.6%
_CHECK_VALIDITY _STORE_FAST 197,573 1.7% 26.4%
_LOAD_CONST_INLINE_BORROW _SET_IP 197,128 1.7% 28.1%
_GUARD_TYPE_VERSION _LOAD_ATTR_METHOD_NO_DICT 194,127 1.7% 29.8%
_CHECK_VALIDITY _GUARD_IS_FALSE_POP 187,039 1.6% 31.4%
_LOAD_FAST _LOAD_CONST_INLINE_BORROW 185,910 1.6% 33.0%
_LOAD_FAST _GUARD_GLOBALS_VERSION 156,703 1.4% 34.4%
_LOAD_GLOBAL_BUILTINS _LOAD_FAST 153,591 1.3% 35.7%
_LOAD_ATTR_METHOD_NO_DICT _CHECK_VALIDITY 146,431 1.3% 37.0%
_TO_BOOL_BOOL _GUARD_IS_TRUE_POP 145,145 1.3% 38.3%
atexit _SET_IP 144,945 1.3% 39.5%
_CHECK_VALIDITY _TO_BOOL_BOOL 144,260 1.3% 40.8%
_CALL_ISINSTANCE _CHECK_VALIDITY 142,796 1.2% 42.1%
_SET_IP _FOR_ITER_TIER_TWO 136,740 1.2% 43.2%
_CHECK_FUNCTION_EXACT_ARGS _CHECK_STACK_SPACE 136,617 1.2% 44.4%
_CHECK_STACK_SPACE _INIT_CALL_PY_EXACT_ARGS 136,617 1.2% 45.6%
_INIT_CALL_PY_EXACT_ARGS _SAVE_RETURN_OFFSET 136,617 1.2% 46.8%
_SAVE_RETURN_OFFSET _PUSH_FRAME 136,617 1.2% 48.0%
_SET_IP _CALL_METHOD_DESCRIPTOR_FAST 134,172 1.2% 49.2%
_CALL_METHOD_DESCRIPTOR_FAST _CHECK_VALIDITY 128,940 1.1% 50.3%
_CHECK_VALIDITY _RESUME_CHECK 128,605 1.1% 51.4%
_PUSH_FRAME _CHECK_VALIDITY 128,018 1.1% 52.5%
_FOR_ITER_TIER_TWO _CHECK_VALIDITY 127,519 1.1% 53.7%
_UNPACK_SEQUENCE_TWO_TUPLE _STORE_FAST 122,284 1.1% 54.7%
_CONTAINS_OP _CHECK_VALIDITY 119,365 1.0% 55.8%
_SET_IP _CONTAINS_OP 119,355 1.0% 56.8%
_STORE_FAST _GUARD_GLOBALS_VERSION 116,993 1.0% 57.8%
_STORE_FAST _STORE_FAST 116,369 1.0% 58.8%
_LOAD_FAST _GUARD_TYPE_VERSION 115,447 1.0% 59.9%
_POP_FRAME _CHECK_VALIDITY 109,397 1.0% 60.8%
_SET_IP _CALL_ISINSTANCE 103,697 0.9% 61.7%
atexit _ITER_CHECK_LIST 101,238 0.9% 62.6%
_SET_IP _CHECK_FUNCTION_EXACT_ARGS 99,374 0.9% 63.5%
_LOAD_GLOBAL_MODULE _SET_IP 97,857 0.9% 64.3%
_GUARD_IS_FALSE_POP _LOAD_FAST 97,769 0.9% 65.2%
_LOAD_GLOBAL_BUILTINS _SET_IP 97,156 0.8% 66.0%
_GUARD_IS_TRUE_POP _LOAD_FAST 95,694 0.8% 66.9%
_ITER_CHECK_LIST _GUARD_NOT_EXHAUSTED_LIST 92,947 0.8% 67.7%
_SET_IP _BINARY_SUBSCR_DICT 90,748 0.8% 68.5%
_LOAD_GLOBAL_MODULE _LOAD_FAST 88,716 0.8% 69.2%
_LOAD_CONST_INLINE_BORROW _LOAD_FAST 83,894 0.7% 70.0%
_GUARD_IS_TRUE_POP _GUARD_GLOBALS_VERSION 83,617 0.7% 70.7%
_SET_IP _LOAD_ATTR 82,827 0.7% 71.4%
_SET_IP _COMPARE_OP_STR 82,410 0.7% 72.1%
_RESUME_CHECK _GUARD_GLOBALS_VERSION 82,349 0.7% 72.9%
_CHECK_VALIDITY _UNPACK_SEQUENCE_TWO_TUPLE 79,318 0.7% 73.5%
_GUARD_NOT_EXHAUSTED_LIST _ITER_NEXT_LIST 78,943 0.7% 74.2%
_COMPARE_OP_STR _CHECK_VALIDITY 75,901 0.7% 74.9%
_SET_IP _GUARD_BOTH_INT 69,759 0.6% 75.5%
_GUARD_TYPE_VERSION _CHECK_MANAGED_OBJECT_HAS_VALUES 64,321 0.6% 76.1%
_CHECK_MANAGED_OBJECT_HAS_VALUES _LOAD_ATTR_INSTANCE_VALUE 64,321 0.6% 76.6%
_GUARD_BOTH_INT _BINARY_OP_ADD_INT 62,519 0.5% 77.2%
_POP_TOP _LOAD_FAST 60,577 0.5% 77.7%
_CHECK_VALIDITY _POP_TOP 57,490 0.5% 78.2%
_CHECK_VALIDITY _JUMP_TO_TOP 55,969 0.5% 78.7%
_GUARD_TYPE_VERSION _LOAD_ATTR_SLOT 51,204 0.4% 79.1%
_CHECK_VALIDITY _POP_FRAME 50,063 0.4% 79.6%
_LOAD_ATTR_SLOT _SET_IP 49,289 0.4% 80.0%
_LOAD_CONST_INLINE_BORROW _POP_FRAME 49,252 0.4% 80.4%
_GUARD_IS_FALSE_POP _JUMP_TO_TOP 47,970 0.4% 80.9%
_LOAD_ATTR_METHOD_NO_DICT _SET_IP 47,696 0.4% 81.3%
_CALL_METHOD_DESCRIPTOR_FAST _SET_IP 47,408 0.4% 81.7%
_BINARY_SUBSCR_DICT _CHECK_VALIDITY 46,978 0.4% 82.1%
_LOAD_FAST _TO_BOOL_STR 44,361 0.4% 82.5%
_STORE_SUBSCR_DICT _CHECK_VALIDITY 44,238 0.4% 82.9%
_SET_IP _STORE_SUBSCR_DICT 44,175 0.4% 83.3%
_RESUME_CHECK _LOAD_FAST 44,169 0.4% 83.6%
_BINARY_SUBSCR_DICT _SET_IP 43,673 0.4% 84.0%
_TO_BOOL_STR _GUARD_IS_TRUE_POP 43,145 0.4% 84.4%
_LOAD_ATTR _CHECK_VALIDITY 43,063 0.4% 84.8%
_CHECK_VALIDITY _CALL_METHOD_DESCRIPTOR_FAST 42,176 0.4% 85.1%
_STORE_FAST _LOAD_CONST_INLINE_BORROW 41,992 0.4% 85.5%
_ITER_NEXT_LIST _UNPACK_SEQUENCE_TWO_TUPLE 41,772 0.4% 85.9%
_CHECK_VALIDITY _COMPARE_OP_FLOAT 41,655 0.4% 86.2%
_LOAD_ATTR _SET_IP 40,322 0.4% 86.6%
_GUARD_IS_FALSE_POP _LOAD_CONST_INLINE_BORROW 39,393 0.3% 86.9%
_CHECK_VALIDITY _CALL_ISINSTANCE 39,099 0.3% 87.3%
_CHECK_VALIDITY _GUARD_IS_NONE_POP 39,048 0.3% 87.6%
_ITER_CHECK_TUPLE _GUARD_NOT_EXHAUSTED_TUPLE 37,709 0.3% 87.9%
_ITER_NEXT_LIST _STORE_FAST 37,170 0.3% 88.3%
_SET_IP _LOAD_CONST_INLINE_BORROW 37,112 0.3% 88.6%
_GUARD_IS_NONE_POP _SET_IP 37,054 0.3% 88.9%
atexit _ITER_CHECK_TUPLE 35,909 0.3% 89.2%
_LOAD_FAST _BINARY_SUBSCR_STR_INT 33,275 0.3% 89.5%
_LOAD_ATTR_INSTANCE_VALUE _STORE_FAST 32,604 0.3% 89.8%
atexit _LOAD_FAST 31,315 0.3% 90.1%
_GUARD_NOT_EXHAUSTED_TUPLE _ITER_NEXT_TUPLE 30,136 0.3% 90.3%
_BINARY_OP_ADD_INT _STORE_FAST 29,829 0.3% 90.6%
_ITER_NEXT_TUPLE _STORE_FAST 29,229 0.3% 90.9%
_CHECK_VALIDITY _LOAD_CONST_INLINE_BORROW 26,541 0.2% 91.1%
_CHECK_VALIDITY _CHECK_FUNCTION_EXACT_ARGS 25,673 0.2% 91.3%
_SET_IP _COMPARE_OP_INT 25,400 0.2% 91.5%

The pair atexit _SET_IP is a byproduct of my using the sentinel value of 511 as the "last opcode" at the start of each JIT execution. Looks a bit odd, I can clean that up. =)
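The counting itself can be sketched in a few lines of Python. This is not the code in the branch (which lives in C and in summarize_stats.py); it assumes that "Self" is a pair's share of all counted pairs and "Cumulative" is the running sum of those shares, and it uses a private sentinel object so that the first UOp of each executor never produces a spurious pair.

```python
from collections import Counter

START = object()  # hypothetical sentinel for "no previous uop" at executor entry


def count_pairs(traces: list[list[str]]) -> Counter:
    """Count adjacent (previous, current) uop pairs within each trace."""
    pairs: Counter = Counter()
    for trace in traces:
        last = START
        for uop in trace:
            if last is not START:  # skip the sentinel so no bogus pair is logged
                pairs[(last, uop)] += 1
            last = uop
    return pairs


def summarize(pairs: Counter, top: int = 100) -> list[tuple[str, str, int, float, float]]:
    """Rows of (first, second, count, self_share, cumulative_share)."""
    total = sum(pairs.values())
    cumulative = 0
    rows = []
    for (first, second), count in pairs.most_common(top):
        cumulative += count
        rows.append((first, second, count, count / total, cumulative / total))
    return rows


traces = [
    ["_LOAD_FAST", "_LOAD_FAST", "_STORE_FAST"],
    ["_LOAD_FAST", "_LOAD_FAST"],
]
for first, second, count, self_share, cum_share in summarize(count_pairs(traces)):
    print(f"{first} {second} {count:,} {self_share:.1%} {cum_share:.1%}")
```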

Currently, this branch works in the JIT by building a call to an _inc_uop_stats function into the JIT template, with the appropriate arguments (LASTUOP) passed in at runtime. My original idea was slightly different: since we know the address to be incremented (_Py_stats->optimization_stats.opcode[lastopname].pair_count[opname]) at patch-time, we could just pass that memory location to the JIT template and increment it directly. But for some reason I couldn't get that to work; perhaps I'll give it another go at some point.

That branch works in the JIT'd case, and I've built in what I think is necessary to run it sans-JIT, but... I realize I don't know how to build/run the Tier 2 interpreter without the JIT. I found the -X uops arg... is there a flag to pass to configure as well?

Related - Is there a simple way to run pyperformance locally with pystats enabled?

If it seems like this is on the right track at least, I can work on turning it into a PR. I may need a little guidance on code organization, since I splashed the new function and its declaration in a little haphazardly. That is in addition to figuring out how to make the stats call an "optional" part of the template.

After that, I think a script to "score" pairs/sequences of UOps (i.e. how much shorter they are when compiled together vs. separately) could be useful and interesting to build.
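A scoring script of that kind could start as small as the sketch below. The Candidate type and its field names are hypothetical; the three sample rows reuse length figures that are posted later in this thread, and the lengths are simply whatever unit the _code_body measurement uses.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    uops: tuple[str, ...]
    separate_len: int  # sum of code body lengths when each uop is compiled alone
    combined_len: int  # code body length when compiled as one superinstruction

    @property
    def ratio(self) -> float:
        """Combined length as a fraction of the separate length (lower is better)."""
        return self.combined_len / self.separate_len

    @property
    def saved(self) -> int:
        """Absolute length saved by compiling the sequence together."""
        return self.separate_len - self.combined_len


candidates = [
    Candidate(("_TO_BOOL_BOOL", "_GUARD_IS_TRUE_POP"), 149, 104),
    Candidate(("_CHECK_VALIDITY", "_RESUME_CHECK"), 138, 100),
    Candidate(("_TO_BOOL_STR", "_GUARD_IS_TRUE_POP"), 272, 200),
]

by_ratio = sorted(candidates, key=lambda c: c.ratio)
by_saved = sorted(candidates, key=lambda c: c.saved, reverse=True)

for c in by_ratio:
    print(" / ".join(c.uops), f"{c.ratio:.1%}", c.saved)
```

Ranking by ratio and ranking by absolute savings can disagree, as they do for these three rows, so a useful score would probably also weight each candidate by how often the sequence actually occurs.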

Thanks again for your support and time, I'm having a swell time getting to know the jit internals better.

@JeffersGlass
Contributor Author

As a brief experiment, I ran (local) pyperformance on both the current 3.13.0a3 build and my experimental jitted version above with the 96 opcode pairs listed in the previous post. In general, the jitted version is just a little slower, though that's to be expected given the caveats we've been talking about and the fact that most of these superinstructions probably don't help much. A few specific benchmarks were faster, though.

Curiously, bench_mp_pool was 4.77 times slower. I wonder why!

Stats for pyperf compare mainall.json jitall.json

mainall.json

Performance version: 1.10.0
Report on Linux-6.5.0-14-generic-x86_64-with-glibc2.35
Number of logical CPUs: 8
Start date: 2024-01-29 07:34:11.664325
End date: 2024-01-29 08:24:42.102909

jitall.json

Performance version: 1.10.0
Report on Linux-6.5.0-14-generic-x86_64-with-glibc2.35
Number of logical CPUs: 8
Start date: 2024-01-28 19:18:03.068205
End date: 2024-01-28 20:09:39.180202

2to3

Mean +- std dev: 261 ms +- 2 ms -> 274 ms +- 3 ms: 1.05x slower
Significant (t=-27.65)

async_generators

Mean +- std dev: 400 ms +- 4 ms -> 408 ms +- 5 ms: 1.02x slower
Significant (t=-10.18)

async_tree_cpu_io_mixed

Mean +- std dev: 657 ms +- 18 ms -> 737 ms +- 21 ms: 1.12x slower
Significant (t=-22.27)

async_tree_cpu_io_mixed_tg

Mean +- std dev: 662 ms +- 17 ms -> 752 ms +- 26 ms: 1.14x slower
Significant (t=-22.38)

async_tree_eager

Mean +- std dev: 110 ms +- 1 ms -> 121 ms +- 4 ms: 1.10x slower
Significant (t=-20.61)

async_tree_eager_cpu_io_mixed

Mean +- std dev: 411 ms +- 10 ms -> 477 ms +- 7 ms: 1.16x slower
Significant (t=-42.43)

async_tree_eager_cpu_io_mixed_tg

Mean +- std dev: 356 ms +- 7 ms -> 417 ms +- 6 ms: 1.17x slower
Significant (t=-51.94)

async_tree_eager_io

Mean +- std dev: 1.02 sec +- 0.05 sec -> 1.03 sec +- 0.05 sec: 1.01x slower
Not significant

async_tree_eager_io_tg

Mean +- std dev: 1.04 sec +- 0.05 sec -> 1.06 sec +- 0.06 sec: 1.02x slower
Not significant

async_tree_eager_memoization

Mean +- std dev: 257 ms +- 4 ms -> 262 ms +- 4 ms: 1.02x slower
Significant (t=-8.01)

async_tree_eager_memoization_tg

Mean +- std dev: 200 ms +- 6 ms -> 203 ms +- 6 ms: 1.01x slower
Not significant

async_tree_eager_tg

Mean +- std dev: 76.9 ms +- 1.3 ms -> 80.4 ms +- 1.9 ms: 1.05x slower
Significant (t=-11.73)

async_tree_io

Mean +- std dev: 1.01 sec +- 0.01 sec -> 1.04 sec +- 0.03 sec: 1.02x slower
Significant (t=-5.63)

async_tree_io_tg

Mean +- std dev: 1.01 sec +- 0.01 sec -> 1.04 sec +- 0.01 sec: 1.02x slower
Significant (t=-11.47)

async_tree_memoization

Mean +- std dev: 506 ms +- 16 ms -> 512 ms +- 14 ms: 1.01x slower
Not significant

async_tree_memoization_tg

Mean +- std dev: 507 ms +- 12 ms -> 517 ms +- 14 ms: 1.02x slower
Not significant

async_tree_none

Mean +- std dev: 396 ms +- 20 ms -> 413 ms +- 24 ms: 1.04x slower
Significant (t=-4.22)

async_tree_none_tg

Mean +- std dev: 403 ms +- 3 ms -> 408 ms +- 7 ms: 1.01x slower
Not significant

asyncio_tcp

Mean +- std dev: 380 ms +- 7 ms -> 385 ms +- 7 ms: 1.01x slower
Not significant

asyncio_tcp_ssl

Mean +- std dev: 1.29 sec +- 0.01 sec -> 1.31 sec +- 0.01 sec: 1.01x slower
Not significant

asyncio_websockets

Mean +- std dev: 435 ms +- 2 ms -> 437 ms +- 2 ms: 1.00x slower
Not significant

bench_mp_pool

Mean +- std dev: 10.5 ms +- 5.5 ms -> 50.2 ms +- 35.9 ms: 4.77x slower
Significant (t=-8.47)

bench_thread_pool

Mean +- std dev: 4.02 ms +- 0.80 ms -> 3.87 ms +- 0.91 ms: 1.04x faster
Not significant

chameleon

Mean +- std dev: 6.70 ms +- 0.08 ms -> 6.78 ms +- 0.22 ms: 1.01x slower
Not significant

chaos

Mean +- std dev: 56.3 ms +- 0.5 ms -> 65.9 ms +- 1.5 ms: 1.17x slower
Significant (t=-46.87)

comprehensions

Mean +- std dev: 15.9 us +- 0.1 us -> 17.3 us +- 0.2 us: 1.09x slower
Significant (t=-45.85)

coroutines

Mean +- std dev: 22.0 ms +- 0.1 ms -> 22.0 ms +- 0.3 ms: 1.00x faster
Not significant

create_gc_cycles

Mean +- std dev: 1.01 ms +- 0.01 ms -> 1.01 ms +- 0.01 ms: 1.00x slower
Not significant

crypto_pyaes

Mean +- std dev: 64.8 ms +- 0.7 ms -> 73.4 ms +- 1.8 ms: 1.13x slower
Significant (t=-34.25)

dask

Mean +- std dev: 630 ms +- 12 ms -> 643 ms +- 11 ms: 1.02x slower
Significant (t=-6.21)

deepcopy

Mean +- std dev: 337 us +- 3 us -> 339 us +- 4 us: 1.01x slower
Not significant

deepcopy_memo

Mean +- std dev: 33.9 us +- 0.5 us -> 34.5 us +- 0.7 us: 1.02x slower
Not significant

deepcopy_reduce

Mean +- std dev: 3.09 us +- 0.03 us -> 3.15 us +- 0.04 us: 1.02x slower
Not significant

deltablue

Mean +- std dev: 3.20 ms +- 0.05 ms -> 3.49 ms +- 0.04 ms: 1.09x slower
Significant (t=-36.14)

docutils

Mean +- std dev: 2.45 sec +- 0.02 sec -> 2.49 sec +- 0.03 sec: 1.02x slower
Not significant

dulwich_log

Mean +- std dev: 75.1 ms +- 1.1 ms -> 76.9 ms +- 0.9 ms: 1.02x slower
Significant (t=-10.13)

fannkuch

Mean +- std dev: 387 ms +- 4 ms -> 406 ms +- 7 ms: 1.05x slower
Significant (t=-19.60)

float

Mean +- std dev: 74.2 ms +- 0.9 ms -> 75.3 ms +- 0.9 ms: 1.01x slower
Not significant

gc_traversal

Mean +- std dev: 2.94 ms +- 0.01 ms -> 3.30 ms +- 0.02 ms: 1.12x slower
Significant (t=-107.75)

generators

Mean +- std dev: 30.8 ms +- 0.3 ms -> 26.2 ms +- 0.3 ms: 1.18x faster
Significant (t=84.31)

genshi_text

Mean +- std dev: 22.3 ms +- 0.3 ms -> 21.8 ms +- 0.3 ms: 1.02x faster
Significant (t=10.24)

genshi_xml

Mean +- std dev: 52.4 ms +- 0.8 ms -> 52.1 ms +- 1.0 ms: 1.01x faster
Not significant

go

Mean +- std dev: 124 ms +- 1 ms -> 133 ms +- 2 ms: 1.07x slower
Significant (t=-38.20)

hexiom

Mean +- std dev: 5.79 ms +- 0.04 ms -> 6.81 ms +- 0.08 ms: 1.18x slower
Significant (t=-84.83)

html5lib

Mean +- std dev: 62.0 ms +- 2.5 ms -> 63.3 ms +- 2.7 ms: 1.02x slower
Significant (t=-2.78)

json_dumps

Mean +- std dev: 10.2 ms +- 0.1 ms -> 10.4 ms +- 0.3 ms: 1.02x slower
Not significant

json_loads

Mean +- std dev: 25.8 us +- 0.7 us -> 26.2 us +- 0.2 us: 1.01x slower
Not significant

logging_format

Mean +- std dev: 7.20 us +- 0.16 us -> 7.50 us +- 0.21 us: 1.04x slower
Significant (t=-8.48)

logging_silent

Mean +- std dev: 97.1 ns +- 1.4 ns -> 100.8 ns +- 1.3 ns: 1.04x slower
Significant (t=-15.19)

logging_simple

Mean +- std dev: 6.33 us +- 0.14 us -> 6.53 us +- 0.15 us: 1.03x slower
Significant (t=-7.73)

mako

Mean +- std dev: 10.2 ms +- 0.1 ms -> 11.4 ms +- 0.6 ms: 1.12x slower
Significant (t=-16.03)

mdp

Mean +- std dev: 2.91 sec +- 0.06 sec -> 2.82 sec +- 0.02 sec: 1.03x faster
Significant (t=11.03)

meteor_contest

Mean +- std dev: 96.9 ms +- 0.9 ms -> 96.1 ms +- 0.5 ms: 1.01x faster
Not significant

nbody

Mean +- std dev: 78.0 ms +- 1.6 ms -> 84.0 ms +- 0.7 ms: 1.08x slower
Significant (t=-27.26)

nqueens

Mean +- std dev: 84.9 ms +- 0.7 ms -> 87.7 ms +- 1.3 ms: 1.03x slower
Significant (t=-14.92)

pathlib

Mean +- std dev: 20.3 ms +- 0.4 ms -> 20.9 ms +- 0.5 ms: 1.03x slower
Significant (t=-7.69)

pickle

Mean +- std dev: 11.3 us +- 0.1 us -> 11.2 us +- 0.2 us: 1.00x faster
Not significant

pickle_dict

Mean +- std dev: 31.8 us +- 0.2 us -> 30.7 us +- 0.2 us: 1.03x faster
Significant (t=35.19)

pickle_list

Mean +- std dev: 4.88 us +- 0.03 us -> 4.79 us +- 0.05 us: 1.02x faster
Not significant

pickle_pure_python

Mean +- std dev: 279 us +- 4 us -> 281 us +- 4 us: 1.01x slower
Not significant

pidigits

Mean +- std dev: 170 ms +- 0 ms -> 189 ms +- 2 ms: 1.11x slower
Significant (t=-65.49)

pprint_pformat

Mean +- std dev: 1.52 sec +- 0.01 sec -> 1.64 sec +- 0.03 sec: 1.08x slower
Significant (t=-29.38)

pprint_safe_repr

Mean +- std dev: 743 ms +- 7 ms -> 794 ms +- 21 ms: 1.07x slower
Significant (t=-18.02)

pyflate

Mean +- std dev: 403 ms +- 3 ms -> 435 ms +- 9 ms: 1.08x slower
Significant (t=-24.86)

python_startup

Mean +- std dev: 11.5 ms +- 1.6 ms -> 11.2 ms +- 1.2 ms: 1.03x faster
Significant (t=2.11)

python_startup_no_site

Mean +- std dev: 10.3 ms +- 1.4 ms -> 10.5 ms +- 1.5 ms: 1.02x slower
Not significant

raytrace

Mean +- std dev: 242 ms +- 1 ms -> 253 ms +- 2 ms: 1.05x slower
Significant (t=-37.83)

regex_compile

Mean +- std dev: 127 ms +- 1 ms -> 135 ms +- 1 ms: 1.06x slower
Significant (t=-50.19)

regex_dna

Mean +- std dev: 154 ms +- 1 ms -> 163 ms +- 1 ms: 1.06x slower
Significant (t=-58.46)

regex_effbot

Mean +- std dev: 2.80 ms +- 0.05 ms -> 2.73 ms +- 0.02 ms: 1.03x faster
Significant (t=10.54)

regex_v8

Mean +- std dev: 22.0 ms +- 0.1 ms -> 21.3 ms +- 0.1 ms: 1.03x faster
Significant (t=34.04)

richards

Mean +- std dev: 47.0 ms +- 0.8 ms -> 44.1 ms +- 0.5 ms: 1.06x faster
Significant (t=23.38)

richards_super

Mean +- std dev: 53.2 ms +- 1.1 ms -> 49.8 ms +- 1.0 ms: 1.07x faster
Significant (t=17.83)

scimark_fft

Mean +- std dev: 320 ms +- 4 ms -> 327 ms +- 5 ms: 1.02x slower
Significant (t=-8.71)

scimark_lu

Mean +- std dev: 108 ms +- 4 ms -> 107 ms +- 2 ms: 1.00x faster
Not significant

scimark_monte_carlo

Mean +- std dev: 61.9 ms +- 1.8 ms -> 63.3 ms +- 1.7 ms: 1.02x slower
Significant (t=-4.27)

scimark_sor

Mean +- std dev: 123 ms +- 3 ms -> 124 ms +- 3 ms: 1.01x slower
Not significant

scimark_sparse_mat_mult

Mean +- std dev: 4.08 ms +- 0.19 ms -> 4.85 ms +- 0.04 ms: 1.19x slower
Significant (t=-31.04)

spectral_norm

Mean +- std dev: 93.7 ms +- 0.5 ms -> 119.2 ms +- 1.7 ms: 1.27x slower
Significant (t=-112.63)

sqlglot_normalize

Mean +- std dev: 111 ms +- 1 ms -> 113 ms +- 1 ms: 1.02x slower
Not significant

sqlglot_optimize

Mean +- std dev: 55.0 ms +- 0.3 ms -> 56.5 ms +- 0.7 ms: 1.03x slower
Significant (t=-15.50)

sqlglot_parse

Mean +- std dev: 1.19 ms +- 0.01 ms -> 1.21 ms +- 0.02 ms: 1.02x slower
Significant (t=-9.11)

sqlglot_transpile

Mean +- std dev: 1.48 ms +- 0.02 ms -> 1.51 ms +- 0.02 ms: 1.02x slower
Significant (t=-11.45)

telco

Mean +- std dev: 8.00 ms +- 0.22 ms -> 7.99 ms +- 0.17 ms: 1.00x faster
Not significant

tomli_loads

Mean +- std dev: 2.00 sec +- 0.02 sec -> 2.05 sec +- 0.03 sec: 1.02x slower
Significant (t=-9.86)

tornado_http

Mean +- std dev: 126 ms +- 2 ms -> 124 ms +- 3 ms: 1.01x faster
Not significant

typing_runtime_protocols

Mean +- std dev: 117 us +- 2 us -> 117 us +- 2 us: 1.00x slower
Not significant

unpack_sequence

Mean +- std dev: 39.0 ns +- 0.2 ns -> 35.8 ns +- 0.4 ns: 1.09x faster
Significant (t=53.17)

unpickle

Mean +- std dev: 14.8 us +- 0.1 us -> 15.3 us +- 0.5 us: 1.03x slower
Significant (t=-6.83)

unpickle_list

Mean +- std dev: 4.53 us +- 0.04 us -> 4.72 us +- 0.04 us: 1.04x slower
Significant (t=-23.93)

unpickle_pure_python

Mean +- std dev: 208 us +- 2 us -> 224 us +- 8 us: 1.08x slower
Significant (t=-15.18)

xml_etree_generate

Mean +- std dev: 91.5 ms +- 1.0 ms -> 90.0 ms +- 1.3 ms: 1.02x faster
Not significant

xml_etree_iterparse

Mean +- std dev: 97.6 ms +- 1.6 ms -> 98.2 ms +- 1.0 ms: 1.01x slower
Not significant

xml_etree_parse

Mean +- std dev: 136 ms +- 2 ms -> 137 ms +- 2 ms: 1.01x slower
Not significant

xml_etree_process

Mean +- std dev: 60.6 ms +- 0.8 ms -> 60.3 ms +- 0.8 ms: 1.00x faster
Not significant

@Fidget-Spinner
Collaborator

Fidget-Spinner commented Jan 29, 2024

Stats for pyperf compare mainall.json jitall.json

Could you please rerun the command as pyperf compare_to mainall.json jitall.json -G --table --table-format=md? It will group the results by speedup/slowdown and produce a markdown table so things are easier to read. Thanks!

@JeffersGlass
Contributor Author

Gladly! That is surely easier to read. Results are collapsed below.

Looks like overall a ~5% slowdown, with some significant outliers. I would guess the faster ones involve some specific opcode pairs from the list above in a useful way, enough to overcome the overhead of the additional compilation steps/superinstruction lookup.

pyperf stats as table

Benchmarks with tag 'apps':

Benchmark mainall jitall
2to3 261 ms 274 ms: 1.05x slower
chameleon 6.70 ms 6.78 ms: 1.01x slower
docutils 2.45 sec 2.49 sec: 1.02x slower
html5lib 62.0 ms 63.3 ms: 1.02x slower
tornado_http 126 ms 124 ms: 1.01x faster
Geometric mean (ref) 1.02x slower

Benchmarks with tag 'asyncio':

Benchmark mainall jitall
async_tree_memoization 506 ms 512 ms: 1.01x slower
async_tree_none_tg 403 ms 408 ms: 1.01x slower
async_tree_eager_memoization_tg 200 ms 203 ms: 1.01x slower
async_tree_memoization_tg 507 ms 517 ms: 1.02x slower
async_tree_eager_memoization 257 ms 262 ms: 1.02x slower
async_tree_io 1.01 sec 1.04 sec: 1.02x slower
async_tree_io_tg 1.01 sec 1.04 sec: 1.02x slower
async_tree_none 396 ms 413 ms: 1.04x slower
async_tree_eager_tg 76.9 ms 80.4 ms: 1.05x slower
async_tree_eager 110 ms 121 ms: 1.10x slower
async_tree_cpu_io_mixed 657 ms 737 ms: 1.12x slower
async_tree_cpu_io_mixed_tg 662 ms 752 ms: 1.14x slower
async_tree_eager_cpu_io_mixed 411 ms 477 ms: 1.16x slower
async_tree_eager_cpu_io_mixed_tg 356 ms 417 ms: 1.17x slower
Geometric mean (ref) 1.06x slower

Benchmark hidden because not significant (2): async_tree_eager_io, async_tree_eager_io_tg

Benchmarks with tag 'math':

Benchmark mainall jitall
float 74.2 ms 75.3 ms: 1.01x slower
nbody 78.0 ms 84.0 ms: 1.08x slower
pidigits 170 ms 189 ms: 1.11x slower
Geometric mean (ref) 1.07x slower

Benchmarks with tag 'regex':

Benchmark mainall jitall
regex_v8 22.0 ms 21.3 ms: 1.03x faster
regex_effbot 2.80 ms 2.73 ms: 1.03x faster
regex_dna 154 ms 163 ms: 1.06x slower
regex_compile 127 ms 135 ms: 1.06x slower
Geometric mean (ref) 1.02x slower

Benchmarks with tag 'serialize':

Benchmark mainall jitall
pickle_dict 31.8 us 30.7 us: 1.03x faster
pickle_list 4.88 us 4.79 us: 1.02x faster
xml_etree_generate 91.5 ms 90.0 ms: 1.02x faster
xml_etree_iterparse 97.6 ms 98.2 ms: 1.01x slower
pickle_pure_python 279 us 281 us: 1.01x slower
xml_etree_parse 136 ms 137 ms: 1.01x slower
json_loads 25.8 us 26.2 us: 1.01x slower
json_dumps 10.2 ms 10.4 ms: 1.02x slower
tomli_loads 2.00 sec 2.05 sec: 1.02x slower
unpickle 14.8 us 15.3 us: 1.03x slower
unpickle_list 4.53 us 4.72 us: 1.04x slower
unpickle_pure_python 208 us 224 us: 1.08x slower
Geometric mean (ref) 1.01x slower

Benchmark hidden because not significant (2): xml_etree_process, pickle

Benchmarks with tag 'startup':

Benchmark mainall jitall
python_startup 11.5 ms 11.2 ms: 1.03x faster
Geometric mean (ref) 1.00x faster

Benchmark hidden because not significant (1): python_startup_no_site

Benchmarks with tag 'template':

Benchmark mainall jitall
genshi_text 22.3 ms 21.8 ms: 1.02x faster
genshi_xml 52.4 ms 52.1 ms: 1.01x faster
mako 10.2 ms 11.4 ms: 1.12x slower
Geometric mean (ref) 1.03x slower

All benchmarks:

Benchmark mainall jitall
generators 30.8 ms 26.2 ms: 1.18x faster
unpack_sequence 39.0 ns 35.8 ns: 1.09x faster
richards_super 53.2 ms 49.8 ms: 1.07x faster
richards 47.0 ms 44.1 ms: 1.06x faster
pickle_dict 31.8 us 30.7 us: 1.03x faster
regex_v8 22.0 ms 21.3 ms: 1.03x faster
mdp 2.91 sec 2.82 sec: 1.03x faster
regex_effbot 2.80 ms 2.73 ms: 1.03x faster
python_startup 11.5 ms 11.2 ms: 1.03x faster
genshi_text 22.3 ms 21.8 ms: 1.02x faster
pickle_list 4.88 us 4.79 us: 1.02x faster
xml_etree_generate 91.5 ms 90.0 ms: 1.02x faster
tornado_http 126 ms 124 ms: 1.01x faster
meteor_contest 96.9 ms 96.1 ms: 1.01x faster
genshi_xml 52.4 ms 52.1 ms: 1.01x faster
asyncio_websockets 435 ms 437 ms: 1.00x slower
deepcopy 337 us 339 us: 1.01x slower
xml_etree_iterparse 97.6 ms 98.2 ms: 1.01x slower
pickle_pure_python 279 us 281 us: 1.01x slower
async_tree_memoization 506 ms 512 ms: 1.01x slower
xml_etree_parse 136 ms 137 ms: 1.01x slower
async_tree_none_tg 403 ms 408 ms: 1.01x slower
chameleon 6.70 ms 6.78 ms: 1.01x slower
async_tree_eager_memoization_tg 200 ms 203 ms: 1.01x slower
asyncio_tcp_ssl 1.29 sec 1.31 sec: 1.01x slower
float 74.2 ms 75.3 ms: 1.01x slower
json_loads 25.8 us 26.2 us: 1.01x slower
asyncio_tcp 380 ms 385 ms: 1.01x slower
docutils 2.45 sec 2.49 sec: 1.02x slower
deepcopy_memo 33.9 us 34.5 us: 1.02x slower
json_dumps 10.2 ms 10.4 ms: 1.02x slower
deepcopy_reduce 3.09 us 3.15 us: 1.02x slower
sqlglot_normalize 111 ms 113 ms: 1.02x slower
async_tree_memoization_tg 507 ms 517 ms: 1.02x slower
async_generators 400 ms 408 ms: 1.02x slower
sqlglot_parse 1.19 ms 1.21 ms: 1.02x slower
dask 630 ms 643 ms: 1.02x slower
html5lib 62.0 ms 63.3 ms: 1.02x slower
async_tree_eager_memoization 257 ms 262 ms: 1.02x slower
scimark_monte_carlo 61.9 ms 63.3 ms: 1.02x slower
async_tree_io 1.01 sec 1.04 sec: 1.02x slower
scimark_fft 320 ms 327 ms: 1.02x slower
sqlglot_transpile 1.48 ms 1.51 ms: 1.02x slower
tomli_loads 2.00 sec 2.05 sec: 1.02x slower
dulwich_log 75.1 ms 76.9 ms: 1.02x slower
async_tree_io_tg 1.01 sec 1.04 sec: 1.02x slower
sqlglot_optimize 55.0 ms 56.5 ms: 1.03x slower
pathlib 20.3 ms 20.9 ms: 1.03x slower
logging_simple 6.33 us 6.53 us: 1.03x slower
unpickle 14.8 us 15.3 us: 1.03x slower
nqueens 84.9 ms 87.7 ms: 1.03x slower
logging_silent 97.1 ns 101 ns: 1.04x slower
logging_format 7.20 us 7.50 us: 1.04x slower
unpickle_list 4.53 us 4.72 us: 1.04x slower
async_tree_none 396 ms 413 ms: 1.04x slower
raytrace 242 ms 253 ms: 1.05x slower
async_tree_eager_tg 76.9 ms 80.4 ms: 1.05x slower
2to3 261 ms 274 ms: 1.05x slower
fannkuch 387 ms 406 ms: 1.05x slower
regex_dna 154 ms 163 ms: 1.06x slower
regex_compile 127 ms 135 ms: 1.06x slower
pprint_safe_repr 743 ms 794 ms: 1.07x slower
go 124 ms 133 ms: 1.07x slower
unpickle_pure_python 208 us 224 us: 1.08x slower
nbody 78.0 ms 84.0 ms: 1.08x slower
pyflate 403 ms 435 ms: 1.08x slower
pprint_pformat 1.52 sec 1.64 sec: 1.08x slower
comprehensions 15.9 us 17.3 us: 1.09x slower
deltablue 3.20 ms 3.49 ms: 1.09x slower
async_tree_eager 110 ms 121 ms: 1.10x slower
pidigits 170 ms 189 ms: 1.11x slower
mako 10.2 ms 11.4 ms: 1.12x slower
async_tree_cpu_io_mixed 657 ms 737 ms: 1.12x slower
gc_traversal 2.94 ms 3.30 ms: 1.12x slower
crypto_pyaes 64.8 ms 73.4 ms: 1.13x slower
async_tree_cpu_io_mixed_tg 662 ms 752 ms: 1.14x slower
async_tree_eager_cpu_io_mixed 411 ms 477 ms: 1.16x slower
async_tree_eager_cpu_io_mixed_tg 356 ms 417 ms: 1.17x slower
chaos 56.3 ms 65.9 ms: 1.17x slower
hexiom 5.79 ms 6.81 ms: 1.18x slower
scimark_sparse_mat_mult 4.08 ms 4.85 ms: 1.19x slower
spectral_norm 93.7 ms 119 ms: 1.27x slower
bench_mp_pool 10.5 ms 50.2 ms: 4.77x slower
Geometric mean (ref) 1.05x slower

Benchmark hidden because not significant (12): bench_thread_pool, xml_etree_process, pickle, scimark_lu, coroutines, telco, create_gc_cycles, typing_runtime_protocols, scimark_sor, async_tree_eager_io, async_tree_eager_io_tg, python_startup_no_site

@JeffersGlass
Contributor Author

JeffersGlass commented Jan 29, 2024

I took a preliminary pass at a tool for scoring sequences of UOps. Here are the results for the ~93 most common valid pairs from above, comparing the summed lengths of the _code_body sections of the individual UOps against the length of the _code_body when the pair is compiled into a single superinstruction:

UOp sequence scores from above (top 93 valid pairs)
UOps    Sum of _code_body lengths (individual ops)    Length of _code_body (compiled together)    Percentage (compiled / sum)
_TO_BOOL_BOOL / _GUARD_IS_TRUE_POP 149 104 69.8%
_ITER_CHECK_LIST / _GUARD_NOT_EXHAUSTED_LIST 149 104 69.8%
_ITER_CHECK_TUPLE / _GUARD_NOT_EXHAUSTED_TUPLE 149 104 69.8%
_CHECK_VALIDITY / _TO_BOOL_BOOL 142 101 71.13%
_CHECK_VALIDITY / _RESUME_CHECK 138 100 72.46%
_TO_BOOL_STR / _GUARD_IS_TRUE_POP 272 200 73.53%
_GUARD_GLOBALS_VERSION / _GUARD_BUILTINS_VERSION 190 141 74.21%
_CHECK_VALIDITY / _GUARD_IS_FALSE_POP 145 109 75.17%
_GUARD_IS_TRUE_POP / _GUARD_GLOBALS_VERSION 171 130 76.02%
_RESUME_CHECK / _GUARD_GLOBALS_VERSION 164 125 76.22%
_GUARD_TYPE_VERSION / _CHECK_MANAGED_OBJECT_HAS_VALUES 194 151 77.84%
_STORE_FAST / _STORE_FAST 208 163 78.37%
_LOAD_FAST / _LOAD_CONST_INLINE_BORROW 76 60 78.95%
_LOAD_CONST_INLINE_BORROW / _LOAD_FAST 76 60 78.95%
_GUARD_BUILTINS_VERSION / _LOAD_GLOBAL_BUILTINS 241 192 79.67%
_GUARD_GLOBALS_VERSION / _LOAD_GLOBAL_MODULE 241 192 79.67%
_GUARD_NOT_EXHAUSTED_TUPLE / _ITER_NEXT_TUPLE 129 104 80.62%
_GUARD_NOT_EXHAUSTED_LIST / _ITER_NEXT_LIST 132 107 81.06%
_CHECK_VALIDITY / _CHECK_FUNCTION_EXACT_ARGS 221 180 81.45%
_GUARD_IS_FALSE_POP / _LOAD_CONST_INLINE_BORROW 106 87 82.08%
_CHECK_MANAGED_OBJECT_HAS_VALUES / _LOAD_ATTR_INSTANCE_VALUE 336 276 82.14%
_LOAD_CONST_INLINE_BORROW / _SET_IP 73 60 82.19%
_SET_IP / _LOAD_CONST_INLINE_BORROW 73 60 82.19%
_LOAD_CONST_INLINE_BORROW / _POP_FRAME 141 116 82.27%
_PUSH_FRAME / _CHECK_VALIDITY 137 114 83.21%
_BINARY_OP_ADD_INT / _STORE_FAST 310 258 83.23%
_POP_TOP / _LOAD_FAST 122 102 83.61%
_ITER_NEXT_TUPLE / _STORE_FAST 157 132 84.08%
_ITER_NEXT_LIST / _STORE_FAST 160 135 84.38%
_GUARD_IS_FALSE_POP / _LOAD_FAST 122 103 84.43%
_GUARD_IS_TRUE_POP / _LOAD_FAST 122 103 84.43%
_STORE_FAST / _LOAD_CONST_INLINE_BORROW 134 114 85.07%
_LOAD_FAST / _SET_IP 89 76 85.39%
_CHECK_VALIDITY / _GUARD_IS_NONE_POP 203 174 85.71%
_LOAD_ATTR_METHOD_NO_DICT / _SET_IP 92 79 85.87%
_SAVE_RETURN_OFFSET / _PUSH_FRAME 95 82 86.32%
_CHECK_VALIDITY / _UNPACK_SEQUENCE_TWO_TUPLE 257 222 86.38%
_LOAD_FAST / _GUARD_TYPE_VERSION 125 108 86.4%
_SET_IP / _CHECK_VALIDITY 112 97 86.61%
_STORE_FAST / _LOAD_FAST 150 130 86.67%
_CHECK_FUNCTION_EXACT_ARGS / _CHECK_STACK_SPACE 279 242 86.74%
_CHECK_VALIDITY / _LOAD_CONST_INLINE_BORROW 99 86 86.87%
_GUARD_TYPE_VERSION / _LOAD_ATTR_SLOT 295 257 87.12%
_GUARD_TYPE_VERSION / _LOAD_ATTR_METHOD_NO_DICT 128 112 87.5%
_SET_IP / _GUARD_BOTH_INT 126 111 88.1%
_FOR_ITER_TIER_TWO / _CHECK_VALIDITY 317 280 88.33%
_SET_IP / _GUARD_TYPE_VERSION 122 108 88.52%
_CHECK_VALIDITY / _LOAD_FAST 115 102 88.7%
_RESUME_CHECK / _LOAD_FAST 115 102 88.7%
_STORE_SUBSCR_DICT / _CHECK_VALIDITY 297 264 88.89%
_LOAD_ATTR_METHOD_NO_DICT / _CHECK_VALIDITY 118 105 88.98%
_COMPARE_OP_STR / _CHECK_VALIDITY 343 307 89.5%
_CHECK_VALIDITY / _COMPARE_OP_FLOAT 391 350 89.51%
_BINARY_SUBSCR_DICT / _CHECK_VALIDITY 368 330 89.67%
_CALL_METHOD_DESCRIPTOR_FAST / _CHECK_VALIDITY 550 496 90.18%
_LOAD_FAST / _TO_BOOL_STR 242 219 90.5%
_ITER_NEXT_LIST / _UNPACK_SEQUENCE_TWO_TUPLE 244 221 90.57%
_LOAD_FAST / _GUARD_GLOBALS_VERSION 141 128 90.78%
_LOAD_FAST / _BINARY_SUBSCR_STR_INT 474 432 91.14%
_GUARD_BOTH_INT / _BINARY_OP_ADD_INT 289 264 91.35%
_POP_FRAME / _CHECK_VALIDITY 180 165 91.67%
_CHECK_VALIDITY / _CALL_ISINSTANCE 491 452 92.06%
_GUARD_IS_NONE_POP / _SET_IP 177 164 92.66%
_CALL_ISINSTANCE / _CHECK_VALIDITY 491 456 92.87%
_LOAD_GLOBAL_MODULE / _SET_IP 189 176 93.12%
_LOAD_GLOBAL_BUILTINS / _SET_IP 189 176 93.12%
_LOAD_ATTR_INSTANCE_VALUE / _STORE_FAST 325 303 93.23%
_CHECK_VALIDITY / _CALL_METHOD_DESCRIPTOR_FAST 550 513 93.27%
_SET_IP / _STORE_SUBSCR_DICT 271 253 93.36%
_CHECK_VALIDITY / _POP_TOP 145 136 93.79%
_SET_IP / _CHECK_FUNCTION_EXACT_ARGS 195 183 93.85%
_SET_IP / _COMPARE_OP_STR 317 299 94.32%
_CHECK_VALIDITY / _POP_FRAME 180 170 94.44%
_UNPACK_SEQUENCE_TWO_TUPLE / _STORE_FAST 292 276 94.52%
_SET_IP / _LOAD_ATTR 341 323 94.72%
_SET_IP / _BINARY_SUBSCR_DICT 342 324 94.74%
_SET_IP / _CONTAINS_OP 286 271 94.76%
_CHECK_VALIDITY / _STORE_FAST 173 164 94.8%
_SET_IP / _FOR_ITER_TIER_TWO 291 276 94.85%
_SET_IP / _CALL_METHOD_DESCRIPTOR_FAST 524 497 94.85%
_STORE_FAST / _GUARD_GLOBALS_VERSION 199 190 95.48%
_LOAD_ATTR / _SET_IP 341 326 95.6%
_BINARY_SUBSCR_DICT / _SET_IP 342 327 95.61%
_SET_IP / _COMPARE_OP_INT 411 393 95.62%
_CONTAINS_OP / _CHECK_VALIDITY 312 301 96.47%
_INIT_CALL_PY_EXACT_ARGS / _SAVE_RETURN_OFFSET 781 754 96.54%
_LOAD_ATTR / _CHECK_VALIDITY 367 355 96.73%
_SET_IP / _CALL_ISINSTANCE 465 452 97.2%
_CALL_METHOD_DESCRIPTOR_FAST / _SET_IP 524 511 97.52%
_CHECK_STACK_SPACE / _INIT_CALL_PY_EXACT_ARGS 881 867 98.41%
_LOAD_ATTR_SLOT / _SET_IP 259 255 98.46%
_LOAD_GLOBAL_BUILTINS / _LOAD_FAST 192 190 98.96%
_LOAD_GLOBAL_MODULE / _LOAD_FAST 192 190 98.96%

This list uses "length of _code_body" as a proxy for "speed", which isn't going to be exactly correct. I.e., I don't expect a condensed superinstruction that's 70% the length of the sum of its parts to be 70% faster. But it's a simple heuristic to start with.
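A minimal sketch of how the percentage column can be computed and used for ranking, assuming each row is just a pair name plus the two measured lengths. The `PairScore` class and its field names are made up for illustration and are not the actual tool; the three rows are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairScore:
    """_code_body lengths for one UOp pair (hypothetical helper type)."""
    pair: str
    separate_total: int  # sum of the two individually compiled stencils
    fused: int           # the pair compiled as a single stencil

    @property
    def ratio(self) -> float:
        """Fused length as a fraction of the separate length (lower is better)."""
        return self.fused / self.separate_total


# Three rows copied from the table above.
scores = [
    PairScore("_LOAD_GLOBAL_MODULE / _LOAD_FAST", 192, 190),
    PairScore("_TO_BOOL_BOOL / _GUARD_IS_TRUE_POP", 149, 104),
    PairScore("_STORE_FAST / _STORE_FAST", 208, 163),
]

# Rank the most condensable pairs first.
ranked = sorted(scores, key=lambda s: s.ratio)
for s in ranked:
    print(f"{s.pair}: {s.ratio:.2%}")
```

Sorting on the ratio alone ignores how often a pair actually runs, so a fuller score would probably weight it by the pair counts from pystats.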

This is all on X86_64 Linux, using my overly-dynamic-template.c from the above example. With a different template.c, these raw numbers could be slightly different, but I think the relative effects would be similar.

It's fun that the top pair there, _TO_BOOL_BOOL / _GUARD_IS_TRUE_POP, is something that Brandt had called out as actually being a good candidate for being condensable. The other top candidates in this list - _ITER_CHECK_X / _GUARD_NOT_EXHAUSTED_X - also make sense.

I will get that tool tidied up as well - now that the main JIT branch has merged into main, that's a relatively straightforward PR, if having the tool in main would be generally useful. If not, I can publish it as a separate tool.

Also, I know I've gone on a bit of a tear here this week. If this is more spamming than useful, I'm happy to move these ideas/projects elsewhere.

@Fidget-Spinner
Collaborator

Fidget-Spinner commented Jan 30, 2024

Wait I just realised if you were benchmarking main that is the cause of the slowdown -- main did not have the JIT merged yet when you benchmarked it. You need to benchmark main+JIT vs your-branch+JIT.

PS: main just got JIT merged in and it is currently faster without JIT than with JIT, due to micro operations overhead. So that explains your 5% slowdown. Actually if it's only 5% that's a huge improvement. It should be 7-10% slower. Which means your change might have caused a 2% speedup over the current JIT!
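As a rough check of that arithmetic, assuming both slowdown figures are measured against the same non-JIT build of main, the relative speed of the branch over main+JIT is just the ratio of the two slowdowns. The function name here is made up for illustration.

```python
def relative_speedup(baseline_slowdown: float, branch_slowdown: float) -> float:
    """Speed of the branch relative to the baseline, when both slowdowns
    are measured against the same reference build."""
    return baseline_slowdown / branch_slowdown


# 1.05x is the geometric mean reported above; 1.07x and 1.10x are the two
# ends of the "7-10% slower" estimate for main with the JIT enabled.
for baseline in (1.07, 1.10):
    gain = relative_speedup(baseline, 1.05) - 1.0
    print(f"main+JIT at {baseline:.2f}x slower -> branch is {gain:.1%} faster")
```

This only bounds the estimate; a direct main+JIT vs. branch+JIT run is still the real measurement.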

@mdboom
Contributor

mdboom commented Jan 31, 2024

I maintain the benchmarking infrastructure for the team, so happy to answer any questions related to pyperformance etc.

@JeffersGlass wrote:

Is there a simple way to run pyperformance locally with pystats enabled?

If you have a build with --enable-pystats configured, pyperformance will automatically collect stats, and it's smart enough to exclude stats from the pyperformance/pyperf harness itself. I just do:

rm /tmp/py_stats/*
pyperformance run --python cpython/python {...any other args you are passing to pyperformance...}
Tools/scripts/summarize_stats.py

I realize I don't know how to build/run the Tier 2 interpreter without the JIT. I found the -X uops arg... is there a flag to pass to configure as well?

There's no configure flag in this case, but you need a non-JIT build. The easiest way to run this with pyperformance is with the PYTHON_UOPS=1 environment variable and telling pyperformance to pass it to its child processes with the --inherit-environ flag.

PYTHON_UOPS=1 pyperformance run --inherit-environ PYTHON_UOPS

Also, if you ever want to have a branch run on the official infrastructure, just ping a member of the Microsoft team on Discord -- it's very easy, but unfortunately we can't make it "self-serve on the open web" for security reasons.

@JeffersGlass
Contributor Author

Thank you @mdboom for the information, it's been very helpful. I will gladly take you and the team up on running some benchmarks on official infrastructure once things stabilize a bit. I'd be curious how much difference running with, say, the top ~1000 most-promising opcode sequences makes.

Apologies for not responding sooner - as I mentioned to Brandt, I'm moving house at the moment, and it's significantly cutting into my spare time to dig into this.

That said, I am still working on getting pairs/triples/sequences of UOp counts into PyStats. I have a branch (uop-sequence-count) that successfully achieves this (for the Non-JIT tier 2 interpreter only, for now). Only two things remain before I submit it as a PR - making the maximum sequence length adjustable by an environment variable, and re-adding functionality to summarize_stats.py to display those stats in a clean way. I hope to get those worked out this week.

@markshannon
Member

Thanks for working on this. Having stats for tier 2 pairs would be really useful.
I suspect that triples might be a bit too bulky, and chains too slow, but feel free to prove me wrong.

I don't think we want to implement superinstructions yet, as it will complicate register allocation and generating multiple stencils for instructions like LOAD_FAST where we want to fully inline the oparg.

When we do want superinstructions, in the not too far future, having the stats will help a lot.

@JeffersGlass
Contributor Author

Thanks @markshannon, that PR is now live. It includes the ability to track pairs, triples or sequences of any length, but it defaults to only counting pairs.
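For anyone following along, here is a minimal Python sketch of the counting idea, not the C code in the PR: slide a window over the executed UOps and count every run of 2 up to a configurable maximum length. The `count_sequences` name, the `max_length` parameter, and the trace are all made up for illustration.

```python
from collections import Counter
from typing import Sequence


def count_sequences(trace: Sequence[str], max_length: int = 2) -> Counter:
    """Count every run of 2..max_length consecutive UOps in a trace."""
    counts: Counter = Counter()
    for length in range(2, max_length + 1):
        for start in range(len(trace) - length + 1):
            counts[tuple(trace[start:start + length])] += 1
    return counts


# A made-up trace, purely for illustration.
trace = ["_LOAD_FAST", "_LOAD_FAST", "_SET_IP", "_LOAD_FAST", "_LOAD_FAST"]

pairs = count_sequences(trace)                        # default: pairs only
up_to_triples = count_sequences(trace, max_length=3)  # pairs and triples

for sequence, count in pairs.most_common():
    print(" ".join(sequence), count)
```

The number of distinct keys can grow quickly with the maximum length, which is one reason to keep pairs as the default.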

10-4 on holding off on implementing superinstructions for a bit. With the ability to collect stats, I'd like to continue to play around and see what results from adding longer chains of superinstructions, and their consequences for performance/size. I'll post results here as they come along.

@brandtbucher
Member

Another thing I'd be interested in seeing (and that we may be able to incorporate sooner) is common superinstructions that can be formed without changing the existing instruction format. Meaning, there is at most one each of oparg, target, and operand used by all of the parts combined.

For example, _TO_BOOL_BOOL/_GUARD_IS_TRUE_POP wouldn't work, because there are two targets. But _LOAD_FAST / _LOAD_CONST_INLINE_BORROW would, since one half uses an oparg and the other half uses an operand (and neither uses target). So a combined opcode could be added without changing the instruction format.

(A related, but hairier, question would be identifying pairs with at most one unique value for each member. So _CHECK_VALIDITY / _TO_BOOL_BOOL could work, but only if both halves share the same target).
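A small sketch of the first (non-hairy) check, assuming we already know which of the three fields each uop reads. The `USES` table and the `is_format_compatible` name are hypothetical; only the four entries below come from the examples above, and a real tool would derive the table from the generated uop metadata.

```python
# Which instruction-format fields each uop reads. These four entries are
# taken from the examples above; everything else would need to be filled in.
USES = {
    "_LOAD_FAST": {"oparg"},
    "_LOAD_CONST_INLINE_BORROW": {"operand"},
    "_TO_BOOL_BOOL": {"target"},
    "_GUARD_IS_TRUE_POP": {"target"},
}


def is_format_compatible(*uops: str) -> bool:
    """True if no field is needed by more than one part of the sequence."""
    seen: set[str] = set()
    for uop in uops:
        used = USES[uop]
        if seen & used:
            return False
        seen |= used
    return True


print(is_format_compatible("_LOAD_FAST", "_LOAD_CONST_INLINE_BORROW"))
print(is_format_compatible("_TO_BOOL_BOOL", "_GUARD_IS_TRUE_POP"))
```

The shared-value variant would additionally need the actual field values from a trace, so it can't be decided from this static table alone.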

@JeffersGlass
Contributor Author

JeffersGlass commented Feb 20, 2024

I've been doing a little work around this idea:

Common superinstructions that can be formed without changing the existing instruction format. Meaning, there is at most one each of oparg, target, and operand used by all of the parts combined.

Here's what the output could look like, if I understand the requirements correctly (which I may not):

Pair counts for top 100 uop pairs
Pair Count Self Cumulative Oparg/Operand/Target Overlap
_LOAD_FAST _LOAD_FAST 4,407,701,606 5.2% 5.2% Oparg
_LOAD_FAST _SET_IP 3,705,075,653 4.4% 9.6% No Overlap
_LOAD_CONST_INLINE_BORROW _SET_IP 3,699,847,452 4.4% 13.9% Operand
_LOAD_FAST _LOAD_CONST_INLINE_BORROW 3,556,388,194 4.2% 18.1% No Overlap
_STORE_FAST _LOAD_FAST 2,970,845,320 3.5% 21.6% Oparg
_GUARD_IS_FALSE_POP _LOAD_FAST 2,418,045,803 2.8% 24.4% No Overlap
_CHECK_VALIDITY _GUARD_IS_FALSE_POP 2,415,103,501 2.8% 27.3% Target
_SET_IP _GUARD_BOTH_INT 1,918,954,490 2.3% 29.5% No Overlap
_CHECK_VALIDITY _LOAD_FAST 1,806,164,629 2.1% 31.7% No Overlap
_GUARD_BOTH_INT _BINARY_OP_ADD_INT 1,573,924,331 1.9% 33.5% No Overlap
_COMPARE_OP_STR _CHECK_VALIDITY 1,345,464,263 1.6% 35.1% No Overlap
_SET_IP _COMPARE_OP_STR 1,345,143,263 1.6% 36.7% No Overlap
_LOAD_FAST _GUARD_TYPE_VERSION 1,200,831,945 1.4% 38.1% No Overlap
_CONTAINS_OP _CHECK_VALIDITY 1,152,094,025 1.4% 39.5% No Overlap
_SET_IP _CONTAINS_OP 1,124,654,020 1.3% 40.8% No Overlap
_CHECK_VALIDITY _STORE_FAST 1,007,789,705 1.2% 42.0% No Overlap
_SET_IP _GUARD_TYPE_VERSION 969,652,291 1.1% 43.1% Operand
_BINARY_OP_ADD_INT _STORE_FAST 926,230,700 1.1% 44.2% No Overlap
_SET_IP _CHECK_VALIDITY 923,223,016 1.1% 45.3% No Overlap
_JUMP_TO_TOP _LOAD_FAST 900,085,718 1.1% 46.3% No Overlap
_LOAD_FAST _BINARY_SUBSCR_STR_INT 878,363,940 1.0% 47.4% No Overlap
_ITER_CHECK_LIST _GUARD_NOT_EXHAUSTED_LIST 779,199,392 0.9% 48.3% Target
_STORE_FAST _JUMP_TO_TOP 753,504,980 0.9% 49.2% No Overlap
_GUARD_TYPE_VERSION _CHECK_MANAGED_OBJECT_HAS_VALUES 726,049,155 0.9% 50.0% Target
_CHECK_MANAGED_OBJECT_HAS_VALUES _LOAD_ATTR_INSTANCE_VALUE 726,049,155 0.9% 50.9% Target
_BINARY_SUBSCR_STR_INT _STORE_FAST 678,699,540 0.8% 51.7% No Overlap
_LOAD_FAST _GUARD_BOTH_FLOAT 673,582,800 0.8% 52.5% No Overlap
_STORE_FAST _STORE_FAST 638,768,869 0.8% 53.2% Oparg
_CHECK_FUNCTION_EXACT_ARGS _CHECK_STACK_SPACE 630,682,238 0.7% 54.0% Oparg,Target
_CHECK_STACK_SPACE _INIT_CALL_PY_EXACT_ARGS 630,682,238 0.7% 54.7% Oparg
_SAVE_RETURN_OFFSET _PUSH_FRAME 630,682,238 0.7% 55.5% No Overlap
_INIT_CALL_PY_EXACT_ARGS _SAVE_RETURN_OFFSET 630,682,238 0.7% 56.2% Oparg
_BINARY_SUBSCR _CHECK_VALIDITY 625,930,060 0.7% 56.9% No Overlap
_GUARD_NOT_EXHAUSTED_LIST _ITER_NEXT_LIST 620,218,192 0.7% 57.7% No Overlap
_GUARD_BOTH_FLOAT _BINARY_OP_MULTIPLY_FLOAT 583,876,920 0.7% 58.4% No Overlap
_LOAD_CONST_INLINE_WITH_NULL _LOAD_FAST 580,613,866 0.7% 59.0% No Overlap
_PUSH_FRAME _CHECK_VALIDITY 560,185,027 0.7% 59.7% No Overlap
_CHECK_VALIDITY _RESUME_CHECK 552,221,449 0.7% 60.4% Target
_SET_IP _BINARY_SUBSCR 542,547,460 0.6% 61.0% No Overlap
_GUARD_TYPE_VERSION _GUARD_DORV_VALUES_INST_ATTR_FROM_DICT 503,214,425 0.6% 61.6% Target
_GUARD_DORV_VALUES_INST_ATTR_FROM_DICT _GUARD_KEYS_VERSION 503,197,385 0.6% 62.2% Target
_SET_IP _CHECK_FUNCTION_EXACT_ARGS 481,795,536 0.6% 62.7% Operand
_ITER_CHECK_RANGE _GUARD_NOT_EXHAUSTED_RANGE 476,451,707 0.6% 63.3% Target
_GUARD_KEYS_VERSION _LOAD_ATTR_METHOD_WITH_VALUES 466,355,977 0.5% 63.9% Operand
_SET_IP _ITER_CHECK_RANGE 464,074,716 0.5% 64.4% No Overlap
_GUARD_NOT_EXHAUSTED_RANGE _ITER_NEXT_RANGE 447,347,483 0.5% 64.9% No Overlap
_ITER_NEXT_RANGE _CHECK_VALIDITY 446,621,723 0.5% 65.5% No Overlap
_LOAD_ATTR_METHOD_WITH_VALUES _CHECK_VALIDITY 427,919,622 0.5% 66.0% No Overlap
_CHECK_VALIDITY _GUARD_IS_TRUE_POP 427,035,067 0.5% 66.5% Target
_GUARD_TYPE_VERSION _LOAD_ATTR_SLOT 402,486,129 0.5% 66.9% Operand,Target
_TO_BOOL_BOOL _GUARD_IS_FALSE_POP 389,785,090 0.5% 67.4% Target
_ITER_NEXT_LIST _STORE_FAST 388,388,340 0.5% 67.9% No Overlap
_BINARY_OP_ADD_INT _SET_IP 385,864,500 0.5% 68.3% No Overlap
_SET_IP _BINARY_OP 355,010,336 0.4% 68.7% No Overlap
_GUARD_TYPE_VERSION _LOAD_ATTR_METHOD_NO_DICT 352,033,430 0.4% 69.1% Operand
_RESUME_CHECK _LOAD_FAST 334,378,537 0.4% 69.5% No Overlap
_UNPACK_SEQUENCE_TWO_TUPLE _STORE_FAST 325,547,809 0.4% 69.9% Oparg
_CHECK_GLOBALS _CHECK_BUILTINS 293,902,194 0.3% 70.3% Operand,Target
_SET_IP _COMPARE_OP_INT 291,179,987 0.3% 70.6% No Overlap
_ITER_CHECK_TUPLE _GUARD_NOT_EXHAUSTED_TUPLE 283,729,558 0.3% 70.9% Target
_COMPARE_OP_INT _CHECK_VALIDITY 283,685,807 0.3% 71.3% Target
_TO_BOOL_BOOL _GUARD_IS_TRUE_POP 282,185,546 0.3% 71.6% Target
_JUMP_TO_TOP _SET_IP 281,640,840 0.3% 71.9% No Overlap
_SET_IP _LOAD_DEREF 271,925,705 0.3% 72.3% No Overlap
_LOAD_ATTR_INSTANCE_VALUE _SET_IP 271,586,954 0.3% 72.6% Operand
_GUARD_IS_TRUE_POP _LOAD_FAST 266,152,219 0.3% 72.9% No Overlap
_GUARD_BOTH_FLOAT _BINARY_OP_ADD_FLOAT 264,218,020 0.3% 73.2% No Overlap
_BINARY_OP_MULTIPLY_FLOAT _GUARD_BOTH_FLOAT 263,269,040 0.3% 73.5% No Overlap
_LOAD_CONST_INLINE_BORROW _LOAD_CONST_INLINE_BORROW 262,087,280 0.3% 73.8% Operand
_LOAD_DEREF _CHECK_VALIDITY 262,007,699 0.3% 74.1% No Overlap
_SET_IP _CALL_BUILTIN_FAST 261,271,024 0.3% 74.4% No Overlap
_CALL_BUILTIN_FAST _CHECK_VALIDITY 260,762,504 0.3% 74.7% Target
_GUARD_IS_TRUE_POP _JUMP_TO_TOP 255,131,549 0.3% 75.0% No Overlap
_STORE_FAST _SET_IP 253,836,446 0.3% 75.3% No Overlap
_CHECK_VALIDITY _TO_BOOL_BOOL 252,407,392 0.3% 75.6% Target
_LOAD_CONST_INLINE _SET_IP 244,348,904 0.3% 75.9% Operand
_SWAP _SET_IP 238,790,790 0.3% 76.2% No Overlap
_LOAD_ATTR_SLOT _SET_IP 237,960,883 0.3% 76.5% Operand
_CHECK_VALIDITY _LOAD_CONST_INLINE_BORROW 237,291,046 0.3% 76.8% No Overlap
_LOAD_CONST_INLINE_BORROW _LOAD_FAST 234,078,060 0.3% 77.0% No Overlap
_ITER_NEXT_LIST _UNPACK_SEQUENCE_TWO_TUPLE 227,025,132 0.3% 77.3% No Overlap
_SET_IP _LOAD_ATTR 223,273,337 0.3% 77.6% No Overlap
_PUSH_NULL _LOAD_FAST 221,336,879 0.3% 77.8% No Overlap
_STORE_SUBSCR_LIST_INT _CHECK_VALIDITY 220,401,840 0.3% 78.1% Target
_SET_IP _STORE_SUBSCR_LIST_INT 220,401,840 0.3% 78.4% No Overlap
_COPY _COPY 215,498,100 0.3% 78.6% Oparg
_SWAP _SWAP 215,498,100 0.3% 78.9% Oparg
_LOAD_ATTR_METHOD_NO_DICT _SET_IP 211,443,420 0.2% 79.1% Operand
_GUARD_BOTH_INT _BINARY_OP_SUBTRACT_INT 210,751,793 0.2% 79.4% No Overlap
_LOAD_ATTR_INSTANCE_VALUE _LOAD_FAST 203,211,931 0.2% 79.6% Oparg
_CHECK_VALIDITY _EXIT_TRACE 202,589,508 0.2% 79.8% Target
_BINARY_SUBSCR_STR_INT _LOAD_FAST 199,431,900 0.2% 80.1% No Overlap
_LOAD_FAST _LOAD_CONST_INLINE 197,928,387 0.2% 80.3% No Overlap
_SET_IP _FOR_ITER_TIER_TWO 193,904,331 0.2% 80.5% No Overlap
_GUARD_BOTH_FLOAT _BINARY_OP_SUBTRACT_FLOAT 189,087,717 0.2% 80.8% No Overlap
_STORE_SUBSCR _CHECK_VALIDITY 189,007,080 0.2% 81.0% No Overlap
_SET_IP _STORE_SUBSCR 189,007,080 0.2% 81.2% No Overlap
_GUARD_NOT_EXHAUSTED_TUPLE _ITER_NEXT_TUPLE 187,387,819 0.2% 81.4% No Overlap
_ITER_NEXT_TUPLE _STORE_FAST 186,951,639 0.2% 81.6% No Overlap
_LOAD_ATTR _CHECK_VALIDITY 184,252,330 0.2% 81.9% No Overlap

The conditions I think I understand for whether each UOp uses each of the three input kinds are:

  • oparg can be detected using the HAS_OPARG flag already present in the metadata
  • operand can be detected using the same logic in the Tier2 generator, which I'd extract to a new HAS_OPERAND flag
  • target can be detected if the uop has the HAS_JUMP, HAS_EXIT, or HAS_DEOPT flag... yes? This is the one I'm least confident of my understanding on.
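Roughly, the three rules above could be sketched like this. The `UopFlag` enum is a stand-in for the real metadata flags, `HAS_OPERAND` is the proposed flag that doesn't exist yet, and the target rule is the one I'm least sure of, so treat it as an assumption.

```python
from enum import Flag, auto


class UopFlag(Flag):
    """Stand-ins for the metadata flags named in the list above."""
    NONE = 0
    HAS_OPARG = auto()
    HAS_OPERAND = auto()  # proposed flag, does not exist yet
    HAS_JUMP = auto()
    HAS_EXIT = auto()
    HAS_DEOPT = auto()


# Assumption: any of these three flags implies the uop uses `target`.
TARGET_FLAGS = UopFlag.HAS_JUMP | UopFlag.HAS_EXIT | UopFlag.HAS_DEOPT


def input_kinds(flags: UopFlag) -> set[str]:
    """Map a uop's flags to the input kinds it uses."""
    kinds = set()
    if flags & UopFlag.HAS_OPARG:
        kinds.add("Oparg")
    if flags & UopFlag.HAS_OPERAND:
        kinds.add("Operand")
    if flags & TARGET_FLAGS:
        kinds.add("Target")
    return kinds


def overlap_label(first: UopFlag, second: UopFlag) -> str:
    """Label for the overlap column: shared kinds, or "No Overlap"."""
    shared = input_kinds(first) & input_kinds(second)
    return ",".join(sorted(shared)) or "No Overlap"


# Illustrative flag combinations, not the flags of any particular uop.
print(overlap_label(UopFlag.HAS_OPARG | UopFlag.HAS_DEOPT,
                    UopFlag.HAS_OPARG | UopFlag.HAS_EXIT))
```

A pair labelled "No Overlap" would be a candidate for a combined opcode in the existing instruction format.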

@Fidget-Spinner
Collaborator

Fidget-Spinner commented Feb 24, 2024

I propose that we automatically generate templates for all permutations within a single macro of uops as well.

Here's an example:

        macro(CALL_PY_EXACT_ARGS) =
            unused/1 + // Skip over the counter
            _CHECK_PEP_523 +
            _CHECK_FUNCTION_EXACT_ARGS +
            _CHECK_STACK_SPACE +
            _INIT_CALL_PY_EXACT_ARGS +
            _SAVE_RETURN_OFFSET +
            _PUSH_FRAME;

Some stuff can be eliminated by guard elimination. A simple heuristic would be

  1. if A --> B and B --> C
  2. AND A, B, C all part of the same macroinstruction,

then we should fuse them and make the super instructions A + B + C, B + C, A + B.

The second condition is crucial -- because we know this chain is not speculative, it is guaranteed to occur.
The heuristic would automatically generate the super instruction for the above macro.
_CHECK_PEP_523 + _CHECK_FUNCTION_EXACT_ARGS + _CHECK_STACK_SPACE + _INIT_CALL_PY_EXACT_ARGS + _SAVE_RETURN_OFFSET + _PUSH_FRAME and all other valid permutation chains.
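A rough sketch of one reading of the heuristic, assuming "A --> B" means "B directly follows A in the macro definition", so the candidates are the contiguous runs of two or more uops. The `chains` helper is made up for illustration, and guard elimination is ignored here.

```python
def chains(macro_uops: list[str], min_length: int = 2) -> list[tuple[str, ...]]:
    """Every contiguous run of at least min_length uops, in macro order."""
    n = len(macro_uops)
    return [
        tuple(macro_uops[start:end])
        for start in range(n)
        for end in range(start + min_length, n + 1)
    ]


# The uops of the macro shown above (the unused/1 cache entry is skipped).
CALL_PY_EXACT_ARGS = [
    "_CHECK_PEP_523",
    "_CHECK_FUNCTION_EXACT_ARGS",
    "_CHECK_STACK_SPACE",
    "_INIT_CALL_PY_EXACT_ARGS",
    "_SAVE_RETURN_OFFSET",
    "_PUSH_FRAME",
]

for chain in chains(CALL_PY_EXACT_ARGS):
    print(" + ".join(chain))
```

For a three-uop macro A, B, C this yields A + B, A + B + C and B + C, matching the example above.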

From the table above, there is indeed a commonly occurring permutation: _CHECK_FUNCTION_EXACT_ARGS + _CHECK_STACK_SPACE + _INIT_CALL_PY_EXACT_ARGS + _SAVE_RETURN_OFFSET + _PUSH_FRAME

_CHECK_FUNCTION_EXACT_ARGS _CHECK_STACK_SPACE	630,682,238	0.7%	54.0%	Oparg,Target
_CHECK_STACK_SPACE _INIT_CALL_PY_EXACT_ARGS	630,682,238	0.7%	54.7%	Oparg
_SAVE_RETURN_OFFSET _PUSH_FRAME	630,682,238	0.7%	55.5%	No Overlap
_INIT_CALL_PY_EXACT_ARGS _SAVE_RETURN_OFFSET	630,682,238	0.7%	56.2%	Oparg

To find the longest chain at runtime, we can automatically generate another abstract interpreter from the macro definition that finds longest matching chains on the first occurrence of an instruction it sees, and replaces them. I can work on this part if y'all are keen on the idea.
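To make the idea concrete, here is a toy greedy matcher in Python. It is not the generated abstract interpreter described above, just the matching rule: scan the trace left to right and, at each position, take the longest known chain that matches. The `fuse_longest` name and the trace are hypothetical.

```python
def fuse_longest(trace: list[str], known: set[tuple[str, ...]]) -> list[tuple[str, ...]]:
    """Greedy left-to-right pass: at each position take the longest known chain."""
    longest = max((len(chain) for chain in known), default=1)
    out = []
    i = 0
    while i < len(trace):
        # Try the longest possible window first, down to pairs.
        for length in range(min(longest, len(trace) - i), 1, -1):
            candidate = tuple(trace[i:i + length])
            if candidate in known:
                out.append(candidate)
                i += length
                break
        else:
            # No chain starts here; keep the uop on its own.
            out.append((trace[i],))
            i += 1
    return out


known = {
    ("_CHECK_FUNCTION_EXACT_ARGS", "_CHECK_STACK_SPACE"),
    ("_SAVE_RETURN_OFFSET", "_PUSH_FRAME"),
    ("_CHECK_STACK_SPACE", "_INIT_CALL_PY_EXACT_ARGS",
     "_SAVE_RETURN_OFFSET", "_PUSH_FRAME"),
}
trace = ["_SET_IP", "_CHECK_STACK_SPACE", "_INIT_CALL_PY_EXACT_ARGS",
         "_SAVE_RETURN_OFFSET", "_PUSH_FRAME", "_CHECK_VALIDITY"]

for part in fuse_longest(trace, known):
    print(" + ".join(part))
```

Greedy matching is not guaranteed to give the best overall cover of a trace, but it is simple and runs in a single pass.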

Reasoning:
Uops are a great IR for the tier 2 optimizer and analysis, but not ideal for optimized machine code generation. They mainly have two negatives right now:

  1. Smaller region of code means LLVM can optimize less. The sum of a chain of uops together is greater than the equivalent macro instruction.
  2. More dispatch overhead.

Currently the optimizer doesn't optimize all instructions. Just as a stat for illustrative purposes, only 30% of _BINARY_OP_ADD_INT is eliminated. We should regain their original performance by combining them back to their equivalent macro instruction if possible.

upper bound
The maximum number of permutations a sequence of uops can have, allowing only sequences of length >= 2 that keep the uops in valid order, is
$$\sum_{n=2}^{k} 2^{n-2}$$
where k is the number of uops in that macro instruction.

The macro with the highest uop count is CALL_BOUND_METHOD_EXACT_ARGS with 8 instructions. By the formula above, that means 127 possible permutations. That's IMO pretty acceptable.
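For checking the bound against other macros, the sum can be evaluated directly. It is a geometric series, so it should collapse to 2^(k-1) - 1; the helper name below is made up for illustration.

```python
def upper_bound(k: int) -> int:
    """Evaluate sum_{n=2}^{k} 2^(n-2) term by term."""
    return sum(2 ** (n - 2) for n in range(2, k + 1))


# The sum is a geometric series, so it should equal 2^(k-1) - 1.
for k in range(2, 12):
    assert upper_bound(k) == 2 ** (k - 1) - 1
    print(k, upper_bound(k))
```

The bound roughly doubles with each extra uop, so it stays small only because macros are short.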

@JeffersGlass
Contributor Author

Here's some updated pair counts after rebasing from main. Also some performance data, exploring which subsets of superinstructions might be most valuable. For the moment, this incorporates both "format compatible" superinstructions that don't have overlapping oparg/operand/target, as well as incompatible ones that have overlap.

These are the top 500 Uop pairs when running pyperformance, as of where main was on Friday evening:

Pair counts for top 500 uop pairs

Pairs of specialized operations that deoptimize and are then followed by
the corresponding unspecialized instruction are not counted as pairs.

Pair Count Self Cumulative
_LOAD_CONST_INLINE_BORROW _SET_IP 4,918,363,112 3.9% 3.9%
_CHECK_VALIDITY _GUARD_IS_FALSE_POP 2,969,434,496 2.4% 6.3%
_START_EXECUTOR _CHECK_VALIDITY 2,264,639,698 1.8% 8.1%
_SET_IP _GUARD_BOTH_INT 2,191,103,868 1.7% 9.8%
_LOAD_FAST_0 _GUARD_TYPE_VERSION 1,459,822,326 1.2% 11.0%
_GUARD_BOTH_INT _BINARY_OP_ADD_INT 1,423,704,735 1.1% 12.1%
_JUMP_TO_TOP _CHECK_VALIDITY 1,420,911,788 1.1% 13.3%
_GUARD_IS_FALSE_POP _LOAD_FAST_7 1,399,351,760 1.1% 14.4%
_COMPARE_OP_STR _CHECK_VALIDITY 1,355,624,988 1.1% 15.4%
_LOAD_FAST_7 _LOAD_CONST_INLINE_BORROW 1,353,346,729 1.1% 16.5%
_CONTAINS_OP _CHECK_VALIDITY 1,337,942,123 1.1% 17.6%
_SET_IP _GUARD_TYPE_VERSION 1,335,297,273 1.1% 18.6%
_GUARD_TYPE_VERSION _CHECK_MANAGED_OBJECT_HAS_VALUES 1,304,807,114 1.0% 19.7%
_CHECK_MANAGED_OBJECT_HAS_VALUES _LOAD_ATTR_INSTANCE_VALUE_0 1,300,440,674 1.0% 20.7%
_SET_IP _CONTAINS_OP 1,285,270,275 1.0% 21.7%
_LOAD_FAST_1 _SET_IP 1,216,476,888 1.0% 22.7%
_CHECK_VALIDITY _LOAD_FAST_0 1,093,152,233 0.9% 23.6%
_LOAD_FAST_3 _SET_IP 1,014,034,280 0.8% 24.4%
_LOAD_FAST_0 _LOAD_FAST_1 988,738,960 0.8% 25.2%
_CHECK_VALIDITY _LOAD_FAST_1 952,849,438 0.8% 25.9%
_CHECK_VALIDITY _ITER_CHECK_LIST 917,814,385 0.7% 26.7%
_LOAD_FAST_1 _LOAD_CONST_INLINE_BORROW 904,887,014 0.7% 27.4%
_LOAD_FAST_1 _BINARY_SUBSCR_STR_INT 881,863,700 0.7% 28.1%
_ITER_CHECK_LIST _GUARD_NOT_EXHAUSTED_LIST 874,993,105 0.7% 28.8%
_BINARY_OP_ADD_INT _STORE_FAST_1 857,669,140 0.7% 29.5%
_BINARY_SUBSCR _CHECK_VALIDITY 848,787,380 0.7% 30.1%
_LOAD_FAST_0 _SET_IP 846,004,624 0.7% 30.8%
_TO_BOOL_BOOL _GUARD_IS_FALSE_POP 830,506,949 0.7% 31.5%
_CHECK_FUNCTION_EXACT_ARGS _CHECK_STACK_SPACE 822,036,631 0.7% 32.1%
_SAVE_RETURN_OFFSET _PUSH_FRAME 822,020,431 0.7% 32.8%
_PUSH_FRAME _CHECK_VALIDITY 821,630,772 0.7% 33.4%
_CHECK_VALIDITY _RESUME_CHECK 813,263,876 0.6% 34.1%
_STORE_FAST _STORE_FAST 810,767,040 0.6% 34.7%
_CHECK_VALIDITY _SET_IP 807,454,833 0.6% 35.4%
_GUARD_IS_FALSE_POP _LOAD_FAST_1 746,213,560 0.6% 36.0%
_CHECK_VALIDITY _TO_BOOL_BOOL 727,644,702 0.6% 36.5%
_GUARD_NOT_EXHAUSTED_LIST _ITER_NEXT_LIST 727,407,997 0.6% 37.1%
_CALL_BUILTIN_FAST _CHECK_VALIDITY 714,251,953 0.6% 37.7%
_SET_IP _CALL_BUILTIN_FAST 713,623,873 0.6% 38.3%
_CHECK_GLOBALS _CHECK_BUILTINS 711,616,736 0.6% 38.8%
_CHECK_VALIDITY _GUARD_IS_TRUE_POP 705,646,254 0.6% 39.4%
_GUARD_BOTH_UNICODE _COMPARE_OP_STR 695,452,528 0.6% 39.9%
_SET_IP _GUARD_BOTH_UNICODE 694,580,668 0.6% 40.5%
_START_EXECUTOR _CHECK_VALIDITY_AND_SET_IP 694,400,296 0.6% 41.1%
_LOAD_FAST_5 _LOAD_CONST_INLINE_BORROW 686,152,960 0.5% 41.6%
_STORE_FAST_7 _LOAD_FAST_7 685,391,500 0.5% 42.1%
_LOAD_FAST_7 _LOAD_FAST_3 678,857,760 0.5% 42.7%
_BINARY_SUBSCR_STR_INT _STORE_FAST_7 674,395,140 0.5% 43.2%
_STORE_FAST_1 _JUMP_TO_TOP 663,656,340 0.5% 43.7%
_SET_IP _COMPARE_OP_STR 660,496,080 0.5% 44.3%
_SET_IP _BINARY_SUBSCR 655,631,060 0.5% 44.8%
_LOAD_FAST _SET_IP 650,471,872 0.5% 45.3%
_LOAD_FAST _LOAD_CONST_INLINE_BORROW 638,677,160 0.5% 45.8%
_LOAD_ATTR _CHECK_VALIDITY 632,782,284 0.5% 46.3%
_CHECK_VALIDITY _LOAD_FAST 631,359,031 0.5% 46.8%
_GUARD_TYPE_VERSION _GUARD_DORV_VALUES_INST_ATTR_FROM_DICT 601,867,358 0.5% 47.3%
_GUARD_DORV_VALUES_INST_ATTR_FROM_DICT _GUARD_KEYS_VERSION 601,865,498 0.5% 47.8%
_LOAD_CONST_INLINE_WITH_NULL _LOAD_FAST_5 595,916,760 0.5% 48.3%
_LOAD_FAST_2 _SET_IP 580,879,838 0.5% 48.7%
_LOAD_FAST_4 _SET_IP 578,566,486 0.5% 49.2%
_SET_IP _CHECK_FUNCTION_EXACT_ARGS 571,568,739 0.5% 49.6%
_GUARD_KEYS_VERSION _LOAD_ATTR_METHOD_WITH_VALUES 554,428,258 0.4% 50.1%
_GUARD_BOTH_INT _COMPARE_OP_INT 552,599,400 0.4% 50.5%
_GUARD_BOTH_FLOAT _BINARY_OP_MULTIPLY_FLOAT 547,321,100 0.4% 51.0%
_COMPARE_OP_INT _CHECK_VALIDITY 546,886,480 0.4% 51.4%
_TO_BOOL_BOOL _GUARD_IS_TRUE_POP 531,008,516 0.4% 51.8%
_CHECK_VALIDITY _STORE_FAST 523,929,242 0.4% 52.2%
_CHECK_VALIDITY _LOAD_CONST_INLINE_BORROW 522,579,909 0.4% 52.6%
_STORE_FAST _LOAD_FAST 522,159,679 0.4% 53.1%
_LOAD_CONST_INLINE_BORROW _LOAD_CONST_INLINE_BORROW 517,421,780 0.4% 53.5%
_CHECK_VALIDITY _EXIT_TRACE 512,905,025 0.4% 53.9%
_ITER_CHECK_RANGE _GUARD_NOT_EXHAUSTED_RANGE 512,616,412 0.4% 54.3%
_SET_IP _ITER_CHECK_RANGE 497,802,170 0.4% 54.7%
_SET_IP _LOAD_ATTR 497,569,794 0.4% 55.1%
_GUARD_NOT_EXHAUSTED_RANGE _ITER_NEXT_RANGE 487,463,800 0.4% 55.5%
_ITER_NEXT_RANGE _CHECK_VALIDITY 486,047,800 0.4% 55.9%
_LOAD_FAST _LOAD_FAST 464,047,634 0.4% 56.2%
_LOAD_ATTR_METHOD_WITH_VALUES _CHECK_VALIDITY 463,541,726 0.4% 56.6%
_GUARD_TYPE_VERSION _LOAD_ATTR_METHOD_NO_DICT 453,191,279 0.4% 57.0%
_SET_IP _BINARY_OP 449,837,780 0.4% 57.3%
_GUARD_TYPE_VERSION _LOAD_ATTR_SLOT_0 444,413,782 0.4% 57.7%
_RESUME_CHECK _LOAD_FAST_0 425,106,151 0.3% 58.0%
_BINARY_OP_ADD_INT _SET_IP 416,290,640 0.3% 58.3%
_LOAD_FAST_6 _LOAD_CONST_INLINE_BORROW 405,386,040 0.3% 58.7%
_GUARD_IS_FALSE_POP _LOAD_CONST_INLINE_WITH_NULL 403,364,660 0.3% 59.0%
_SET_IP _BINARY_OP_ADD_INT 382,082,580 0.3% 59.3%
_CHECK_VALIDITY _LOAD_FAST_2 381,757,711 0.3% 59.6%
_CALL_BUILTIN_O _CHECK_VALIDITY 380,153,457 0.3% 59.9%
_LOAD_FAST_3 _LOAD_FAST_4 372,227,783 0.3% 60.2%
_SET_IP _CALL_BUILTIN_O 365,069,977 0.3% 60.5%
_CHECK_BUILTINS _LOAD_CONST_INLINE_WITH_NULL 356,017,693 0.3% 60.8%
_LOAD_ATTR_INSTANCE_VALUE_0 _SET_IP 349,455,207 0.3% 61.0%
_CHECK_VALIDITY _ITER_CHECK_TUPLE 340,832,212 0.3% 61.3%
_CHECK_VALIDITY_AND_SET_IP _LOAD_ATTR 336,466,159 0.3% 61.6%
_LOAD_CONST_INLINE _SET_IP 325,421,257 0.3% 61.8%
_SWAP _SET_IP 319,885,628 0.3% 62.1%
_LOAD_CONST_INLINE_WITH_NULL _LOAD_FAST_1 315,916,373 0.3% 62.3%
_LOAD_FAST_2 _LOAD_FAST_3 296,105,320 0.2% 62.6%
_SET_IP _LOAD_DEREF 292,326,540 0.2% 62.8%
_ITER_CHECK_TUPLE _GUARD_NOT_EXHAUSTED_TUPLE 288,779,632 0.2% 63.0%
_LOAD_ATTR_INSTANCE_VALUE_0 _TO_BOOL_BOOL 286,918,299 0.2% 63.3%
_LOAD_DEREF _CHECK_VALIDITY 280,070,820 0.2% 63.5%
_CHECK_VALIDITY _STORE_FAST_6 268,077,620 0.2% 63.7%
_LOAD_ATTR_SLOT_0 _SET_IP 267,796,778 0.2% 63.9%
_STORE_FAST _LOAD_FAST_0 266,956,143 0.2% 64.1%
_CHECK_STACK_SPACE _INIT_CALL_PY_EXACT_ARGS_4 264,603,680 0.2% 64.3%
_INIT_CALL_PY_EXACT_ARGS_4 _SAVE_RETURN_OFFSET 264,603,680 0.2% 64.6%
_GUARD_IS_TRUE_POP _JUMP_TO_TOP 263,427,350 0.2% 64.8%
_COPY _COPY 258,412,840 0.2% 65.0%
_SWAP _SWAP 258,412,840 0.2% 65.2%
_CHECK_VALIDITY_AND_SET_IP _CHECK_FUNCTION_EXACT_ARGS 255,792,112 0.2% 65.4%
_POP_FRAME _CHECK_VALIDITY 255,171,976 0.2% 65.6%
_CHECK_VALIDITY _LOAD_FAST_6 254,413,899 0.2% 65.8%
_LOAD_FAST _GUARD_BOTH_FLOAT 252,484,080 0.2% 66.0%
_LOAD_FAST_1 _LOAD_FAST 251,462,700 0.2% 66.2%
_CHECK_VALIDITY_AND_SET_IP _BINARY_SUBSCR 250,658,220 0.2% 66.4%
_SET_IP _STORE_SUBSCR_LIST_INT 247,430,260 0.2% 66.6%
_STORE_SUBSCR_LIST_INT _CHECK_VALIDITY 247,379,380 0.2% 66.8%
_GUARD_IS_FALSE_POP _LOAD_FAST_0 243,101,703 0.2% 67.0%
_GUARD_BOTH_FLOAT _BINARY_OP_ADD_FLOAT 240,825,200 0.2% 67.2%
_CHECK_VALIDITY _STORE_FAST_3 238,876,949 0.2% 67.4%
_LOAD_ATTR_METHOD_NO_DICT _CHECK_VALIDITY_AND_SET_IP 237,010,581 0.2% 67.5%
_SET_IP _BUILD_TUPLE 232,345,271 0.2% 67.7%
_CHECK_VALIDITY _LOAD_FAST_5 229,054,560 0.2% 67.9%
_ITER_NEXT_LIST _UNPACK_SEQUENCE_TWO_TUPLE 228,795,689 0.2% 68.1%
_CHECK_VALIDITY _POP_TOP 227,671,252 0.2% 68.3%
_CHECK_STACK_SPACE _INIT_CALL_PY_EXACT_ARGS_1 225,298,516 0.2% 68.5%
_INIT_CALL_PY_EXACT_ARGS_1 _SAVE_RETURN_OFFSET 225,298,516 0.2% 68.6%
_STORE_SUBSCR _CHECK_VALIDITY 221,273,340 0.2% 68.8%
_CHECK_STACK_SPACE _INIT_CALL_PY_EXACT_ARGS_0 218,067,892 0.2% 69.0%
_INIT_CALL_PY_EXACT_ARGS_0 _SAVE_RETURN_OFFSET 218,067,892 0.2% 69.2%
_CHECK_VALIDITY _IS_OP 217,218,360 0.2% 69.3%
_LOAD_ATTR_METHOD_NO_DICT _CHECK_VALIDITY 216,180,698 0.2% 69.5%
_STORE_FAST_6 _LOAD_CONST_INLINE_WITH_NULL 216,000,120 0.2% 69.7%
_SET_IP _FOR_ITER_TIER_TWO 215,603,443 0.2% 69.8%
_SET_IP _STORE_SUBSCR 215,511,720 0.2% 70.0%
_GUARD_IS_TRUE_POP _LOAD_FAST_0 214,557,913 0.2% 70.2%
_FOR_ITER_TIER_TWO _CHECK_VALIDITY 204,795,848 0.2% 70.3%
_STORE_FAST _LOAD_FAST_1 204,726,360 0.2% 70.5%
_BINARY_OP_MULTIPLY_FLOAT _GUARD_BOTH_FLOAT 204,052,800 0.2% 70.7%
_LOAD_FAST_2 _GUARD_BOTH_FLOAT 203,644,140 0.2% 70.8%
_BINARY_SUBSCR_STR_INT _LOAD_FAST_2 201,946,740 0.2% 71.0%
_RESUME_CHECK _CHECK_GLOBALS 201,314,422 0.2% 71.2%
_LOAD_ATTR _CHECK_VALIDITY_AND_SET_IP 201,253,609 0.2% 71.3%
_GUARD_BOTH_INT _BINARY_OP_SUBTRACT_INT 200,174,913 0.2% 71.5%
_GUARD_IS_TRUE_POP _LOAD_FAST_6 198,824,423 0.2% 71.6%
_STORE_FAST_1 _LOAD_FAST_0 197,411,380 0.2% 71.8%
_LOAD_FAST_4 _LOAD_CONST_INLINE_BORROW 196,219,220 0.2% 71.9%
_STORE_FAST_1 _STORE_FAST_2 194,203,240 0.2% 72.1%
_UNPACK_SEQUENCE_TWO_TUPLE _STORE_FAST_1 194,139,940 0.2% 72.3%
_CHECK_BUILTINS _LOAD_CONST_INLINE_BORROW_WITH_NULL 193,745,723 0.2% 72.4%
_BUILD_TUPLE _CHECK_VALIDITY 191,310,751 0.2% 72.6%
_GUARD_NOT_EXHAUSTED_TUPLE _ITER_NEXT_TUPLE 190,236,760 0.2% 72.7%
_BINARY_OP_MULTIPLY_FLOAT _EXIT_TRACE 190,022,880 0.2% 72.9%
_GUARD_IS_FALSE_POP _LOAD_FAST_3 189,057,938 0.2% 73.0%
_LOAD_FAST _LOAD_FAST_2 187,786,740 0.1% 73.2%
_BUILD_LIST _CHECK_VALIDITY 187,657,180 0.1% 73.3%
_IS_OP _GUARD_IS_TRUE_POP 186,597,540 0.1% 73.5%
_CHECK_BUILTINS _LOAD_CONST_INLINE_BORROW 184,752,800 0.1% 73.6%
_GUARD_BOTH_FLOAT _BINARY_OP_SUBTRACT_FLOAT 182,982,000 0.1% 73.8%
_SET_IP _POP_FRAME 182,906,060 0.1% 73.9%
_STORE_FAST_2 _SET_IP 179,819,340 0.1% 74.0%
_CHECK_VALIDITY_AND_SET_IP _CALL_METHOD_DESCRIPTOR_NOARGS 177,994,518 0.1% 74.2%
_LOAD_FAST_5 _SET_IP 176,186,875 0.1% 74.3%
_CALL_ISINSTANCE _CHECK_VALIDITY 174,969,800 0.1% 74.5%
_CHECK_VALIDITY _JUMP_TO_TOP 174,878,453 0.1% 74.6%
_LOAD_CONST_INLINE_BORROW _BINARY_SUBSCR_LIST_INT 173,368,153 0.1% 74.7%
_LOAD_ATTR_INSTANCE_VALUE_0 _LOAD_FAST_1 172,308,499 0.1% 74.9%
_SET_IP _CALL_ISINSTANCE 171,708,280 0.1% 75.0%
_CHECK_GLOBALS _LOAD_CONST_INLINE_WITH_NULL 168,917,039 0.1% 75.2%
_STORE_FAST_3 _LOAD_FAST_3 168,515,841 0.1% 75.3%
_LOAD_CONST_INLINE_BORROW _LOAD_FAST 166,009,860 0.1% 75.4%
_SET_IP _BUILD_LIST 165,738,859 0.1% 75.5%
_LOAD_CONST_INLINE_BORROW _STORE_FAST 164,171,320 0.1% 75.7%
_BINARY_SUBSCR_LIST_INT _LOAD_CONST_INLINE_BORROW 163,476,280 0.1% 75.8%
_LOAD_FAST _BINARY_OP_MULTIPLY_FLOAT 163,144,440 0.1% 75.9%
_LOAD_FAST_5 _GUARD_TYPE_VERSION 160,834,296 0.1% 76.1%
_LOAD_FAST_2 _TO_BOOL_BOOL 160,445,300 0.1% 76.2%
_STORE_FAST _LOAD_FAST_6 158,992,240 0.1% 76.3%
_CALL_METHOD_DESCRIPTOR_NOARGS _CHECK_VALIDITY 158,396,838 0.1% 76.4%
_BINARY_SUBSCR_LIST_INT _STORE_FAST 158,126,100 0.1% 76.6%
_LOAD_FAST_7 _SET_IP 156,425,215 0.1% 76.7%
_CHECK_VALIDITY_AND_SET_IP _LOAD_GLOBAL 155,172,060 0.1% 76.8%
_LOAD_GLOBAL _CHECK_VALIDITY 155,010,540 0.1% 76.9%
_STORE_FAST _LOAD_CONST_INLINE_BORROW 154,256,140 0.1% 77.1%
_COPY _BINARY_SUBSCR_LIST_INT 152,903,800 0.1% 77.2%
_LOAD_FAST_1 _GUARD_TYPE_VERSION 152,447,720 0.1% 77.3%
_LOAD_FAST_1 _LOAD_FAST_2 150,140,840 0.1% 77.4%
_STORE_FAST_6 _STORE_FAST_7 144,966,539 0.1% 77.5%
_GUARD_IS_FALSE_POP _CHECK_GLOBALS 142,970,224 0.1% 77.7%
_STORE_FAST_5 _LOAD_FAST_5 141,867,000 0.1% 77.8%
_LOAD_FAST_6 _LOAD_FAST 141,031,600 0.1% 77.9%
_CHECK_VALIDITY _LOAD_FAST_3 139,320,893 0.1% 78.0%
_LOAD_CONST_INLINE_BORROW_WITH_NULL _LOAD_FAST_1 134,694,560 0.1% 78.1%
_LOAD_FAST_5 _LOAD_CONST_INLINE 133,342,480 0.1% 78.2%
_LOAD_FAST_1 _CALL_TYPE_1 132,000,260 0.1% 78.3%
_CALL_TYPE_1 _STORE_FAST_5 132,000,000 0.1% 78.4%
_COPY _SET_IP 131,831,380 0.1% 78.5%
_TO_BOOL _CHECK_VALIDITY 131,621,759 0.1% 78.6%
_STORE_FAST_4 _LOAD_FAST_4 130,358,520 0.1% 78.7%
_LOAD_CONST_INLINE_BORROW _EXIT_TRACE 129,538,213 0.1% 78.8%
_STORE_FAST_7 _STORE_FAST 128,288,760 0.1% 78.9%
_STORE_FAST_5 _LOAD_FAST_3 127,732,520 0.1% 79.0%
_GUARD_IS_TRUE_POP _EXIT_TRACE 127,027,900 0.1% 79.1%
_PUSH_NULL _LOAD_FAST_0 126,747,616 0.1% 79.2%
_CHECK_GLOBALS _LOAD_CONST_INLINE 126,352,513 0.1% 79.3%
_TO_BOOL_INT _GUARD_IS_TRUE_POP 125,590,232 0.1% 79.4%
_CHECK_VALIDITY _LOAD_FAST_4 124,237,954 0.1% 79.5%
_LOAD_FAST _TO_BOOL_INT 123,928,500 0.1% 79.6%
_SET_IP _CALL_LEN 121,903,322 0.1% 79.7%
_BINARY_OP_SUBTRACT_INT _SET_IP 121,373,680 0.1% 79.8%
_ITER_NEXT_LIST _STORE_FAST 118,857,900 0.1% 79.9%
_ITER_NEXT_LIST _STORE_FAST_5 118,741,560 0.1% 80.0%
_CHECK_VALIDITY _UNPACK_SEQUENCE_TWO_TUPLE 118,390,080 0.1% 80.1%
_CHECK_VALIDITY_AND_SET_IP _POP_FRAME 118,164,340 0.1% 80.2%
_STORE_FAST_5 _STORE_FAST_6 117,718,590 0.1% 80.3%
_LOAD_FAST_1 _LOAD_FAST_4 117,389,920 0.1% 80.4%
_GUARD_IS_TRUE_POP _CHECK_GLOBALS 116,743,660 0.1% 80.5%
_SET_IP _BINARY_SUBSCR_DICT 116,503,380 0.1% 80.6%
_GUARD_IS_FALSE_POP _LOAD_FAST_2 115,268,052 0.1% 80.7%
_CALL_LEN _CHECK_VALIDITY 113,575,439 0.1% 80.8%
_LOAD_FAST_3 _GUARD_TYPE_VERSION 111,886,724 0.1% 80.9%
_BINARY_OP_ADD_FLOAT _SWAP 111,719,820 0.1% 80.9%
_BINARY_OP_ADD_INT _STORE_FAST 110,395,580 0.1% 81.0%
_CHECK_VALIDITY _STORE_FAST_4 108,393,839 0.1% 81.1%
_SET_IP _STORE_SLICE 108,249,300 0.1% 81.2%
_STORE_SLICE _CHECK_VALIDITY 108,071,340 0.1% 81.3%
_BINARY_OP_ADD_INT _LOAD_CONST_INLINE_BORROW 108,028,920 0.1% 81.4%
_BINARY_OP_MULTIPLY_FLOAT _BINARY_OP_ADD_FLOAT 106,167,460 0.1% 81.5%
_CHECK_VALIDITY _STORE_FAST_7 106,076,219 0.1% 81.5%
_SET_IP _BINARY_OP_MULTIPLY_INT 106,059,360 0.1% 81.6%
_BINARY_OP _LOAD_FAST_0 105,776,580 0.1% 81.7%
_SET_IP _BUILD_SLICE 104,807,000 0.1% 81.8%
_BUILD_SLICE _CHECK_VALIDITY_AND_SET_IP 104,807,000 0.1% 81.9%
_LOAD_FAST_3 _CHECK_GLOBALS 103,977,683 0.1% 82.0%
_LOAD_FAST_1 _UNPACK_SEQUENCE_TUPLE 102,015,240 0.1% 82.0%
_POP_TOP _LOAD_FAST_0 101,912,673 0.1% 82.1%
_STORE_FAST _LOAD_FAST_4 101,748,360 0.1% 82.2%
_UNPACK_SEQUENCE_TUPLE _STORE_FAST_5 101,269,140 0.1% 82.3%
_GUARD_IS_FALSE_POP _LOAD_CONST_INLINE_BORROW 101,074,840 0.1% 82.4%
_CHECK_VALIDITY _GUARD_BOTH_FLOAT 100,770,780 0.1% 82.4%
_BINARY_OP _SET_IP 98,382,771 0.1% 82.5%
_BINARY_SUBSCR_LIST_INT _SET_IP 98,036,040 0.1% 82.6%
_LOAD_FAST_4 _LOAD_FAST 97,939,980 0.1% 82.7%
_BINARY_SUBSCR_DICT _CHECK_VALIDITY 96,378,220 0.1% 82.8%
_STORE_FAST_0 _LOAD_FAST_0 96,233,601 0.1% 82.8%
_CHECK_VALIDITY _PUSH_NULL 96,107,180 0.1% 82.9%
_SET_IP _GET_ITER 96,001,942 0.1% 83.0%
_SET_IP _LIST_EXTEND 95,871,519 0.1% 83.1%
_CHECK_VALIDITY_AND_SET_IP _CALL_INTRINSIC_1 95,527,357 0.1% 83.1%
_CALL_INTRINSIC_1 _CHECK_VALIDITY 95,527,357 0.1% 83.2%
_LIST_EXTEND _CHECK_VALIDITY_AND_SET_IP 95,527,357 0.1% 83.3%
_ITER_NEXT_LIST _STORE_FAST_1 95,097,534 0.1% 83.4%
_STORE_FAST_4 _LOAD_FAST_1 94,720,020 0.1% 83.4%
_LOAD_FAST _PUSH_NULL 94,539,900 0.1% 83.5%
_SET_IP _GET_ANEXT 94,136,760 0.1% 83.6%
_GET_ANEXT _CHECK_VALIDITY 94,136,760 0.1% 83.7%
_BINARY_OP_ADD_INT _STORE_FAST_4 94,057,860 0.1% 83.7%
_SET_IP _TO_BOOL 94,014,919 0.1% 83.8%
_CHECK_VALIDITY_AND_SET_IP _BINARY_OP 91,962,588 0.1% 83.9%
_LOAD_ATTR_METHOD_WITH_VALUES _CHECK_VALIDITY_AND_SET_IP 90,886,532 0.1% 84.0%
_STORE_SUBSCR_DICT _CHECK_VALIDITY 89,729,240 0.1% 84.0%
_SET_IP _STORE_SUBSCR_DICT 89,352,720 0.1% 84.1%
_COPY _TO_BOOL_BOOL 88,503,260 0.1% 84.2%
_LIST_APPEND _JUMP_TO_TOP 87,956,800 0.1% 84.2%
_GET_ITER _CHECK_VALIDITY 87,952,440 0.1% 84.3%
_GUARD_IS_FALSE_POP _LOAD_FAST 86,627,840 0.1% 84.4%
_LOAD_CONST_INLINE_BORROW _COPY 86,422,960 0.1% 84.5%
_GUARD_IS_FALSE_POP _EXIT_TRACE 85,668,020 0.1% 84.5%
_GUARD_IS_FALSE_POP _LOAD_FAST_5 85,249,540 0.1% 84.6%
_LOAD_CONST_INLINE_WITH_NULL _LOAD_FAST_0 83,969,542 0.1% 84.7%
_STORE_FAST_7 _LOAD_FAST_3 83,852,500 0.1% 84.7%
_LOAD_ATTR_INSTANCE_VALUE_0 _GUARD_BOTH_FLOAT 83,571,960 0.1% 84.8%
_CHECK_VALIDITY _STORE_FAST_0 83,072,901 0.1% 84.9%
_SET_IP _BINARY_OP_SUBTRACT_INT 82,654,660 0.1% 84.9%
_POP_FRAME _CHECK_VALIDITY_AND_SET_IP 82,142,040 0.1% 85.0%
_STORE_FAST_2 _LOAD_FAST_2 79,313,095 0.1% 85.1%
_BINARY_SUBSCR_LIST_INT _LOAD_FAST 78,536,760 0.1% 85.1%
_STORE_FAST_1 _SET_IP 78,197,020 0.1% 85.2%
_TO_BOOL_NONE _GUARD_IS_FALSE_POP 78,040,120 0.1% 85.2%
_LOAD_FAST_2 _LOAD_CONST_INLINE_BORROW 77,069,128 0.1% 85.3%
_LOAD_FAST_0 _LOAD_FAST 76,961,520 0.1% 85.4%
_CHECK_VALIDITY _UNPACK_SEQUENCE_TUPLE 76,396,220 0.1% 85.4%
_LOAD_CONST_INLINE_WITH_NULL _LOAD_FAST_2 76,231,540 0.1% 85.5%
_LOAD_FAST_6 _SET_IP 75,845,515 0.1% 85.5%
_GUARD_TYPE_VERSION _GUARD_DORV_VALUES 75,408,865 0.1% 85.6%
_GUARD_DORV_VALUES _STORE_ATTR_INSTANCE_VALUE 75,147,745 0.1% 85.7%
_LOAD_CONST_INLINE_BORROW _LOAD_FAST_2 74,313,820 0.1% 85.7%
_LOAD_FAST_3 _TO_BOOL_NONE 73,650,180 0.1% 85.8%
_STORE_FAST _LOAD_FAST_7 73,601,700 0.1% 85.8%
_LOAD_FAST_3 _LOAD_CONST_INLINE_BORROW 73,537,098 0.1% 85.9%
_LOAD_FAST_7 _LOAD_FAST 73,230,503 0.1% 86.0%
_ITER_NEXT_TUPLE _STORE_FAST 72,688,740 0.1% 86.0%
_BINARY_OP_SUBTRACT_FLOAT _STORE_FAST 72,579,060 0.1% 86.1%
_GUARD_IS_FALSE_POP _JUMP_TO_TOP 72,541,965 0.1% 86.1%
_BINARY_OP_ADD_FLOAT _STORE_FAST 72,535,680 0.1% 86.2%
_LOAD_FAST_2 _LOAD_FAST_7 72,521,740 0.1% 86.2%
_LOAD_ATTR_INSTANCE_VALUE_0 _LOAD_FAST_0 72,503,795 0.1% 86.3%
_LOAD_ATTR_SLOT_0 _TO_BOOL_BOOL 72,429,404 0.1% 86.4%
_SET_IP _BINARY_SLICE 72,209,260 0.1% 86.4%
_STORE_FAST _SET_IP 72,111,240 0.1% 86.5%
_LOAD_FAST_5 _CHECK_GLOBALS 72,079,620 0.1% 86.5%
_LOAD_FAST_7 _LOAD_FAST_2 72,078,820 0.1% 86.6%
_STORE_FAST_1 _LOAD_FAST_1 71,839,797 0.1% 86.6%
_GUARD_IS_TRUE_POP _LOAD_FAST_1 71,386,691 0.1% 86.7%
_BUILD_TUPLE _CHECK_VALIDITY_AND_SET_IP 71,352,260 0.1% 86.8%
_STORE_FAST_3 _LOAD_FAST_2 70,183,595 0.1% 86.8%
_LOAD_FAST_3 _LOAD_FAST_5 69,998,500 0.1% 86.9%
_LOAD_FAST_1 _EXIT_TRACE 69,826,233 0.1% 86.9%
_CHECK_VALIDITY _CHECK_GLOBALS 69,437,306 0.1% 87.0%
_LOAD_CONST_INLINE_WITH_NULL _LOAD_FAST_3 68,845,620 0.1% 87.0%
_LOAD_FAST _COPY 68,408,220 0.1% 87.1%
_RESUME_CHECK _LOAD_FAST_1 67,540,680 0.1% 87.1%
_STORE_ATTR_INSTANCE_VALUE _CHECK_VALIDITY 67,303,785 0.1% 87.2%
_CHECK_STACK_SPACE _INIT_CALL_PY_EXACT_ARGS_2 67,232,420 0.1% 87.3%
_INIT_CALL_PY_EXACT_ARGS_2 _SAVE_RETURN_OFFSET 67,232,420 0.1% 87.3%
_BINARY_OP_SUBTRACT_FLOAT _LOAD_FAST_1 67,082,400 0.1% 87.4%
_ITER_NEXT_TUPLE _STORE_FAST_4 65,577,800 0.1% 87.4%
_CHECK_VALIDITY _STORE_FAST_1 65,312,380 0.1% 87.5%
_COMPARE_OP _CHECK_VALIDITY 64,794,840 0.1% 87.5%
_STORE_FAST_5 _LOAD_FAST_4 64,511,140 0.1% 87.6%
_LOAD_FAST_2 _LOAD_FAST_5 64,089,340 0.1% 87.6%
_PUSH_NULL _LOAD_FAST_2 64,065,260 0.1% 87.7%
_LOAD_FAST_4 _LOAD_FAST_0 63,772,400 0.1% 87.7%
_LOAD_FAST_2 _PUSH_NULL 63,270,712 0.1% 87.8%
_LOAD_FAST_1 _LOAD_FAST_0 63,116,280 0.1% 87.8%
_LOAD_FAST _EXIT_TRACE 62,407,760 0.0% 87.9%
_UNPACK_SEQUENCE_TUPLE _STORE_FAST 62,321,660 0.0% 87.9%
_LOAD_FAST_4 _LOAD_FAST_3 61,960,540 0.0% 88.0%
_CHECK_VALIDITY _STORE_FAST_2 61,511,887 0.0% 88.0%
_PUSH_NULL _LOAD_FAST_5 61,445,780 0.0% 88.1%
_BINARY_OP_MULTIPLY_INT _STORE_FAST 61,320,000 0.0% 88.1%
_BINARY_OP_MULTIPLY_FLOAT _LOAD_FAST 61,112,480 0.0% 88.2%
_POP_TOP _LOAD_FAST_1 61,092,660 0.0% 88.2%
_LOAD_CONST_INLINE _EXIT_TRACE 61,017,393 0.0% 88.3%
_GUARD_TYPE_VERSION _CHECK_ATTR_WITH_HINT 60,898,980 0.0% 88.3%
_CHECK_ATTR_WITH_HINT _LOAD_ATTR_WITH_HINT 60,898,980 0.0% 88.4%
_LOAD_CONST_INLINE _IS_OP 60,799,660 0.0% 88.4%
_SET_IP _CALL_METHOD_DESCRIPTOR_FAST_WITH_KEYWORDS 60,360,368 0.0% 88.5%
_LOAD_FAST_4 _PUSH_NULL 59,852,400 0.0% 88.5%
_LOAD_CONST_INLINE _PUSH_NULL 59,793,536 0.0% 88.5%
_STORE_FAST_3 _JUMP_TO_TOP 59,593,260 0.0% 88.6%
_LOAD_CONST_INLINE_BORROW _LOAD_CONST_INLINE 59,522,820 0.0% 88.6%
_BINARY_OP_MULTIPLY_INT _LOAD_CONST_INLINE_BORROW 59,219,700 0.0% 88.7%
_CHECK_VALIDITY _STORE_FAST_5 58,850,220 0.0% 88.7%
_SET_IP _CALL_METHOD_DESCRIPTOR_FAST 58,591,914 0.0% 88.8%
_BINARY_OP _LOAD_CONST_INLINE_BORROW 58,533,300 0.0% 88.8%
_IS_OP _GUARD_IS_FALSE_POP 57,810,240 0.0% 88.9%
_GUARD_IS_TRUE_POP _LOAD_CONST_INLINE 57,695,360 0.0% 88.9%
_LOAD_FAST_2 _EXIT_TRACE 57,548,440 0.0% 89.0%
_BINARY_SUBSCR _CHECK_VALIDITY_AND_SET_IP 57,501,900 0.0% 89.0%
_LOAD_FAST_0 _COPY 57,340,408 0.0% 89.1%
_GUARD_TYPE_VERSION _STORE_ATTR_SLOT 57,184,694 0.0% 89.1%
_COPY _GUARD_TYPE_VERSION 57,039,148 0.0% 89.2%
_CHECK_VALIDITY_AND_SET_IP _LIST_APPEND 55,780,722 0.0% 89.2%
_BINARY_OP_SUBTRACT_INT _SWAP 55,726,933 0.0% 89.2%
_BINARY_OP_SUBTRACT_FLOAT _SWAP 55,700,820 0.0% 89.3%
_LOAD_FAST_5 _LOAD_FAST_1 55,537,440 0.0% 89.3%
_SET_IP _COMPARE_OP 55,478,220 0.0% 89.4%
_STORE_FAST_4 _STORE_FAST_5 55,409,260 0.0% 89.4%
_LOAD_CONST_INLINE _STORE_FAST 55,344,460 0.0% 89.5%
_STORE_FAST_4 _LOAD_FAST_3 55,162,160 0.0% 89.5%
_LOAD_FAST _LOAD_FAST_5 54,943,620 0.0% 89.5%
_RESUME_CHECK _LOAD_CONST_INLINE_BORROW 54,463,943 0.0% 89.6%
_LOAD_FAST_5 _LOAD_FAST_4 54,448,360 0.0% 89.6%
_GUARD_IS_TRUE_POP _LOAD_CONST_INLINE_WITH_NULL 54,411,000 0.0% 89.7%
_LOAD_FAST_6 _CHECK_GLOBALS 54,384,500 0.0% 89.7%
_LOAD_ATTR_INSTANCE_VALUE_0 _LOAD_FAST_4 54,113,560 0.0% 89.8%
_CALL_METHOD_DESCRIPTOR_FAST_WITH_KEYWORDS _CHECK_VALIDITY 54,091,120 0.0% 89.8%
_STORE_FAST _LOAD_CONST_INLINE_WITH_NULL 54,006,060 0.0% 89.9%
_CHECK_VALIDITY_AND_SET_IP _CONTAINS_OP 53,520,308 0.0% 89.9%
_LOAD_FAST_2 _LOAD_FAST 52,934,660 0.0% 89.9%
_STORE_FAST_3 _CHECK_GLOBALS 52,811,268 0.0% 90.0%
_UNPACK_SEQUENCE_TWO_TUPLE _STORE_FAST_4 52,377,420 0.0% 90.0%
_CHECK_GLOBALS _LOAD_CONST_INLINE_BORROW 51,414,923 0.0% 90.1%
_STORE_ATTR_SLOT _CHECK_VALIDITY 51,138,194 0.0% 90.1%
_LOAD_FAST_1 _BINARY_SUBSCR_LIST_INT 51,002,440 0.0% 90.1%
_BINARY_OP _SWAP 50,433,900 0.0% 90.2%
_LOAD_FAST_3 _PUSH_NULL 49,650,140 0.0% 90.2%
_LOAD_FAST _CHECK_GLOBALS 49,621,840 0.0% 90.3%
_LOAD_ATTR_WITH_HINT _CHECK_VALIDITY 49,181,760 0.0% 90.3%
_PUSH_NULL _LOAD_CONST_INLINE_BORROW 48,178,520 0.0% 90.3%
_LOAD_FAST_7 _PUSH_NULL 47,669,120 0.0% 90.4%
_STORE_FAST _JUMP_TO_TOP 47,062,620 0.0% 90.4%
_LOAD_CONST_INLINE _LOAD_FAST_0 47,005,920 0.0% 90.5%
_BINARY_SLICE _CHECK_VALIDITY 46,908,020 0.0% 90.5%
_LOAD_FAST_1 _LOAD_FAST_6 46,478,220 0.0% 90.5%
_LOAD_CONST_INLINE_BORROW _LOAD_FAST_1 46,176,120 0.0% 90.6%
_POP_TOP _LOAD_FAST 45,819,900 0.0% 90.6%
_GUARD_IS_TRUE_POP _LOAD_FAST 45,734,000 0.0% 90.6%
_PUSH_NULL _LOAD_FAST_4 45,637,300 0.0% 90.7%
_BINARY_OP_ADD_INT _LOAD_FAST_5 45,420,360 0.0% 90.7%
_LOAD_FAST _TO_BOOL_BOOL 45,270,880 0.0% 90.7%
_GUARD_KEYS_VERSION _LOAD_ATTR_NONDESCRIPTOR_WITH_VALUES 45,256,000 0.0% 90.8%
_LOAD_ATTR_INSTANCE_VALUE_0 _LOAD_FAST_2 45,103,440 0.0% 90.8%
_LOAD_FAST_4 _COPY 44,631,180 0.0% 90.9%
_PUSH_NULL _LOAD_CONST_INLINE 44,062,880 0.0% 90.9%
_CALL_METHOD_DESCRIPTOR_FAST _CHECK_VALIDITY 43,878,174 0.0% 90.9%
_ITER_NEXT_LIST _STORE_FAST_2 43,836,000 0.0% 91.0%
_BINARY_OP_ADD_INT _LOAD_FAST_0 43,125,160 0.0% 91.0%
_STORE_FAST_2 _LOAD_FAST_0 43,017,372 0.0% 91.0%
_BINARY_OP _BINARY_SUBSCR_LIST_INT 42,820,920 0.0% 91.1%
_ITER_NEXT_LIST _STORE_FAST_3 42,483,314 0.0% 91.1%
_LOAD_FAST _LOAD_FAST_7 42,385,780 0.0% 91.1%
_LOAD_CONST_INLINE_WITH_NULL _LOAD_FAST 42,334,620 0.0% 91.2%
_CHECK_VALIDITY_AND_SET_IP _CALL_BUILTIN_FAST 42,237,000 0.0% 91.2%
_BINARY_OP_ADD_FLOAT _LOAD_FAST_0 41,992,140 0.0% 91.2%
_GUARD_IS_FALSE_POP _POP_TOP 41,590,740 0.0% 91.3%
_CALL_BUILTIN_FAST _CHECK_VALIDITY_AND_SET_IP 41,578,300 0.0% 91.3%
_LOAD_CONST_INLINE_BORROW _TO_BOOL_BOOL 41,380,800 0.0% 91.3%
_LOAD_FAST_4 _LOAD_FAST_1 41,340,900 0.0% 91.4%
_COMPARE_OP_FLOAT _CHECK_VALIDITY 40,939,500 0.0% 91.4%
_GUARD_BOTH_FLOAT _COMPARE_OP_FLOAT 40,939,500 0.0% 91.4%
_TO_BOOL_BOOL _EXIT_TRACE 40,718,080 0.0% 91.5%
_BINARY_OP _GUARD_BOTH_FLOAT 40,680,320 0.0% 91.5%
_PUSH_NULL _LOAD_FAST 40,521,420 0.0% 91.5%
_RESUME_CHECK _LOAD_CONST_INLINE 40,402,800 0.0% 91.6%
_BINARY_OP_ADD_FLOAT _STORE_FAST_3 40,248,000 0.0% 91.6%
_LOAD_FAST_5 _EXIT_TRACE 40,229,840 0.0% 91.6%
_SET_IP _GUARD_BOTH_FLOAT 39,838,320 0.0% 91.6%
_STORE_FAST_5 _LOAD_FAST_2 39,431,880 0.0% 91.7%
_CHECK_STACK_SPACE _INIT_CALL_PY_EXACT_ARGS_3 39,272,600 0.0% 91.7%
_INIT_CALL_PY_EXACT_ARGS_3 _SAVE_RETURN_OFFSET 39,272,600 0.0% 91.7%
_POP_TOP _JUMP_TO_TOP 39,081,200 0.0% 91.8%
_BINARY_OP_MULTIPLY_FLOAT _BINARY_OP_SUBTRACT_FLOAT 38,471,400 0.0% 91.8%
_BINARY_OP_ADD_INT _SWAP 38,352,775 0.0% 91.8%
_LOAD_FAST _LOAD_CONST_INLINE 38,146,880 0.0% 91.9%
_SET_IP _LIST_APPEND 38,067,840 0.0% 91.9%
_STORE_ATTR _CHECK_VALIDITY 37,984,860 0.0% 91.9%
_LOAD_FAST_1 _LOAD_FAST_3 37,965,392 0.0% 92.0%
_CHECK_VALIDITY _COPY 37,818,980 0.0% 92.0%
_CHECK_VALIDITY_AND_SET_IP _TO_BOOL 37,606,840 0.0% 92.0%
_LOAD_FAST_4 _BINARY_SUBSCR_LIST_INT 37,438,980 0.0% 92.0%
_LOAD_FAST_3 _LOAD_CONST_INLINE 37,283,400 0.0% 92.1%
_TO_BOOL_LIST _GUARD_IS_TRUE_POP 36,948,837 0.0% 92.1%
_STORE_FAST_4 _LOAD_FAST_0 36,566,759 0.0% 92.1%
_LOAD_CONST_INLINE_BORROW _POP_FRAME 36,243,616 0.0% 92.2%
_IS_OP _EXIT_TRACE 36,054,120 0.0% 92.2%
_BINARY_OP_MULTIPLY_FLOAT _STORE_FAST 36,053,820 0.0% 92.2%
_STORE_FAST_2 _CHECK_GLOBALS 36,041,600 0.0% 92.2%
_CHECK_VALIDITY_AND_SET_IP _STORE_ATTR 36,004,520 0.0% 92.3%
_LOAD_FAST_6 _LOAD_FAST_7 35,712,623 0.0% 92.3%
_LOAD_FAST_3 _GUARD_BOTH_FLOAT 35,412,900 0.0% 92.3%
_LOAD_FAST_1 _BINARY_SUBSCR_TUPLE_INT 34,756,120 0.0% 92.4%
_LOAD_FAST_6 _LOAD_FAST_4 34,541,580 0.0% 92.4%
_LOAD_FAST_3 _LOAD_FAST_2 34,506,260 0.0% 92.4%
_GUARD_IS_TRUE_POP _POP_TOP 33,666,054 0.0% 92.4%
_STORE_FAST _CHECK_GLOBALS 33,180,720 0.0% 92.5%
_LOAD_FAST_3 _LOAD_FAST_1 32,642,240 0.0% 92.5%
_CHECK_VALIDITY _LOAD_FAST_7 32,375,190 0.0% 92.5%
_LOAD_ATTR_INSTANCE_VALUE_0 _LOAD_FAST_3 32,094,300 0.0% 92.5%
_ITER_NEXT_LIST _STORE_FAST_4 31,995,820 0.0% 92.6%
_GUARD_GLOBALS_VERSION _LOAD_GLOBAL_MODULE 31,871,200 0.0% 92.6%
_LOAD_CONST_INLINE_BORROW _BINARY_SUBSCR_TUPLE_INT 31,641,900 0.0% 92.6%
_COMPARE_OP_INT _CHECK_VALIDITY_AND_SET_IP 31,305,000 0.0% 92.6%
_GUARD_BOTH_INT _BINARY_OP_MULTIPLY_INT 31,291,980 0.0% 92.7%
_LOAD_FAST_5 _LOAD_FAST 31,182,960 0.0% 92.7%
_GUARD_IS_FALSE_POP _LOAD_FAST_4 31,088,960 0.0% 92.7%
_SET_IP _CALL_BUILTIN_CLASS 30,828,122 0.0% 92.7%
_SET_IP _CALL_STR_1 30,760,980 0.0% 92.8%
_BINARY_OP_ADD_INT _COPY 30,660,000 0.0% 92.8%
_CHECK_VALIDITY_AND_SET_IP _BUILD_TUPLE 30,317,740 0.0% 92.8%
_TO_BOOL_BOOL _UNARY_NOT 29,833,560 0.0% 92.8%
_STORE_FAST _LOAD_FAST_5 29,776,800 0.0% 92.9%
_BINARY_OP _STORE_FAST 29,651,151 0.0% 92.9%
_UNPACK_SEQUENCE_TWO_TUPLE _STORE_FAST 29,418,900 0.0% 92.9%
_STORE_FAST_7 _LOAD_FAST_6 29,222,260 0.0% 92.9%
_CALL_METHOD_DESCRIPTOR_FAST _CHECK_VALIDITY_AND_SET_IP 28,551,240 0.0% 93.0%
_LOAD_FAST_2 _GUARD_TYPE_VERSION 28,132,220 0.0% 93.0%
_UNPACK_SEQUENCE_TWO_TUPLE _STORE_FAST_3 28,094,900 0.0% 93.0%
_CHECK_VALIDITY_AND_SET_IP _BUILD_LIST 27,589,661 0.0% 93.0%
_CHECK_VALIDITY _LOAD_CONST_INLINE 27,277,633 0.0% 93.0%
_LOAD_ATTR_INSTANCE_VALUE_0 _LOAD_CONST_INLINE_BORROW 27,139,181 0.0% 93.1%
_LOAD_FAST_5 _COPY 27,075,060 0.0% 93.1%
_LOAD_FAST_5 _BINARY_SUBSCR_LIST_INT 26,751,180 0.0% 93.1%
_BINARY_SUBSCR_DICT _CHECK_VALIDITY_AND_SET_IP 26,593,640 0.0% 93.1%
_GUARD_IS_TRUE_POP _LOAD_FAST_7 26,591,859 0.0% 93.2%
_STORE_FAST_3 _LOAD_CONST_INLINE_BORROW 26,582,134 0.0% 93.2%
_LOAD_FAST_2 _LOAD_CONST_INLINE 26,531,280 0.0% 93.2%
_LOAD_ATTR_SLOT_0 _LOAD_FAST_0 26,402,700 0.0% 93.2%
_GUARD_IS_TRUE_POP _LOAD_FAST_3 26,236,760 0.0% 93.2%
_LOAD_ATTR_INSTANCE_VALUE_0 _COPY 26,098,220 0.0% 93.3%
_STORE_FAST_6 _LOAD_FAST_6 25,801,900 0.0% 93.3%
_CHECK_CALL_BOUND_METHOD_EXACT_ARGS _INIT_CALL_BOUND_METHOD_EXACT_ARGS 25,796,200 0.0% 93.3%
_INIT_CALL_BOUND_METHOD_EXACT_ARGS _CHECK_FUNCTION_EXACT_ARGS 25,796,200 0.0% 93.3%
_SET_IP _COMPARE_OP_INT 25,592,080 0.0% 93.3%
_CALL_STR_1 _CHECK_VALIDITY_AND_SET_IP 25,569,900 0.0% 93.4%
_BINARY_OP_MULTIPLY_FLOAT _LOAD_FAST_1 25,560,900 0.0% 93.4%
_LOAD_ATTR_INSTANCE_VALUE_0 _STORE_FAST_3 25,342,340 0.0% 93.4%
_BINARY_SLICE _CHECK_VALIDITY_AND_SET_IP 25,327,400 0.0% 93.4%
_PUSH_NULL _CHECK_GLOBALS 25,109,740 0.0% 93.4%
_UNARY_NOT _COPY 25,016,660 0.0% 93.5%
_GUARD_IS_TRUE_POP _LOAD_FAST_5 24,924,960 0.0% 93.5%
_LOAD_FAST_3 _BINARY_SUBSCR_LIST_INT 24,703,540 0.0% 93.5%
_LOAD_ATTR_NONDESCRIPTOR_WITH_VALUES _LOAD_FAST 24,697,200 0.0% 93.5%
_LOAD_CONST_INLINE _LOAD_CONST_INLINE 24,548,420 0.0% 93.5%
_CHECK_VALIDITY _SWAP 24,278,270 0.0% 93.6%
_POP_TOP _LOAD_FAST_3 24,035,380 0.0% 93.6%
_UNPACK_SEQUENCE_TUPLE _UNPACK_SEQUENCE_LIST 24,002,040 0.0% 93.6%
_GUARD_TYPE_VERSION _CHECK_ATTR_METHOD_LAZY_DICT 23,961,660 0.0% 93.6%
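
For anyone reading the table above: the last two columns appear to be each pair's share of all executed UOp pairs and the running total of those shares. Below is a minimal sketch of that calculation, assuming a mapping from pair to execution count. The `with_shares` helper and the toy counts are hypothetical and are not taken from the branch or its stats tooling.

```python
from itertools import accumulate


def with_shares(pair_counts, total):
    """Return (pair, count, share %, cumulative share %) rows, most frequent first.

    pair_counts: dict mapping (first_uop, second_uop) -> execution count.
    total: number of executed pairs overall (may exceed the sum of pair_counts
           if rarer pairs were left out of the mapping).
    """
    ordered = sorted(pair_counts.items(), key=lambda item: item[1], reverse=True)
    running = accumulate(count for _, count in ordered)
    return [
        (pair, count, 100 * count / total, 100 * cumulative / total)
        for (pair, count), cumulative in zip(ordered, running)
    ]


# Toy counts, chosen only to make the arithmetic easy to follow.
toy_counts = {
    ("_STORE_FAST", "_LOAD_FAST"): 30,
    ("_SET_IP", "_BUILD_TUPLE"): 50,
    ("_COPY", "_SET_IP"): 20,
}

rows = with_shares(toy_counts, total=100)
for (first, second), count, share, cumulative in rows:
    print(f"{first} {second} {count:,} {share:.1f}% {cumulative:.1f}%")
```

Printing with one decimal place matches the rounding in the table, which is why many adjacent rows show the same share while the cumulative column still creeps upward.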

To assess their potency, I've taken the 326 UOp pairs that account for more than 0.1% of all UOp pairs and run them through the scoring script, which compares the length of each superinstruction in machine instructions to the sum of the lengths of its components. Here are those scores, with both the percentage difference and the absolute difference:

./python Tools/jit/score.py jit_stencils.h --output_mode "table" --sort_by "delta"
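
As a rough illustration of how the Percentage and Delta columns below relate to the two size columns, here is a small sketch. It is not the actual `Tools/jit/score.py`; the `PairScore` class and its field names are made up for the example, and the sizes are copied from three rows of the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairScore:
    """Size comparison for one UOp pair (illustrative names, not from the branch)."""

    first: str
    second: str
    separate: int  # summed size of the two single-UOp stencils
    combined: int  # size of the stencil compiled for the pair as a whole

    @property
    def percentage(self) -> float:
        # Combined size as a percentage of the separate size; lower is better.
        return round(100 * self.combined / self.separate, 2)

    @property
    def delta(self) -> int:
        # Negative means the superinstruction is smaller than its parts.
        return self.combined - self.separate


scores = [
    PairScore("_CHECK_VALIDITY_AND_SET_IP", "_CALL_METHOD_DESCRIPTOR_NOARGS", 622, 562),
    PairScore("_PUSH_NULL", "_LOAD_FAST_0", 56, 40),
    PairScore("_TO_BOOL_INT", "_GUARD_IS_TRUE_POP", 331, 207),
]

# The two sort orders can disagree: a large stencil may save many
# instructions in absolute terms while shrinking little in relative terms.
by_delta = sorted(scores, key=lambda s: s.delta)
by_percentage = sorted(scores, key=lambda s: s.percentage)

for s in by_delta:
    print(f"{s.first} / {s.second} {s.separate} {s.combined} {s.percentage}% {s.delta}")
```

Sorting by delta favours pairs built from large UOps, while sorting by percentage favours small pairs such as `_PUSH_NULL / _LOAD_FAST_0`, so the two views are worth reading side by side.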

Scoring for top 326 UOp Pairs (x86_64)
UOps | MC Count of Individual Ops | MC Count when Compiled Together | Percentage (together / individual) | Delta (together - individual)
_TO_BOOL_INT / _GUARD_IS_TRUE_POP 331 207 62.54% -124
_TO_BOOL_NONE / _GUARD_IS_FALSE_POP 243 128 52.67% -115
_TO_BOOL_BOOL / _GUARD_IS_FALSE_POP 233 133 57.08% -100
_TO_BOOL_BOOL / _GUARD_IS_TRUE_POP 233 133 57.08% -100
_GUARD_IS_TRUE_POP / _EXIT_TRACE 201 106 52.74% -95
_GUARD_IS_FALSE_POP / _EXIT_TRACE 201 106 52.74% -95
_BINARY_OP_MULTIPLY_FLOAT / _BINARY_OP_ADD_FLOAT 506 413 81.62% -93
_COMPARE_OP_INT / _CHECK_VALIDITY 439 347 79.04% -92
_CHECK_MANAGED_OBJECT_HAS_VALUES / _LOAD_ATTR_INSTANCE_VALUE_0 325 237 72.92% -88
_LIST_APPEND / _JUMP_TO_TOP 265 184 69.43% -81
_JUMP_TO_TOP / _CHECK_VALIDITY 184 108 58.7% -76
_FOR_ITER_TIER_TWO / _CHECK_VALIDITY 360 294 81.67% -66
_ITER_CHECK_LIST / _GUARD_NOT_EXHAUSTED_LIST 183 121 66.12% -62
_ITER_CHECK_RANGE / _GUARD_NOT_EXHAUSTED_RANGE 171 109 63.74% -62
_ITER_CHECK_TUPLE / _GUARD_NOT_EXHAUSTED_TUPLE 183 121 66.12% -62
_CHECK_VALIDITY_AND_SET_IP / _CALL_METHOD_DESCRIPTOR_NOARGS 622 562 90.35% -60
_CHECK_VALIDITY / _ITER_CHECK_LIST 166 108 65.06% -58
_CHECK_GLOBALS / _CHECK_BUILTINS 172 114 66.28% -58
_CHECK_VALIDITY / _ITER_CHECK_TUPLE 166 108 65.06% -58
_RESUME_CHECK / _CHECK_GLOBALS 169 111 65.68% -58
_BINARY_SUBSCR_LIST_INT / _STORE_FAST 388 330 85.05% -58
_CHECK_VALIDITY / _CHECK_GLOBALS 162 104 64.2% -58
_GUARD_DORV_VALUES_INST_ATTR_FROM_DICT / _GUARD_KEYS_VERSION 232 176 75.86% -56
_CHECK_VALIDITY / _UNPACK_SEQUENCE_TUPLE 339 283 83.48% -56
_CHECK_FUNCTION_EXACT_ARGS / _CHECK_STACK_SPACE 313 259 82.75% -54
_CHECK_VALIDITY / _RESUME_CHECK 159 105 66.04% -54
_CALL_TYPE_1 / _STORE_FAST_5 422 368 87.2% -54
_BINARY_OP_ADD_INT / _STORE_FAST_1 312 260 83.33% -52
_CHECK_VALIDITY / _UNPACK_SEQUENCE_TWO_TUPLE 281 229 81.49% -52
_BINARY_OP_ADD_INT / _STORE_FAST 327 275 84.1% -52
_BINARY_OP_ADD_INT / _STORE_FAST_4 312 260 83.33% -52
_CHECK_VALIDITY_AND_SET_IP / _CHECK_FUNCTION_EXACT_ARGS 259 208 80.31% -51
_BINARY_OP_SUBTRACT_FLOAT / _STORE_FAST 357 307 85.99% -50
_BINARY_OP_ADD_FLOAT / _STORE_FAST 357 307 85.99% -50
_BINARY_SUBSCR_STR_INT / _STORE_FAST_7 540 491 90.93% -49
_CALL_METHOD_DESCRIPTOR_NOARGS / _CHECK_VALIDITY 608 562 92.43% -46
_STORE_FAST / _STORE_FAST 208 163 78.37% -45
_CALL_BUILTIN_O / _CHECK_VALIDITY 593 548 92.41% -45
_STORE_FAST_1 / _STORE_FAST_2 178 133 74.72% -45
_IS_OP / _GUARD_IS_TRUE_POP 313 268 85.62% -45
_STORE_FAST_6 / _STORE_FAST_7 184 139 75.54% -45
_STORE_FAST_7 / _STORE_FAST 199 154 77.39% -45
_STORE_FAST_5 / _STORE_FAST_6 178 133 74.72% -45
_BINARY_SUBSCR_DICT / _CHECK_VALIDITY 409 364 89.0% -45
_STORE_SUBSCR_LIST_INT / _CHECK_VALIDITY 377 333 88.33% -44
_LOAD_FAST_1 / _BINARY_SUBSCR_STR_INT 477 435 91.19% -42
_CALL_ISINSTANCE / _CHECK_VALIDITY 532 490 92.11% -42
_CALL_LEN / _CHECK_VALIDITY 472 430 91.1% -42
_COPY / _BINARY_SUBSCR_LIST_INT 338 297 87.87% -41
_STORE_SUBSCR_DICT / _CHECK_VALIDITY 338 298 88.17% -40
_LOAD_FAST_1 / _UNPACK_SEQUENCE_TUPLE 295 259 87.8% -36
_GUARD_BOTH_INT / _BINARY_OP_ADD_INT 348 318 91.38% -30
_GUARD_BOTH_INT / _BINARY_OP_SUBTRACT_INT 348 318 91.38% -30
_CALL_BUILTIN_FAST / _CHECK_VALIDITY 590 561 95.08% -29
_STORE_FAST_1 / _JUMP_TO_TOP 197 168 85.28% -29
_GUARD_BOTH_UNICODE / _COMPARE_OP_STR 335 307 91.64% -28
_BINARY_SUBSCR_LIST_INT / _LOAD_CONST_INLINE_BORROW 314 286 91.08% -28
_BINARY_SUBSCR_LIST_INT / _LOAD_FAST 330 302 91.52% -28
_LOAD_CONST_INLINE_BORROW / _BINARY_SUBSCR_LIST_INT 314 287 91.4% -27
_UNPACK_SEQUENCE_TUPLE / _STORE_FAST_5 352 325 92.33% -27
_GUARD_NOT_EXHAUSTED_LIST / _ITER_NEXT_LIST 149 124 83.22% -25
_GUARD_NOT_EXHAUSTED_TUPLE / _ITER_NEXT_TUPLE 146 121 82.88% -25
_LOAD_CONST_INLINE_BORROW / _STORE_FAST 134 109 81.34% -25
_ITER_NEXT_LIST / _STORE_FAST 160 135 84.38% -25
_ITER_NEXT_LIST / _STORE_FAST_5 145 120 82.76% -25
_ITER_NEXT_LIST / _STORE_FAST_1 145 120 82.76% -25
_ITER_NEXT_TUPLE / _STORE_FAST 157 132 84.08% -25
_ITER_NEXT_TUPLE / _STORE_FAST_4 142 117 82.39% -25
_CHECK_VALIDITY / _GUARD_IS_FALSE_POP 194 171 88.14% -23
_PUSH_FRAME / _CHECK_VALIDITY 144 121 84.03% -23
_CHECK_VALIDITY / _TO_BOOL_BOOL 191 168 87.96% -23
_CHECK_VALIDITY / _GUARD_IS_TRUE_POP 194 171 88.14% -23
_CHECK_VALIDITY / _EXIT_TRACE 159 136 85.53% -23
_ITER_NEXT_LIST / _UNPACK_SEQUENCE_TWO_TUPLE 261 238 91.19% -23
_CHECK_VALIDITY / _GUARD_BOTH_FLOAT 201 178 88.56% -23
_GUARD_IS_TRUE_POP / _JUMP_TO_TOP 226 204 90.27% -22
_GUARD_IS_FALSE_POP / _JUMP_TO_TOP 226 204 90.27% -22
_LOAD_FAST / _BINARY_OP_MULTIPLY_FLOAT 299 278 92.98% -21
_GUARD_TYPE_VERSION / _CHECK_MANAGED_OBJECT_HAS_VALUES 253 233 92.09% -20
_GUARD_TYPE_VERSION / _GUARD_DORV_VALUES_INST_ATTR_FROM_DICT 253 233 92.09% -20
_GUARD_BOTH_FLOAT / _BINARY_OP_MULTIPLY_FLOAT 378 358 94.71% -20
_STORE_FAST / _LOAD_FAST 150 130 86.67% -20
_GUARD_BOTH_FLOAT / _BINARY_OP_ADD_FLOAT 378 358 94.71% -20
_GUARD_BOTH_FLOAT / _BINARY_OP_SUBTRACT_FLOAT 378 358 94.71% -20
_STORE_FAST / _LOAD_CONST_INLINE_BORROW 134 114 85.07% -20
_STORE_SLICE / _CHECK_VALIDITY 392 372 94.9% -20
_BINARY_OP / _LOAD_FAST_0 269 249 92.57% -20
_BINARY_OP_SUBTRACT_FLOAT / _LOAD_FAST_1 285 265 92.98% -20
_GUARD_IS_FALSE_POP / _LOAD_FAST_7 153 134 87.58% -19
_GUARD_IS_FALSE_POP / _LOAD_FAST_1 150 131 87.33% -19
_GUARD_IS_FALSE_POP / _LOAD_FAST_0 150 131 87.33% -19
_GUARD_IS_TRUE_POP / _LOAD_FAST_0 150 131 87.33% -19
_BINARY_OP_MULTIPLY_FLOAT / _GUARD_BOTH_FLOAT 378 359 94.97% -19
_BINARY_SUBSCR_STR_INT / _LOAD_FAST_2 477 458 96.02% -19
_GUARD_IS_TRUE_POP / _LOAD_FAST_6 150 131 87.33% -19
_GUARD_IS_FALSE_POP / _LOAD_FAST_3 150 131 87.33% -19
_GUARD_IS_FALSE_POP / _LOAD_FAST_2 150 131 87.33% -19
_GUARD_IS_FALSE_POP / _LOAD_CONST_INLINE_BORROW 148 129 87.16% -19
_GUARD_IS_FALSE_POP / _LOAD_FAST 164 145 88.41% -19
_GUARD_IS_FALSE_POP / _LOAD_FAST_5 150 131 87.33% -19
_GUARD_IS_TRUE_POP / _LOAD_FAST_1 150 131 87.33% -19
_BINARY_OP_ADD_FLOAT / _SWAP 307 289 94.14% -18
_BINARY_OP_ADD_INT / _LOAD_CONST_INLINE_BORROW 253 235 92.89% -18
_LOAD_FAST_0 / _GUARD_TYPE_VERSION 153 136 88.89% -17
_CHECK_VALIDITY_AND_SET_IP / _LOAD_ATTR 405 388 95.8% -17
_COPY / _COPY 108 91 84.26% -17
_SWAP / _SWAP 108 91 84.26% -17
_POP_FRAME / _CHECK_VALIDITY 187 170 90.91% -17
_LOAD_FAST / _GUARD_BOTH_FLOAT 171 154 90.06% -17
_LOAD_FAST_2 / _GUARD_BOTH_FLOAT 157 140 89.17% -17
_LOAD_FAST_5 / _GUARD_TYPE_VERSION 153 136 88.89% -17
_LOAD_FAST_2 / _TO_BOOL_BOOL 147 130 88.44% -17
_LOAD_FAST_1 / _GUARD_TYPE_VERSION 153 136 88.89% -17
_LOAD_FAST_3 / _GUARD_TYPE_VERSION 153 136 88.89% -17
_COPY / _TO_BOOL_BOOL 169 152 89.94% -17
_LOAD_CONST_INLINE_BORROW / _COPY 84 67 79.76% -17
_POP_FRAME / _CHECK_VALIDITY_AND_SET_IP 201 184 91.54% -17
_LOAD_FAST / _COPY 100 83 83.0% -17
_LOAD_FAST_7 / _LOAD_CONST_INLINE_BORROW 65 49 75.38% -16
_LOAD_FAST_0 / _LOAD_FAST_1 64 48 75.0% -16
_LOAD_FAST_1 / _LOAD_CONST_INLINE_BORROW 62 46 74.19% -16
_LOAD_FAST_5 / _LOAD_CONST_INLINE_BORROW 62 46 74.19% -16
_LOAD_FAST_7 / _LOAD_FAST_3 67 51 76.12% -16
_LOAD_FAST / _LOAD_CONST_INLINE_BORROW 76 60 78.95% -16
_LOAD_CONST_INLINE_WITH_NULL / _LOAD_FAST_5 78 62 79.49% -16
_GUARD_KEYS_VERSION / _LOAD_ATTR_METHOD_WITH_VALUES 149 133 89.26% -16
_LOAD_CONST_INLINE_BORROW / _LOAD_CONST_INLINE_BORROW 60 44 73.33% -16
_LOAD_FAST / _LOAD_FAST 92 76 82.61% -16
_GUARD_TYPE_VERSION / _LOAD_ATTR_METHOD_NO_DICT 170 154 90.59% -16
_GUARD_TYPE_VERSION / _LOAD_ATTR_SLOT_0 309 293 94.82% -16
_LOAD_FAST_6 / _LOAD_CONST_INLINE_BORROW 62 46 74.19% -16
_GUARD_IS_FALSE_POP / _LOAD_CONST_INLINE_WITH_NULL 164 148 90.24% -16
_LOAD_FAST_3 / _LOAD_FAST_4 64 48 75.0% -16
_LOAD_CONST_INLINE_WITH_NULL / _LOAD_FAST_1 78 62 79.49% -16
_LOAD_FAST_2 / _LOAD_FAST_3 64 48 75.0% -16
_CHECK_STACK_SPACE / _INIT_CALL_PY_EXACT_ARGS_4 637 621 97.49% -16
_LOAD_FAST_1 / _LOAD_FAST 78 62 79.49% -16
_CHECK_STACK_SPACE / _INIT_CALL_PY_EXACT_ARGS_1 594 578 97.31% -16
_LOAD_FAST_4 / _LOAD_CONST_INLINE_BORROW 62 46 74.19% -16
_UNPACK_SEQUENCE_TWO_TUPLE / _STORE_FAST_1 294 278 94.56% -16
_LOAD_FAST / _LOAD_FAST_2 78 62 79.49% -16
_LOAD_CONST_INLINE_BORROW / _LOAD_FAST 76 60 78.95% -16
_LOAD_FAST_1 / _LOAD_FAST_2 64 48 75.0% -16
_LOAD_FAST_6 / _LOAD_FAST 78 62 79.49% -16
_LOAD_CONST_INLINE_BORROW_WITH_NULL / _LOAD_FAST_1 70 54 77.14% -16
_LOAD_FAST_5 / _LOAD_CONST_INLINE 70 54 77.14% -16
_PUSH_NULL / _LOAD_FAST_0 56 40 71.43% -16
_LOAD_FAST / _TO_BOOL_INT 259 243 93.82% -16
_LOAD_FAST_1 / _LOAD_FAST_4 64 48 75.0% -16
_LOAD_FAST_4 / _LOAD_FAST 78 62 79.49% -16
_LOAD_FAST / _PUSH_NULL 70 54 77.14% -16
_LOAD_CONST_INLINE_WITH_NULL / _LOAD_FAST_0 78 62 79.49% -16
_LOAD_FAST_2 / _LOAD_CONST_INLINE_BORROW 62 46 74.19% -16
_LOAD_FAST_0 / _LOAD_FAST 78 62 79.49% -16
_LOAD_CONST_INLINE_WITH_NULL / _LOAD_FAST_2 78 62 79.49% -16
_LOAD_CONST_INLINE_BORROW / _LOAD_FAST_2 62 46 74.19% -16
_LOAD_FAST_3 / _LOAD_CONST_INLINE_BORROW 62 46 74.19% -16
_LOAD_FAST_7 / _LOAD_FAST 81 65 80.25% -16
_LOAD_FAST_2 / _LOAD_FAST_7 67 51 76.12% -16
_LOAD_FAST_7 / _LOAD_FAST_2 67 51 76.12% -16
_LOAD_FAST_3 / _LOAD_FAST_5 64 48 75.0% -16
_LOAD_CONST_INLINE_WITH_NULL / _LOAD_FAST_3 78 62 79.49% -16
_CHECK_STACK_SPACE / _INIT_CALL_PY_EXACT_ARGS_2 603 587 97.35% -16
_LOAD_FAST_2 / _LOAD_FAST_5 64 48 75.0% -16
_PUSH_NULL / _LOAD_FAST_2 56 40 71.43% -16
_LOAD_FAST_4 / _LOAD_FAST_0 64 48 75.0% -16
_LOAD_FAST_2 / _PUSH_NULL 56 40 71.43% -16
_LOAD_FAST_1 / _LOAD_FAST_0 64 48 75.0% -16
_GUARD_BOTH_INT / _COMPARE_OP_INT 488 473 96.93% -15
_CHECK_VALIDITY_AND_SET_IP / _LOAD_GLOBAL 458 443 96.72% -15
_LOAD_CONST_INLINE_BORROW / _SET_IP 57 44 77.19% -13
_SET_IP / _GUARD_BOTH_INT 152 139 91.45% -13
_SET_IP / _GUARD_TYPE_VERSION 148 135 91.22% -13
_SET_IP / _CONTAINS_OP 287 274 95.47% -13
_LOAD_FAST_1 / _SET_IP 59 46 77.97% -13
_CHECK_VALIDITY / _LOAD_FAST_0 108 95 87.96% -13
_LOAD_FAST_3 / _SET_IP 59 46 77.97% -13
_CHECK_VALIDITY / _LOAD_FAST_1 108 95 87.96% -13
_LOAD_FAST_0 / _SET_IP 59 46 77.97% -13
_SAVE_RETURN_OFFSET / _PUSH_FRAME 95 82 86.32% -13
_CHECK_VALIDITY / _SET_IP 103 90 87.38% -13
_SET_IP / _GUARD_BOTH_UNICODE 152 139 91.45% -13
_SET_IP / _COMPARE_OP_STR 237 224 94.51% -13
_SET_IP / _BINARY_SUBSCR 250 237 94.8% -13
_LOAD_FAST / _SET_IP 73 60 82.19% -13
_CHECK_VALIDITY / _LOAD_FAST 122 109 89.34% -13
_LOAD_FAST_2 / _SET_IP 59 46 77.97% -13
_LOAD_FAST_4 / _SET_IP 59 46 77.97% -13
_SET_IP / _CHECK_FUNCTION_EXACT_ARGS 196 183 93.37% -13
_CHECK_VALIDITY / _LOAD_CONST_INLINE_BORROW 106 93 87.74% -13
_SET_IP / _ITER_CHECK_RANGE 117 104 88.89% -13
_SET_IP / _LOAD_ATTR 342 329 96.2% -13
_LOAD_ATTR_METHOD_WITH_VALUES / _CHECK_VALIDITY 125 112 89.6% -13
_SET_IP / _BINARY_OP 264 251 95.08% -13
_RESUME_CHECK / _LOAD_FAST_0 115 102 88.7% -13
_BINARY_OP_ADD_INT / _SET_IP 250 237 94.8% -13
_SET_IP / _BINARY_OP_ADD_INT 250 237 94.8% -13
_CHECK_VALIDITY / _LOAD_FAST_2 108 95 87.96% -13
_CHECK_BUILTINS / _LOAD_CONST_INLINE_WITH_NULL 132 119 90.15% -13
_LOAD_ATTR_INSTANCE_VALUE_0 / _SET_IP 220 207 94.09% -13
_LOAD_CONST_INLINE / _SET_IP 65 52 80.0% -13
_SWAP / _SET_IP 81 68 83.95% -13
_SET_IP / _LOAD_DEREF 181 168 92.82% -13
_LOAD_ATTR_SLOT_0 / _SET_IP 215 202 93.95% -13
_INIT_CALL_PY_EXACT_ARGS_4 / _SAVE_RETURN_OFFSET 520 507 97.5% -13
_CHECK_VALIDITY / _LOAD_FAST_6 108 95 87.96% -13
_SET_IP / _STORE_SUBSCR_LIST_INT 328 315 96.04% -13
_LOAD_ATTR_METHOD_NO_DICT / _CHECK_VALIDITY_AND_SET_IP 139 126 90.65% -13
_SET_IP / _BUILD_TUPLE 207 194 93.72% -13
_CHECK_VALIDITY / _LOAD_FAST_5 108 95 87.96% -13
_INIT_CALL_PY_EXACT_ARGS_1 / _SAVE_RETURN_OFFSET 477 464 97.27% -13
_CHECK_STACK_SPACE / _INIT_CALL_PY_EXACT_ARGS_0 596 583 97.82% -13
_INIT_CALL_PY_EXACT_ARGS_0 / _SAVE_RETURN_OFFSET 479 466 97.29% -13
_LOAD_ATTR_METHOD_NO_DICT / _CHECK_VALIDITY 125 112 89.6% -13
_SET_IP / _FOR_ITER_TIER_TWO 311 298 95.82% -13
_SET_IP / _STORE_SUBSCR 297 284 95.62% -13
_CHECK_BUILTINS / _LOAD_CONST_INLINE_BORROW_WITH_NULL 124 111 89.52% -13
_CHECK_BUILTINS / _LOAD_CONST_INLINE_BORROW 116 103 88.79% -13
_SET_IP / _POP_FRAME 138 125 90.58% -13
_STORE_FAST_2 / _SET_IP 116 103 88.79% -13
_LOAD_FAST_5 / _SET_IP 59 46 77.97% -13
_LOAD_ATTR_INSTANCE_VALUE_0 / _LOAD_FAST_1 225 212 94.22% -13
_CHECK_GLOBALS / _LOAD_CONST_INLINE_WITH_NULL 132 119 90.15% -13
_SET_IP / _BUILD_LIST 207 194 93.72% -13
_LOAD_FAST_7 / _SET_IP 62 49 79.03% -13
_GUARD_IS_FALSE_POP / _CHECK_GLOBALS 204 191 93.63% -13
_CHECK_VALIDITY / _LOAD_FAST_3 108 95 87.96% -13
_COPY / _SET_IP 81 68 83.95% -13
_LOAD_CONST_INLINE_BORROW / _EXIT_TRACE 113 100 88.5% -13
_CHECK_GLOBALS / _LOAD_CONST_INLINE 124 111 89.52% -13
_CHECK_VALIDITY / _LOAD_FAST_4 108 95 87.96% -13
_BINARY_OP_SUBTRACT_INT / _SET_IP 250 237 94.8% -13
_GUARD_IS_TRUE_POP / _CHECK_GLOBALS 204 191 93.63% -13
_SET_IP / _BINARY_SUBSCR_DICT 360 347 96.39% -13
_SET_IP / _STORE_SLICE 343 330 96.21% -13
_SET_IP / _BINARY_OP_MULTIPLY_INT 250 237 94.8% -13
_SET_IP / _BUILD_SLICE 381 368 96.59% -13
_LOAD_FAST_3 / _CHECK_GLOBALS 118 105 88.98% -13
_BINARY_OP / _SET_IP 264 251 95.08% -13
_CHECK_VALIDITY / _PUSH_NULL 100 87 87.0% -13
_SET_IP / _GET_ITER 192 179 93.23% -13
_SET_IP / _LIST_EXTEND 355 342 96.34% -13
_SET_IP / _GET_ANEXT 390 377 96.67% -13
_SET_IP / _TO_BOOL 210 197 93.81% -13
_LOAD_ATTR_METHOD_WITH_VALUES / _CHECK_VALIDITY_AND_SET_IP 139 126 90.65% -13
_SET_IP / _STORE_SUBSCR_DICT 289 276 95.5% -13
_SET_IP / _BINARY_OP_SUBTRACT_INT 250 237 94.8% -13
_STORE_FAST_1 / _SET_IP 116 103 88.79% -13
_LOAD_FAST_6 / _SET_IP 59 46 77.97% -13
_LOAD_ATTR_INSTANCE_VALUE_0 / _LOAD_FAST_0 225 212 94.22% -13
_SET_IP / _BINARY_SLICE 278 265 95.32% -13
_STORE_FAST / _SET_IP 131 118 90.08% -13
_LOAD_FAST_5 / _CHECK_GLOBALS 118 105 88.98% -13
_LOAD_FAST_1 / _EXIT_TRACE 115 102 88.7% -13
_RESUME_CHECK / _LOAD_FAST_1 115 102 88.7% -13
_INIT_CALL_PY_EXACT_ARGS_2 / _SAVE_RETURN_OFFSET 486 473 97.33% -13
_LOAD_ATTR / _CHECK_VALIDITY 391 379 96.93% -12
_BUILD_TUPLE / _CHECK_VALIDITY 256 244 95.31% -12
_BINARY_OP_MULTIPLY_FLOAT / _EXIT_TRACE 336 324 96.43% -12
_BUILD_LIST / _CHECK_VALIDITY 256 244 95.31% -12
_GUARD_TYPE_VERSION / _GUARD_DORV_VALUES 201 189 94.03% -12
_BUILD_TUPLE / _CHECK_VALIDITY_AND_SET_IP 270 258 95.56% -12
_CONTAINS_OP / _CHECK_VALIDITY 336 325 96.73% -11
_BINARY_SUBSCR / _CHECK_VALIDITY 299 288 96.32% -11
_SET_IP / _CALL_BUILTIN_FAST 541 530 97.97% -11
_SET_IP / _CALL_BUILTIN_O 544 533 97.98% -11
_CHECK_VALIDITY_AND_SET_IP / _BINARY_SUBSCR 313 302 96.49% -11
_CHECK_VALIDITY_AND_SET_IP / _BINARY_OP 327 316 96.64% -11
_LOAD_ATTR_INSTANCE_VALUE_0 / _GUARD_BOTH_FLOAT 318 307 96.54% -11
_CHECK_VALIDITY / _STORE_FAST_0 165 154 93.33% -11
_LOAD_FAST_3 / _TO_BOOL_NONE 157 146 92.99% -11
_COMPARE_OP / _CHECK_VALIDITY 412 401 97.33% -11
_CHECK_VALIDITY_AND_SET_IP / _POP_FRAME 201 191 95.02% -10
_CHECK_VALIDITY / _STORE_FAST 180 171 95.0% -9
_ITER_NEXT_RANGE / _CHECK_VALIDITY 204 195 95.59% -9
_LOAD_DEREF / _CHECK_VALIDITY 230 221 96.09% -9
_CHECK_VALIDITY / _STORE_FAST_6 165 156 94.55% -9
_CHECK_VALIDITY / _STORE_FAST_3 165 156 94.55% -9
_CHECK_VALIDITY / _POP_TOP 152 143 94.08% -9
_SET_IP / _CALL_ISINSTANCE 483 474 98.14% -9
_LOAD_FAST_1 / _CALL_TYPE_1 365 356 97.53% -9
_SET_IP / _CALL_LEN 423 414 97.87% -9
_CHECK_VALIDITY / _STORE_FAST_4 165 156 94.55% -9
_CHECK_VALIDITY / _STORE_FAST_7 171 162 94.74% -9
_BINARY_SUBSCR_LIST_INT / _SET_IP 311 302 97.11% -9
_CHECK_VALIDITY_AND_SET_IP / _CALL_INTRINSIC_1 276 267 96.74% -9
_CALL_INTRINSIC_1 / _CHECK_VALIDITY 262 253 96.56% -9
_GET_ITER / _CHECK_VALIDITY 241 232 96.27% -9
_CHECK_VALIDITY / _STORE_FAST_1 165 156 94.55% -9
_GUARD_NOT_EXHAUSTED_RANGE / _ITER_NEXT_RANGE 209 201 96.17% -8
_STORE_FAST / _LOAD_FAST_0 136 128 94.12% -8
_STORE_FAST / _LOAD_FAST_1 136 128 94.12% -8
_LOAD_ATTR / _CHECK_VALIDITY_AND_SET_IP 405 397 98.02% -8
_STORE_FAST_1 / _LOAD_FAST_0 121 113 93.39% -8
_STORE_FAST_3 / _LOAD_FAST_3 121 113 93.39% -8
_STORE_FAST / _LOAD_FAST_6 136 128 94.12% -8
_STORE_FAST_5 / _LOAD_FAST_5 121 113 93.39% -8
_STORE_FAST_4 / _LOAD_FAST_4 121 113 93.39% -8
_STORE_FAST_5 / _LOAD_FAST_3 121 113 93.39% -8
_POP_TOP / _LOAD_FAST_0 108 100 92.59% -8
_STORE_FAST / _LOAD_FAST_4 136 128 94.12% -8
_STORE_FAST_0 / _LOAD_FAST_0 121 113 93.39% -8
_STORE_FAST_4 / _LOAD_FAST_1 121 113 93.39% -8
_STORE_FAST_7 / _LOAD_FAST_3 127 119 93.7% -8
_STORE_FAST_2 / _LOAD_FAST_2 121 113 93.39% -8
_STORE_FAST_1 / _LOAD_FAST_1 121 113 93.39% -8
_STORE_FAST_3 / _LOAD_FAST_2 121 113 93.39% -8
_STORE_FAST_5 / _LOAD_FAST_4 121 113 93.39% -8
_GET_ANEXT / _CHECK_VALIDITY 439 432 98.41% -7
_GUARD_DORV_VALUES / _STORE_ATTR_INSTANCE_VALUE 264 257 97.35% -7
_COMPARE_OP_STR / _CHECK_VALIDITY 286 280 97.9% -6
_STORE_FAST_7 / _LOAD_FAST_7 130 125 96.15% -5
_STORE_FAST / _LOAD_FAST_7 139 134 96.4% -5
_STORE_SUBSCR / _CHECK_VALIDITY 346 342 98.84% -4
_CHECK_VALIDITY / _JUMP_TO_TOP 184 180 97.83% -4
_LOAD_GLOBAL / _CHECK_VALIDITY 444 443 99.77% -1
_STORE_FAST_6 / _LOAD_CONST_INLINE_WITH_NULL 135 136 100.74% 1
_CHECK_VALIDITY / _IS_OP 271 273 100.74% 2
_TO_BOOL / _CHECK_VALIDITY 259 262 101.16% 3
_BUILD_SLICE / _CHECK_VALIDITY_AND_SET_IP 444 453 102.03% 9
_LIST_EXTEND / _CHECK_VALIDITY_AND_SET_IP 418 435 104.07% 17
_START_EXECUTOR / _CHECK_VALIDITY 162 180 111.11% 18
_START_EXECUTOR / _CHECK_VALIDITY_AND_SET_IP 176 194 110.23% 18
_LOAD_ATTR_INSTANCE_VALUE_0 / _TO_BOOL_BOOL 308 326 105.84% 18
_LOAD_ATTR_SLOT_0 / _TO_BOOL_BOOL 303 321 105.94% 18
_STORE_ATTR_INSTANCE_VALUE / _CHECK_VALIDITY 260 279 107.31% 19

Of course, the specifics of exactly how much shorter a superinstruction is will vary with implementation details and by platform.


So, how much shorter is "short enough to be worth it?" Should superinstructions be weighted by their quality and prevalence? Probably some experimentation is needed. And of course there are lots of other ways candidate pairs/sequences could be identified, like @Fidget-Spinner's method above.
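One way to frame the weighting question is to score each candidate pair by the bytes it saves per occurrence multiplied by how often it occurs. The sketch below is illustrative only. The byte counts are three rows copied from the table above, assuming the two numeric columns are the summed length of the separate stencils and the length of the combined stencil. The occurrence counts and the `Candidate`, `score` and `select` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    separate_len: int  # bytes when the two stencils are emitted back to back
    combined_len: int  # bytes for the single superinstruction stencil
    count: int         # HYPOTHETICAL: how often the pair was seen in traces

    @property
    def saved(self) -> int:
        """Bytes saved per occurrence (negative if the pair got longer)."""
        return self.separate_len - self.combined_len


def score(c: Candidate) -> int:
    """Total bytes saved across all observed occurrences."""
    return c.saved * c.count


def select(candidates, min_saved):
    """Keep pairs saving at least min_saved bytes, best total saving first."""
    kept = [c for c in candidates if c.saved >= min_saved]
    return sorted(kept, key=score, reverse=True)


candidates = [
    Candidate("_LOAD_CONST_INLINE / _SET_IP", 65, 52, count=1_000),
    Candidate("_SET_IP / _LOAD_DEREF", 181, 168, count=40_000),
    Candidate("_STORE_ATTR_INSTANCE_VALUE / _CHECK_VALIDITY", 260, 279, count=90_000),
]

for c in select(candidates, min_saved=13):
    print(f"{c.name}: saves {c.saved} bytes each, {score(c)} bytes in total")
```

With a pure length cutoff both 13-byte pairs look identical; the frequency term is what separates them, and a pair that grows (like the last row) is dropped no matter how common it is.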

I tried a couple of arbitrary test points locally:


Using the 83 pairs* that reduce the machine code count by 20 or more: ~2% Faster (Local)

Benchmarks with tag 'apps':

Benchmark main-jit uop-stats-20-mc-or-better-pairs-83
2to3 297 ms 292 ms: 1.02x faster
chameleon 6.51 ms 6.38 ms: 1.02x faster
docutils 2.74 sec 2.65 sec: 1.03x faster
Geometric mean (ref) 1.01x faster

Benchmark hidden because not significant (2): html5lib, tornado_http

Benchmarks with tag 'asyncio':

Benchmark main-jit uop-stats-20-mc-or-better-pairs-83
async_tree_eager_tg 83.6 ms 77.0 ms: 1.09x faster
async_tree_io 1.12 sec 1.06 sec: 1.05x faster
async_tree_memoization_tg 551 ms 530 ms: 1.04x faster
async_tree_eager_memoization 275 ms 264 ms: 1.04x faster
async_tree_io_tg 1.11 sec 1.08 sec: 1.03x faster
async_tree_eager_memoization_tg 206 ms 201 ms: 1.02x faster
async_tree_none_tg 423 ms 415 ms: 1.02x faster
async_tree_memoization 541 ms 533 ms: 1.02x faster
async_tree_cpu_io_mixed_tg 733 ms 741 ms: 1.01x slower
async_tree_eager_cpu_io_mixed 473 ms 478 ms: 1.01x slower
async_tree_eager_io 1.06 sec 1.09 sec: 1.02x slower
async_tree_eager_cpu_io_mixed_tg 409 ms 420 ms: 1.03x slower
Geometric mean (ref) 1.01x faster

Benchmark hidden because not significant (4): async_tree_cpu_io_mixed, async_tree_eager, async_tree_none, async_tree_eager_io_tg

Benchmarks with tag 'math':

Benchmark main-jit uop-stats-20-mc-or-better-pairs-83
nbody 85.3 ms 78.9 ms: 1.08x faster
float 78.1 ms 76.6 ms: 1.02x faster
Geometric mean (ref) 1.03x faster

Benchmark hidden because not significant (1): pidigits

Benchmarks with tag 'regex':

Benchmark main-jit uop-stats-20-mc-or-better-pairs-83
regex_compile 172 ms 169 ms: 1.02x faster
regex_effbot 2.74 ms 2.75 ms: 1.01x slower
regex_dna 160 ms 161 ms: 1.01x slower
regex_v8 21.4 ms 21.9 ms: 1.02x slower
Geometric mean (ref) 1.00x slower

Benchmarks with tag 'serialize':

Benchmark main-jit uop-stats-20-mc-or-better-pairs-83
tomli_loads 2.24 sec 2.13 sec: 1.05x faster
pickle_dict 32.3 us 31.5 us: 1.03x faster
pickle_pure_python 291 us 286 us: 1.02x faster
xml_etree_process 63.4 ms 62.5 ms: 1.01x faster
json_loads 27.0 us 26.7 us: 1.01x faster
json_dumps 10.6 ms 10.5 ms: 1.01x faster
pickle 11.7 us 11.6 us: 1.01x faster
pickle_list 4.96 us 4.93 us: 1.01x faster
Geometric mean (ref) 1.01x faster

Benchmark hidden because not significant (6): unpickle, xml_etree_iterparse, unpickle_list, unpickle_pure_python, xml_etree_generate, xml_etree_parse

Benchmarks with tag 'startup':

Benchmark main-jit uop-stats-20-mc-or-better-pairs-83
python_startup_no_site 12.4 ms 12.8 ms: 1.03x slower
Geometric mean (ref) 1.01x slower

Benchmark hidden because not significant (1): python_startup

Benchmarks with tag 'template':

Benchmark main-jit uop-stats-20-mc-or-better-pairs-83
mako 11.4 ms 11.0 ms: 1.04x faster
genshi_text 24.9 ms 24.2 ms: 1.03x faster
genshi_xml 62.6 ms 63.3 ms: 1.01x slower
Geometric mean (ref) 1.02x faster

All benchmarks:

Benchmark main-jit uop-stats-20-mc-or-better-pairs-83
unpack_sequence 136 ns 97.7 ns: 1.39x faster
bench_mp_pool 42.0 ms 32.9 ms: 1.28x faster
pathlib 21.8 ms 19.1 ms: 1.15x faster
async_tree_eager_tg 83.6 ms 77.0 ms: 1.09x faster
deepcopy 378 us 349 us: 1.08x faster
nbody 85.3 ms 78.9 ms: 1.08x faster
deltablue 3.77 ms 3.52 ms: 1.07x faster
hexiom 8.28 ms 7.76 ms: 1.07x faster
scimark_monte_carlo 73.2 ms 69.4 ms: 1.05x faster
async_tree_io 1.12 sec 1.06 sec: 1.05x faster
tomli_loads 2.24 sec 2.13 sec: 1.05x faster
raytrace 307 ms 292 ms: 1.05x faster
dask 687 ms 654 ms: 1.05x faster
deepcopy_reduce 3.31 us 3.17 us: 1.04x faster
telco 8.19 ms 7.88 ms: 1.04x faster
async_tree_memoization_tg 551 ms 530 ms: 1.04x faster
async_tree_eager_memoization 275 ms 264 ms: 1.04x faster
mako 11.4 ms 11.0 ms: 1.04x faster
pyflate 507 ms 489 ms: 1.04x faster
spectral_norm 115 ms 111 ms: 1.04x faster
docutils 2.74 sec 2.65 sec: 1.03x faster
richards_super 56.0 ms 54.2 ms: 1.03x faster
deepcopy_memo 37.0 us 35.9 us: 1.03x faster
genshi_text 24.9 ms 24.2 ms: 1.03x faster
meteor_contest 101 ms 98.3 ms: 1.03x faster
scimark_fft 331 ms 322 ms: 1.03x faster
pickle_dict 32.3 us 31.5 us: 1.03x faster
logging_simple 6.95 us 6.77 us: 1.03x faster
crypto_pyaes 76.2 ms 74.2 ms: 1.03x faster
async_tree_io_tg 1.11 sec 1.08 sec: 1.03x faster
asyncio_tcp 394 ms 385 ms: 1.02x faster
async_tree_eager_memoization_tg 206 ms 201 ms: 1.02x faster
scimark_lu 145 ms 142 ms: 1.02x faster
coverage 370 ms 361 ms: 1.02x faster
logging_format 8.01 us 7.84 us: 1.02x faster
sqlglot_optimize 60.2 ms 58.9 ms: 1.02x faster
chameleon 6.51 ms 6.38 ms: 1.02x faster
generators 26.7 ms 26.2 ms: 1.02x faster
float 78.1 ms 76.6 ms: 1.02x faster
async_tree_none_tg 423 ms 415 ms: 1.02x faster
sqlglot_transpile 1.63 ms 1.60 ms: 1.02x faster
regex_compile 172 ms 169 ms: 1.02x faster
sqlite_synth 2.56 us 2.52 us: 1.02x faster
2to3 297 ms 292 ms: 1.02x faster
pickle_pure_python 291 us 286 us: 1.02x faster
sqlglot_normalize 115 ms 113 ms: 1.02x faster
typing_runtime_protocols 122 us 120 us: 1.02x faster
async_tree_memoization 541 ms 533 ms: 1.02x faster
go 151 ms 149 ms: 1.01x faster
xml_etree_process 63.4 ms 62.5 ms: 1.01x faster
richards 49.3 ms 48.7 ms: 1.01x faster
dulwich_log 83.9 ms 83.0 ms: 1.01x faster
sqlglot_parse 1.31 ms 1.29 ms: 1.01x faster
json_loads 27.0 us 26.7 us: 1.01x faster
json_dumps 10.6 ms 10.5 ms: 1.01x faster
pickle 11.7 us 11.6 us: 1.01x faster
pickle_list 4.96 us 4.93 us: 1.01x faster
comprehensions 18.2 us 18.1 us: 1.01x faster
asyncio_tcp_ssl 1.33 sec 1.32 sec: 1.00x faster
regex_effbot 2.74 ms 2.75 ms: 1.01x slower
regex_dna 160 ms 161 ms: 1.01x slower
coroutines 22.2 ms 22.3 ms: 1.01x slower
async_generators 419 ms 423 ms: 1.01x slower
async_tree_cpu_io_mixed_tg 733 ms 741 ms: 1.01x slower
asyncio_websockets 442 ms 447 ms: 1.01x slower
async_tree_eager_cpu_io_mixed 473 ms 478 ms: 1.01x slower
genshi_xml 62.6 ms 63.3 ms: 1.01x slower
scimark_sor 140 ms 142 ms: 1.01x slower
create_gc_cycles 1.06 ms 1.08 ms: 1.02x slower
async_tree_eager_io 1.06 sec 1.09 sec: 1.02x slower
regex_v8 21.4 ms 21.9 ms: 1.02x slower
async_tree_eager_cpu_io_mixed_tg 409 ms 420 ms: 1.03x slower
python_startup_no_site 12.4 ms 12.8 ms: 1.03x slower
Geometric mean (ref) 1.02x faster

Benchmark hidden because not significant (24): pprint_safe_repr, fannkuch, nqueens, unpickle, gc_traversal, xml_etree_iterparse, python_startup, scimark_sparse_mat_mult, tornado_http, chaos, unpickle_list, unpickle_pure_python, async_tree_cpu_io_mixed, logging_silent, pprint_pformat, async_tree_eager, xml_etree_generate, pidigits, async_tree_none, html5lib, mdp, async_tree_eager_io_tg, xml_etree_parse, bench_thread_pool

Using the 251 pairs* that reduce the machine code count by 13 or more: ~4% Faster (Local)

Benchmarks with tag 'apps':

Benchmark main-jit uop-stats-13-mc-or-better-pairs-251
2to3 297 ms 284 ms: 1.04x faster
docutils 2.74 sec 2.62 sec: 1.05x faster
Geometric mean (ref) 1.02x faster

Benchmark hidden because not significant (3): chameleon, html5lib, tornado_http

Benchmarks with tag 'asyncio':

Benchmark main-jit uop-stats-13-mc-or-better-pairs-251
async_tree_eager_tg 83.6 ms 77.5 ms: 1.08x faster
async_tree_memoization_tg 551 ms 522 ms: 1.05x faster
async_tree_io_tg 1.11 sec 1.06 sec: 1.05x faster
async_tree_io 1.12 sec 1.07 sec: 1.05x faster
async_tree_memoization 541 ms 518 ms: 1.04x faster
async_tree_eager_cpu_io_mixed_tg 409 ms 392 ms: 1.04x faster
async_tree_eager_cpu_io_mixed 473 ms 455 ms: 1.04x faster
async_tree_eager_memoization 275 ms 265 ms: 1.04x faster
async_tree_cpu_io_mixed 732 ms 708 ms: 1.03x faster
async_tree_cpu_io_mixed_tg 733 ms 718 ms: 1.02x faster
async_tree_eager_memoization_tg 206 ms 202 ms: 1.02x faster
async_tree_eager 116 ms 114 ms: 1.02x faster
Geometric mean (ref) 1.03x faster

Benchmark hidden because not significant (4): async_tree_none, async_tree_eager_io, async_tree_none_tg, async_tree_eager_io_tg

Benchmarks with tag 'math':

Benchmark main-jit uop-stats-13-mc-or-better-pairs-251
nbody 85.3 ms 77.2 ms: 1.11x faster
float 78.1 ms 74.9 ms: 1.04x faster
pidigits 191 ms 183 ms: 1.04x faster
Geometric mean (ref) 1.06x faster

Benchmarks with tag 'regex':

Benchmark main-jit uop-stats-13-mc-or-better-pairs-251
regex_compile 172 ms 162 ms: 1.06x faster
regex_dna 160 ms 161 ms: 1.00x slower
regex_v8 21.4 ms 21.8 ms: 1.02x slower
Geometric mean (ref) 1.01x faster

Benchmark hidden because not significant (1): regex_effbot

Benchmarks with tag 'serialize':

Benchmark main-jit uop-stats-13-mc-or-better-pairs-251
tomli_loads 2.24 sec 2.06 sec: 1.09x faster
pickle_dict 32.3 us 31.5 us: 1.03x faster
unpickle_pure_python 244 us 238 us: 1.03x faster
xml_etree_parse 142 ms 140 ms: 1.01x faster
pickle_list 4.96 us 4.89 us: 1.01x faster
pickle_pure_python 291 us 287 us: 1.01x faster
xml_etree_iterparse 103 ms 102 ms: 1.01x faster
pickle 11.7 us 11.5 us: 1.01x faster
xml_etree_process 63.4 ms 62.9 ms: 1.01x faster
json_dumps 10.6 ms 10.5 ms: 1.01x faster
xml_etree_generate 94.0 ms 93.5 ms: 1.01x faster
unpickle 15.6 us 15.8 us: 1.01x slower
Geometric mean (ref) 1.02x faster

Benchmark hidden because not significant (2): unpickle_list, json_loads

Benchmarks with tag 'startup':

Benchmark main-jit uop-stats-13-mc-or-better-pairs-251
python_startup 13.4 ms 13.6 ms: 1.02x slower
python_startup_no_site 12.4 ms 12.7 ms: 1.02x slower
Geometric mean (ref) 1.02x slower

Benchmarks with tag 'template':

Benchmark main-jit uop-stats-13-mc-or-better-pairs-251
mako 11.4 ms 10.8 ms: 1.05x faster
genshi_text 24.9 ms 23.9 ms: 1.04x faster
genshi_xml 62.6 ms 61.3 ms: 1.02x faster
Geometric mean (ref) 1.04x faster

All benchmarks:

Benchmark main-jit uop-stats-13-mc-or-better-pairs-251
unpack_sequence 136 ns 98.7 ns: 1.38x faster
pathlib 21.8 ms 18.9 ms: 1.16x faster
scimark_monte_carlo 73.2 ms 65.0 ms: 1.13x faster
pyflate 507 ms 450 ms: 1.13x faster
hexiom 8.28 ms 7.43 ms: 1.11x faster
nbody 85.3 ms 77.2 ms: 1.11x faster
deltablue 3.77 ms 3.44 ms: 1.10x faster
raytrace 307 ms 281 ms: 1.09x faster
tomli_loads 2.24 sec 2.06 sec: 1.09x faster
pprint_safe_repr 946 ms 870 ms: 1.09x faster
pprint_pformat 1.99 sec 1.83 sec: 1.08x faster
deepcopy 378 us 349 us: 1.08x faster
async_tree_eager_tg 83.6 ms 77.5 ms: 1.08x faster
spectral_norm 115 ms 107 ms: 1.07x faster
scimark_fft 331 ms 311 ms: 1.07x faster
crypto_pyaes 76.2 ms 71.5 ms: 1.07x faster
regex_compile 172 ms 162 ms: 1.06x faster
dask 687 ms 646 ms: 1.06x faster
chaos 69.1 ms 65.1 ms: 1.06x faster
richards_super 56.0 ms 52.8 ms: 1.06x faster
richards 49.3 ms 46.5 ms: 1.06x faster
async_tree_memoization_tg 551 ms 522 ms: 1.05x faster
mako 11.4 ms 10.8 ms: 1.05x faster
telco 8.19 ms 7.79 ms: 1.05x faster
deepcopy_reduce 3.31 us 3.15 us: 1.05x faster
deepcopy_memo 37.0 us 35.2 us: 1.05x faster
async_tree_io_tg 1.11 sec 1.06 sec: 1.05x faster
async_tree_io 1.12 sec 1.07 sec: 1.05x faster
docutils 2.74 sec 2.62 sec: 1.05x faster
async_tree_memoization 541 ms 518 ms: 1.04x faster
coverage 370 ms 354 ms: 1.04x faster
2to3 297 ms 284 ms: 1.04x faster
nqueens 97.4 ms 93.4 ms: 1.04x faster
async_tree_eager_cpu_io_mixed_tg 409 ms 392 ms: 1.04x faster
genshi_text 24.9 ms 23.9 ms: 1.04x faster
float 78.1 ms 74.9 ms: 1.04x faster
logging_simple 6.95 us 6.68 us: 1.04x faster
pidigits 191 ms 183 ms: 1.04x faster
sqlglot_optimize 60.2 ms 57.9 ms: 1.04x faster
sqlglot_transpile 1.63 ms 1.57 ms: 1.04x faster
async_tree_eager_cpu_io_mixed 473 ms 455 ms: 1.04x faster
logging_format 8.01 us 7.72 us: 1.04x faster
fannkuch 423 ms 408 ms: 1.04x faster
async_tree_eager_memoization 275 ms 265 ms: 1.04x faster
comprehensions 18.2 us 17.6 us: 1.04x faster
scimark_sparse_mat_mult 4.82 ms 4.66 ms: 1.03x faster
async_tree_cpu_io_mixed 732 ms 708 ms: 1.03x faster
go 151 ms 147 ms: 1.03x faster
meteor_contest 101 ms 98.4 ms: 1.03x faster
sqlglot_parse 1.31 ms 1.27 ms: 1.03x faster
pickle_dict 32.3 us 31.5 us: 1.03x faster
unpickle_pure_python 244 us 238 us: 1.03x faster
sqlite_synth 2.56 us 2.49 us: 1.03x faster
gc_traversal 3.31 ms 3.22 ms: 1.03x faster
scimark_lu 145 ms 142 ms: 1.02x faster
dulwich_log 83.9 ms 82.1 ms: 1.02x faster
async_tree_cpu_io_mixed_tg 733 ms 718 ms: 1.02x faster
sqlglot_normalize 115 ms 112 ms: 1.02x faster
genshi_xml 62.6 ms 61.3 ms: 1.02x faster
async_tree_eager_memoization_tg 206 ms 202 ms: 1.02x faster
create_gc_cycles 1.06 ms 1.05 ms: 1.02x faster
async_tree_eager 116 ms 114 ms: 1.02x faster
xml_etree_parse 142 ms 140 ms: 1.01x faster
pickle_list 4.96 us 4.89 us: 1.01x faster
asyncio_tcp_ssl 1.33 sec 1.31 sec: 1.01x faster
pickle_pure_python 291 us 287 us: 1.01x faster
xml_etree_iterparse 103 ms 102 ms: 1.01x faster
pickle 11.7 us 11.5 us: 1.01x faster
generators 26.7 ms 26.5 ms: 1.01x faster
xml_etree_process 63.4 ms 62.9 ms: 1.01x faster
json_dumps 10.6 ms 10.5 ms: 1.01x faster
xml_etree_generate 94.0 ms 93.5 ms: 1.01x faster
asyncio_websockets 442 ms 441 ms: 1.00x faster
regex_dna 160 ms 161 ms: 1.00x slower
async_generators 419 ms 421 ms: 1.01x slower
unpickle 15.6 us 15.8 us: 1.01x slower
coroutines 22.2 ms 22.4 ms: 1.01x slower
python_startup 13.4 ms 13.6 ms: 1.02x slower
regex_v8 21.4 ms 21.8 ms: 1.02x slower
logging_silent 103 ns 105 ns: 1.02x slower
python_startup_no_site 12.4 ms 12.7 ms: 1.02x slower
Geometric mean (ref) 1.04x faster

Benchmark hidden because not significant (16): bench_mp_pool, async_tree_none, tornado_http, typing_runtime_protocols, chameleon, unpickle_list, json_loads, asyncio_tcp, scimark_sor, regex_effbot, async_tree_eager_io, async_tree_none_tg, mdp, html5lib, async_tree_eager_io_tg, bench_thread_pool


*Not including pairs that include _JUMP_TO_TOP - currently, including that in a superinstruction segfaults.
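As a reading aid for the "Geometric mean" rows in the results above, here is a minimal sketch of how such a summary can be recomputed from the per-benchmark timings. It uses the rounded timings printed for the 83-pair run, and it assumes that a benchmark hidden as "not significant" contributes a ratio of about 1.0, so it only approximates what the benchmark tooling does with unrounded data.

```python
import math


def geometric_mean(ratios):
    """Geometric mean of speedup ratios (old time / new time)."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# (old, new) timings copied from the 'template' results of the 83-pair run:
# mako, genshi_text, genshi_xml
template = [(11.4, 11.0), (24.9, 24.2), (62.6, 63.3)]
template_mean = geometric_mean([old / new for old, new in template])

# 'math' results of the same run: nbody and float, plus pidigits, which was
# hidden as not significant and is ASSUMED here to have a ratio of 1.0
math_ratios = [85.3 / 78.9, 78.1 / 76.6, 1.0]
math_mean = geometric_mean(math_ratios)

print(f"template: {template_mean:.2f}x, math: {math_mean:.2f}x")
```

Under those assumptions both values round to the figures reported above (1.02x for 'template', 1.03x for 'math'), which suggests the hidden benchmarks still count toward the mean.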

@JeffersGlass
Contributor Author

JeffersGlass commented Jul 3, 2024

I thought I'd share an update on the superinstruction experiments. I've started from scratch in this branch by teaching the bytecode analyzer to understand superinstructions:

// supernodes.c
super() = _LOAD_FAST_1 + _GUARD_BOTH_INT
super() = _TIER2_RESUME_CHECK + _SET_IP
super() = _BINARY_OP_ADD_INT + _LOAD_CONST_INLINE_BORROW

Which allows it to calculate and emit metadata and IDs the same way it handles bytecodes:

[_LOAD_FAST_1_PLUS__GUARD_BOTH_INT] = HAS_LOCAL_FLAG | HAS_EXIT_FLAG,
[_TIER2_RESUME_CHECK_PLUS__SET_IP] = HAS_DEOPT_FLAG | HAS_OPERAND_FLAG,
[_BINARY_OP_ADD_INT_PLUS__LOAD_CONST_INLINE_BORROW] = HAS_ERROR_FLAG | HAS_PURE_FLAG | HAS_OPERAND_FLAG,
...

#define _WITH_EXCEPT_START WITH_EXCEPT_START
#define _YIELD_VALUE YIELD_VALUE
#define MAX_VANILLA_UOP_ID 451

#define _LOAD_FAST_1_PLUS__GUARD_BOTH_INT 452
#define _TIER2_RESUME_CHECK_PLUS__SET_IP 453
#define _BINARY_OP_ADD_INT_PLUS__LOAD_CONST_INLINE_BORROW 454
#define MAX_UOP_ID 454
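To make the naming and numbering pattern in the output above concrete, here is a rough Python sketch of how such metadata could be derived. It is not the analyzer's actual code: the per-uop flag sets are assumed splits of the combined flags shown above, and treating a supernode's flags as the union of its parts is an inference from those examples. `UOP_FLAGS`, `supernode_name` and `describe` are hypothetical names.

```python
# ASSUMED per-uop flags, chosen so that their union matches the combined
# flags shown above; the real values come from the generated uop metadata.
UOP_FLAGS = {
    "_LOAD_FAST_1": {"HAS_LOCAL_FLAG"},
    "_GUARD_BOTH_INT": {"HAS_EXIT_FLAG"},
    "_TIER2_RESUME_CHECK": {"HAS_DEOPT_FLAG"},
    "_SET_IP": {"HAS_OPERAND_FLAG"},
}

MAX_VANILLA_UOP_ID = 451  # value taken from the generated header above


def supernode_name(parts):
    """Join component uop names the way the generated IDs are spelled."""
    return "_PLUS_".join(parts)


def describe(supernodes):
    """Return (name, id, flags) for each supernode, in declaration order."""
    result = []
    for offset, parts in enumerate(supernodes, start=1):
        flags = set().union(*(UOP_FLAGS[p] for p in parts))
        result.append(
            (supernode_name(parts), MAX_VANILLA_UOP_ID + offset, sorted(flags))
        )
    return result


for name, uop_id, flags in describe(
    [("_LOAD_FAST_1", "_GUARD_BOTH_INT"), ("_TIER2_RESUME_CHECK", "_SET_IP")]
):
    print(f"#define {name} {uop_id}  /* {' | '.join(flags)} */")
```

Numbering supernodes after `MAX_VANILLA_UOP_ID` keeps the existing uop IDs stable, so code that only knows about plain uops is unaffected.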

We generate a switch statement in a similar way to previous experiments (in jit_switch_generator.py), but this time that function is included via a header file instead of modifying template.c directly, making it much more readable.

//jit_switch.c
SuperNode
_JIT_INDEX(const _PyUOpInstruction *uops, uint16_t start_index) {
    switch (uops[start_index + 0].opcode) {
        case _LOAD_FAST_1:
            switch (uops[start_index + 1].opcode) {
                case _GUARD_BOTH_INT:
                    return (SuperNode) {.index = _LOAD_FAST_1_PLUS__GUARD_BOTH_INT, .length = 2};
                    break;
                ...
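The nested switch is effectively a prefix tree over opcode sequences. A small Python model of the same lookup, purely for illustration, might look like the following. `build_trie` and `jit_index` are hypothetical names, and the fallback of returning the single uop with length 1 is an assumption, since the no-match branch of `_JIT_INDEX` is not shown above.

```python
def build_trie(supernodes):
    """Nested dicts keyed by opcode; the None key marks a complete supernode."""
    root = {}
    for parts in supernodes:
        node = root
        for opcode in parts:
            node = node.setdefault(opcode, {})
        node[None] = "_PLUS_".join(parts)
    return root


def jit_index(trie, uops, start):
    """Return (name, length) of the longest supernode starting at uops[start].

    ASSUMPTION: when nothing matches, fall back to the plain uop, length 1.
    """
    best = (uops[start], 1)
    node = trie
    for offset, opcode in enumerate(uops[start:], start=1):
        node = node.get(opcode)
        if node is None:
            break  # no supernode continues with this opcode
        if None in node:
            best = (node[None], offset)  # remember the longest match so far
    return best


trie = build_trie([
    ("_LOAD_FAST_1", "_GUARD_BOTH_INT"),
    ("_TIER2_RESUME_CHECK", "_SET_IP"),
])
trace = ["_LOAD_FAST_1", "_GUARD_BOTH_INT", "_BINARY_OP_ADD_INT"]

index = 0
while index < len(trace):
    name, length = jit_index(trie, trace, index)
    print(index, name, length)
    index += length  # skip the uops consumed by the supernode
```

Preferring the longest match is a design choice in this sketch; a real selector might instead prefer whichever candidate saves the most bytes.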

The biggest new experiment is automatically iterating on sets of supernodes. Tools/scripts/supernode_analysis.py contains tools for analyzing a set of pystats and deriving a new set of supernodes from the given data. For instance:

# Run (up to) 5 generations of build/run-pystats/derive-new-supernodes, using 4 threads for builds, verbose=1, running only the docutils benchmark
$ python Tools/scripts/supernode_analysis.py iterate -v -i5 -j4 -b docutils

Beginning supernode generation process for 10 iterations max
Starting supernode generation 1 of 10
  Generating statistics
  Updating supernode metadata and building JIT
  Generating supernodes from stats
  Added 187 of 1224 possible supernodes that make up more than 0.1% of nodes and are viable
  Updating supernode metadata and building JIT
  Added 128 supernodes
Starting supernode generation 2 of 10
  Generating statistics
  Stat-ing python with 128 of 128 nodes
  Updating supernode metadata and building JIT
  ...
# see help for full description
$ python Tools/scripts/supernode_analysis.py --help

There are still some bugs floating around in superinstruction construction/usage (and possibly in Tier 2 itself?), so this script will, by default, detect errors during Python builds and during the pystats runs, bisect to find the troublesome superinstructions, and remove them from the run:

  ...
  Stat-ing python with 28 of 28 nodes
  Updating supernodes.c
  Updating supernode metadata and building JIT
  Stat FAILED, bisecting
  Stat-ing python with 14 of 28 nodes
  Updating supernodes.c
  ...
  Identified bad node during stat: _START_EXECUTOR_PLUS__POP_TOP
  Building Python with 27 nodes
  ...
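The bisection step in that log can be sketched roughly as below. This is a simplified model rather than the script's real logic: it assumes exactly one bad supernode, and that it fails no matter which other supernodes are built alongside it. `find_bad_node` and `run_ok` are hypothetical names, and `run_ok` stands in for "build Python with these nodes and run pystats".

```python
def find_bad_node(nodes, run_ok):
    """Bisect nodes to find the one whose presence makes run_ok fail.

    ASSUMES exactly one bad node, failing independently of the others.
    """
    candidates = list(nodes)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if run_ok(half):
            # First half is clean, so the culprit is in the remainder.
            candidates = candidates[len(half):]
        else:
            candidates = half
    return candidates[0]


# Stand-in for an expensive build + stats run (hypothetical node names).
BAD = "_START_EXECUTOR_PLUS__POP_TOP"
nodes = [f"node_{i}" for i in range(27)] + [BAD]


def run_ok(subset):
    return BAD not in subset


culprit = find_bad_node(nodes, run_ok)
remaining = [n for n in nodes if n != culprit]
print(f"Identified bad node: {culprit}; continuing with {len(remaining)} nodes")
```

Each bisection round costs a full build, so with 28 nodes this takes about five builds instead of one per node.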

There's much more to be done - I'm tracking granular todos in an issue on my fork, but some big areas of investigation:

  • How to select supernodes using pystats. Currently, there's a hardcoded percentage threshold to add a pair as a new supernode, and a (lower) threshold to drop an existing supernode. These should at least be tuned, but there are probably smarter metrics. Data on how much each supernode decreases byte length should possibly be incorporated, but it's unclear at what stage.
  • Supernode format - the use of one oparg/operand/target per superinstruction may prove limiting in the long run - the longest superinstruction I've seen 'in the wild' is 7 instructions (_CHECK_PERIODIC + _CHECK_VALIDITY + _STORE_FAST_4 + _LOAD_FAST_3 + _LOAD_FAST_4 + _LOAD_FAST_2 + _BUILD_TUPLE), but longer sequences are surely possible with a more flexible format.
  • How efficient is supernode selection at patch time - i.e. the giant generated switch statement that hopefully the compiler is optimizing.
  • Platform support. This currently works on Linux/macOS only; the Windows build steps still need work. And if byte-length data is used, its relative merits need to be assessed across platforms.
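For the first item in the list above, the add/drop thresholds amount to a simple hysteresis rule. A minimal sketch, assuming the 0.1% add threshold quoted in the log output earlier and a made-up drop threshold (the real value is not stated here), could look like this. `next_generation` and the share figures are hypothetical.

```python
ADD_THRESHOLD = 0.001    # 0.1% of executed nodes, as in the log output above
DROP_THRESHOLD = 0.0005  # HYPOTHETICAL lower bound for keeping a supernode


def next_generation(current, share):
    """Return the supernode set for the next iteration.

    current: set of supernode names currently built into the JIT
    share:   mapping of name -> fraction of executed nodes (from pystats)
    """
    kept = {n for n in current if share.get(n, 0.0) >= DROP_THRESHOLD}
    added = {
        n for n, s in share.items() if n not in current and s > ADD_THRESHOLD
    }
    return kept | added


current = {"A_PLUS_B", "C_PLUS_D"}
share = {
    "A_PLUS_B": 0.0007,  # below the add bar, but still above the drop bar
    "C_PLUS_D": 0.0002,  # fell below the drop bar
    "E_PLUS_F": 0.0020,  # new and common enough to add
    "G_PLUS_H": 0.0008,  # new, but not common enough to add
}
print(sorted(next_generation(current, share)))
```

Keeping the drop bar below the add bar stops a supernode that hovers around 0.1% from being added and removed on alternating generations.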

It was a pleasure to meet so many in this thread at PyConUS in May - thanks for conversations during the open spaces and sprints. I hope these experiments prove useful - I will share more observations and results as they pop up.

@Fidget-Spinner
Collaborator

Fidget-Spinner commented Dec 21, 2024

FWIW, I gave another shot at superinstructions. With the newest JIT, it shows no speedup on my computer https://github.com/Fidget-Spinner/cpython/pull/new/Fidget-Spinner:cpython:supernodes. Even on small microbenchmarks (e.g. iterative fibonacci). This branch's superinstructions support up to 7 instructions (14 operands!). (Last working commit: d95dcdd55106f7f228ac8c84a979d52cfeb4578b)

I suspect most of the previous wins were from removing the zero-length jumps. Since the JIT is now emitting more efficient code, I don't think we need this anymore.

@Fidget-Spinner
Collaborator

I tried a true "baseline" JIT: that is, turning tier 1 bytecode directly to JIT stencils where possible. https://github.com/Fidget-Spinner/cpython/pull/new/Fidget-Spinner:cpython:tier1_baseline

This should be a really strong case for superinstructions. However, there's almost no speedup there either on iterative fibonacci.

@Fidget-Spinner
Collaborator

Fidget-Spinner commented Dec 29, 2024

I pulled out the copy-and-patch paper again and indeed this corresponds to what I've found: only a 1% speedup on fibonacci:

image
(Source: Copy-and-Patch Compilation by Haoran Xu and Fredrik Kjolstad)

I deem this not worth the implementation effort (and extra build time). Tier 2 build time (i.e. make) is a lot slower on my machine because it has to support many more stencils.

Seems like regalloc is the next most promising optimization.
