Superinstructions for Copy & Patch JIT #647
Thanks for exploring this, @JeffersGlass! In the interest of getting everyone on the same page, I'll repeat one of my comments from Discord:
As you've noted here, handling this in the JIT introduces quite a bit of complexity, and I'm still sort of leaning towards making superinstructions into their own uops, since we already have the necessary machinery to generate an interpreter loop containing some concatenated uops (and it also benefits the tier two interpreter). We would just need to tweak the tier two instruction format a bit. Quoting you now:
Then let's not allow them to be any depth! Something like 4 should be more than enough to get started. As a quick experiment, I just prototyped something here: https://godbolt.org/z/7ccP5offd. No need to generate any new files. :)
I think this is a great next step. Once we have lists of common pairs, we can start evaluating sequences that are good candidates for combining.
Let's not get too hung up on lookup speed right now, and assume it's a solved problem. There are many potential options available to us (double-lookup, binary search, hash table, etc.).
Don't worry, we have dedicated benchmarking infrastructure to both collect stats and measure performance on a bunch of platforms.
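To make the hash-table option above concrete, here's a minimal Python sketch of the idea (illustrative only: the real lookup would live in C inside the optimizer/JIT, and the uop and superinstruction IDs below are made up):

```python
# Hypothetical uop and superinstruction IDs; real IDs come from the generated uop headers.
_LOAD_FAST, _GUARD_BOTH_INT, _BINARY_OP_ADD_INT = 10, 11, 12
_LOAD_FAST_PLUS_GUARD_BOTH_INT = 500

# Hash-table option: map an adjacent pair of uop IDs to a superinstruction ID.
SUPER_TABLE = {
    (_LOAD_FAST, _GUARD_BOTH_INT): _LOAD_FAST_PLUS_GUARD_BOTH_INT,
}

def select(trace: list[int], i: int) -> tuple[int, int]:
    """Return (opcode_to_emit, uops_consumed) for the uop at trace[i]."""
    if i + 1 < len(trace):
        fused = SUPER_TABLE.get((trace[i], trace[i + 1]))
        if fused is not None:
            return fused, 2   # emit the superinstruction's stencil
    return trace[i], 1        # fall back to the single-uop stencil

print(select([_LOAD_FAST, _GUARD_BOTH_INT, _BINARY_OP_ADD_INT], 0))  # -> (500, 2)
```

The double-lookup and binary-search options trade the dictionary for a per-first-opcode table or a sorted pair array, but the interface stays the same.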
My branch at JeffersGlass/pystats-uop-pairs now has functional tracking of adjacent UOp pairs in executors. The results are also output as part of the pystats results, in a collapsed table titled "Pair counts for top 100 uop pairs".
Currently, this branch works in the JIT by building a call to a new stats-counting function into the JIT template. That works in the JIT'd case, and I've built in what I think is necessary to run it sans-JIT, but I realize I don't know how to build/run the Tier 2 interpreter without the JIT. Related: is there a simple way to run pyperformance locally with pystats enabled?

If it seems like this is on the right track, I can work on turning it into a PR. I may need a little guidance on code organization; I splashed the new function and its declaration in a little haphazardly. That's in addition to figuring out how to make the stats call an "optional" part of the template.

After that, I think a script to "score" pairs/sequences of UOps (i.e. how much shorter they are when compiled together vs. separately) could be useful and interesting to build.

Thanks again for your support and time, I'm having a swell time getting to know the JIT internals better.
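As an aside for readers following along, the pair tracking described above boils down to a counter over adjacent opcodes in each executor's trace. A rough Python equivalent (not the actual pystats implementation; the uop names in the example traces are made up):

```python
from collections import Counter

def count_uop_pairs(executors: list[list[str]]) -> Counter:
    """Count adjacent (first, second) uop pairs across a set of executor traces."""
    pairs = Counter()
    for trace in executors:
        pairs.update(zip(trace, trace[1:]))
    return pairs

# Stand-in traces in place of real tier-two executors.
traces = [
    ["_CHECK_VALIDITY", "_LOAD_FAST", "_LOAD_FAST", "_BINARY_OP_ADD_INT"],
    ["_CHECK_VALIDITY", "_LOAD_FAST", "_LOAD_CONST", "_BINARY_OP_ADD_INT"],
]
for (first, second), n in count_uop_pairs(traces).most_common(3):
    print(f"{first:>20} -> {second:<20} {n}")
```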
As a brief experiment, I ran (local) pyperformance on both the current 3.13.0a3 build and my experimental jitted version above with the 96 opcode pairs listed in the previous post. In general, the jitted version is just a little slower, though that's to be expected given the caveats we've been talking about, and the fact that most of these superinstructions probably don't help much. A few specific benchmarks were faster, though.

Stats for … (collapsed)
Could you please rerun the command with table output?
Gladly! That is surely easier to read. Results are collapsed below. Looks like overall a ~5% slowdown, with some significant outliers. I would guess the faster ones involve some specific opcode pairs from the list above in a useful way, enough to overcome the overhead of the additional compilation steps/superinstruction lookup.

pyperf stats as table (collapsed; per-tag results for 'apps', 'asyncio', 'math', 'regex', 'serialize', 'startup', 'template', and all benchmarks. Benchmarks hidden because not significant (12): bench_thread_pool, xml_etree_process, pickle, scimark_lu, coroutines, telco, create_gc_cycles, typing_runtime_protocols, scimark_sor, async_tree_eager_io, async_tree_eager_io_tg, python_startup_no_site)
I took a preliminary pass at a tool for scoring sequences of UOps. Here are the results for the ~93 most-common valid pairs from above, comparing the length of each combined stencil to the sum of the lengths of its parts (collapsed table: "UOp sequence scores from above (top 93 valid pairs)").
This list uses "length of code_body" as a proxy for "speed", which isn't going to be exactly correct. I.e., I don't expect a condensed superinstruction that's 70% of the length of the sum of its parts to be 70% faster. But it's a simple heuristic to start with. This is all on x86_64 Linux, using my branch above.

It's fun to see which pair comes out on top there.

I will get that tool tidied up as well. Now that the main JIT branch has merged into main, that's a relatively straightforward PR, if having the tool in main would be generally useful. If not, I can publish it as a separate tool.

Also, I know I've gone on a bit of a tear here this week. If this is more spamming than useful, I'm happy to move these ideas/projects elsewhere.
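For concreteness, the scoring heuristic described here reduces to something like the following (the lengths are hypothetical, not real measurements from the tool):

```python
def score_pair(combined_len: int, first_len: int, second_len: int) -> tuple[float, int]:
    """Return (relative size, absolute saving) for a fused pair of stencils."""
    separate = first_len + second_len
    return combined_len / separate, separate - combined_len

# Hypothetical code_body lengths for a fused stencil and its two components.
ratio, saved = score_pair(combined_len=70, first_len=40, second_len=60)
print(f"fused stencil is {ratio:.0%} of the separate stencils, saving {saved} units")
```

The same two numbers give the "percentage difference" and "absolute difference" orderings used in the later scoring tables.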
Wait, I just realised something about what you were benchmarking against.

PS: main just got the JIT merged in, and it is currently faster without the JIT than with the JIT, due to micro-op overhead. So that explains your 5% slowdown. Actually, if it's only 5%, that's a huge improvement: it should be 7-10% slower. Which means your change might have caused a ~2% speedup over the current JIT!
I maintain the benchmarking infrastructure for the team, so I'm happy to answer any questions related to pyperformance etc.

@JeffersGlass wrote:

> Is there a simple way to run pyperformance locally with pystats enabled?

If you have a build with --enable-pystats, …

As for running the Tier 2 interpreter without the JIT: there's no configure flag in this case, but you need a non-JIT build. The easiest way to run this with pyperformance is with the …

Also, if you ever want to have a branch run on the official infrastructure, just ping a member of the Microsoft team on Discord -- it's very easy, but unfortunately we can't make it "self-serve on the open web" for security reasons.
Thank you @mdboom for the information, it's been very helpful. I will gladly take you and the team up on running some benchmarks on official infrastructure once things stabilize a bit; I'd be curious how much difference running with, say, the top ~1000 most-promising opcode sequences makes.

Apologies for not responding sooner. As I mentioned to Brandt, I'm moving house at the moment, and it's significantly cutting into my spare time to dig into this. That said, I am still working on getting pairs/triples/sequences of UOp counts into pystats. I have a branch (uop-sequence-count) that successfully achieves this (for the non-JIT Tier 2 interpreter only, for now). Only two things remain before I submit it as a PR: making the maximum sequence length adjustable by an environment variable, and re-adding functionality to …
Thanks for working on this. Having stats for tier 2 pairs would be really useful.

I don't think we want to implement superinstructions yet, as it will complicate register allocation and generating multiple stencils for instructions like … When we do want superinstructions, in the not-too-far future, having the stats will help a lot.
Thanks @markshannon, that PR is now live. It includes the ability to track pairs, triples, or sequences of any length, but it defaults to only counting pairs.

10-4 on holding off on implementing superinstructions for a bit. With the ability to collect stats, I'd like to continue to play around and see what results from adding longer chains of superinstructions, and what the consequences are for performance/size. I'll post results here as they come along.
Another thing I'd be interested in seeing (and that we may be able to incorporate sooner) is common superinstructions that can be formed without changing the existing instruction format. Meaning, there is at most one each of oparg, operand, and target used across the whole sequence. For example, …

(A related, but hairier, question would be identifying pairs with at most one unique value for each member. So …)
I've been doing a little work around this idea:
Here's what the output could look like, if I understand the requirements correctly (which I may not have). The results are in the collapsed table "Pair counts for top 100 uop pairs".
The conditions I think I understand for whether each UOp uses the three input kinds (oparg, operand, target) are: …
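A sketch of what that format-compatibility filter might look like, assuming we can get, for each uop, whether it reads oparg, operand, or target (the UopInfo fields below are placeholders, not the real cases-generator metadata):

```python
from dataclasses import dataclass

@dataclass
class UopInfo:
    name: str
    uses_oparg: bool = False     # placeholder flags; the real data would come
    uses_operand: bool = False   # from the uop metadata / cases generator
    uses_target: bool = False

def format_compatible(first: UopInfo, second: UopInfo) -> bool:
    """True if the pair could keep the existing instruction format:
    at most one user of each of oparg, operand and target across both uops."""
    return (first.uses_oparg + second.uses_oparg <= 1
            and first.uses_operand + second.uses_operand <= 1
            and first.uses_target + second.uses_target <= 1)

a = UopInfo("_LOAD_FAST", uses_oparg=True)
b = UopInfo("_GUARD_BOTH_INT", uses_target=True)   # illustrative flags only
print(format_compatible(a, b))  # True: no field is needed by both uops
```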
I propose that we automatically generate templates for all permutations of uops within a single macro as well. Here's an example: …
Some stuff can be eliminated by guard elimination. A simple heuristic would be: if …, then we should fuse them and make the superinstructions …

The second condition is crucial -- because we know this chain is not speculative, it is guaranteed to occur. From the table above, there is indeed a commonly occurring permutation: …
To find the longest chain at runtime, we can automatically generate another abstract interpreter from the macro definition that finds the longest matching chain on the first occurrence of an instruction it sees, and replaces it. I can work on this part if y'all are keen on the idea.

Reasoning:
Currently the optimizer doesn't optimize all instructions. Just as a figurative stat, only about 30% of the upper bound … The macro with the highest uop count is …
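To illustrate the replacement half of this proposal, the pass over a trace could be a greedy longest-match: at each position, try the longest known fused sequence first. A Python sketch with made-up uop and supernode names (a real version would walk _PyUOpInstruction arrays and respect operand constraints):

```python
# Known fused sequences mapped to a made-up supernode name.
SUPERNODES = {
    ("_GUARD_BOTH_INT", "_BINARY_OP_ADD_INT", "_STORE_FAST"): "_GUARD_ADD_STORE",
    ("_GUARD_BOTH_INT", "_BINARY_OP_ADD_INT"): "_GUARD_ADD",
}
MAX_LEN = max(len(seq) for seq in SUPERNODES)

def fuse(trace: list[str]) -> list[str]:
    """Greedily replace the longest matching chain at each position."""
    out, i = [], 0
    while i < len(trace):
        for length in range(min(MAX_LEN, len(trace) - i), 1, -1):
            node = SUPERNODES.get(tuple(trace[i:i + length]))
            if node is not None:
                out.append(node)
                i += length
                break
        else:
            out.append(trace[i])
            i += 1
    return out

print(fuse(["_LOAD_FAST", "_GUARD_BOTH_INT", "_BINARY_OP_ADD_INT", "_STORE_FAST"]))
# -> ['_LOAD_FAST', '_GUARD_ADD_STORE']
```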
Here are some updated pair counts after rebasing from main, plus some performance data exploring which subsets of superinstructions might be most valuable. For the moment, this incorporates both "format compatible" superinstructions that don't have overlapping oparg/operand/target, as well as incompatible ones that have overlap.

These are the top 500 UOp pairs when running pyperformance, as of where main was on Friday evening (collapsed table: "Pair counts for top 500 uop pairs"). Pairs of specialized operations that deoptimize and are then followed by …
To assess their potency, I've taken the 326 UOp pairs that account for more than 0.1% of all UOp pairs and run them through the scoring script, which compares the length of the superinstruction in machine instructions to the sum of the lengths of its components. Here are those scores, by both percentage difference and absolute difference:
Scoring for top 326 UOp Pairs (x86_64)
Of course, the specifics of exactly how much shorter a superinstruction is will vary with implementation details and by platform. So, how much shorter is "short enough to be worth it"? Should superinstructions be weighted by their quality and prevalence? Probably some experimentation is needed. And of course there are lots of other ways candidate pairs/sequences could be identified, like @Fidget-Spinner's method above. I tried a couple of arbitrary testing points locally:

Using the 83 pairs* that reduce the machine code count by 20 or more: ~2% faster (local). (Per-tag pyperf tables collapsed. Benchmarks hidden because not significant (24): pprint_safe_repr, fannkuch, nqueens, unpickle, gc_traversal, xml_etree_iterparse, python_startup, scimark_sparse_mat_mult, tornado_http, chaos, unpickle_list, unpickle_pure_python, async_tree_cpu_io_mixed, logging_silent, pprint_pformat, async_tree_eager, xml_etree_generate, pidigits, async_tree_none, html5lib, mdp, async_tree_eager_io_tg, xml_etree_parse, bench_thread_pool)

Using the 251 pairs* that reduce the machine code count by 13 or more: ~4% faster (local). (Per-tag pyperf tables collapsed. Benchmarks hidden because not significant (16): bench_mp_pool, async_tree_none, tornado_http, typing_runtime_protocols, chameleon, unpickle_list, json_loads, asyncio_tcp, scimark_sor, regex_effbot, async_tree_eager_io, async_tree_none_tg, mdp, html5lib, async_tree_eager_io_tg, bench_thread_pool)

*Not including pairs that include …
I thought I'd share an update on the superinstruction experiments. I've started from scratch in this branch by teaching the bytecode analyzer to understand superinstructions:

```
// supernodes.c
super() = _LOAD_FAST_1 + _GUARD_BOTH_INT
super() = _TIER2_RESUME_CHECK + _SET_IP
super() = _BINARY_OP_ADD_INT + _LOAD_CONST_INLINE_BORROW
```

This allows it to calculate and emit metadata and IDs the same way it handles bytecodes:

```
[_LOAD_FAST_1_PLUS__GUARD_BOTH_INT] = HAS_LOCAL_FLAG | HAS_EXIT_FLAG,
[_TIER2_RESUME_CHECK_PLUS__SET_IP] = HAS_DEOPT_FLAG | HAS_OPERAND_FLAG,
[_BINARY_OP_ADD_INT_PLUS__LOAD_CONST_INLINE_BORROW] = HAS_ERROR_FLAG | HAS_PURE_FLAG | HAS_OPERAND_FLAG,
...
```

```
#define _WITH_EXCEPT_START WITH_EXCEPT_START
#define _YIELD_VALUE YIELD_VALUE
#define MAX_VANILLA_UOP_ID 451
#define _LOAD_FAST_1_PLUS__GUARD_BOTH_INT 452
#define _TIER2_RESUME_CHECK_PLUS__SET_IP 453
#define _BINARY_OP_ADD_INT_PLUS__LOAD_CONST_INLINE_BORROW 454
#define MAX_UOP_ID 454
```

We generate a switch statement in a similar way to previous experiments (in jit_switch.c):

```
// jit_switch.c
SuperNode
_JIT_INDEX(const _PyUOpInstruction *uops, uint16_t start_index) {
switch (uops[start_index + 0].opcode) {
case _LOAD_FAST_1:
switch (uops[start_index + 1].opcode) {
case _GUARD_BOTH_INT:
return (SuperNode) {.index = _LOAD_FAST_1_PLUS__GUARD_BOTH_INT, .length = 2};
break;
...
```

The biggest new experiment is automatically iterating on sets of supernodes. Tools/scripts/supernode_analysis.py contains tools for analyzing a set of pystats and deriving a new set of supernodes from the given data. For instance:

```
# Run (up to) 5 generations of build/run-pystats/derive-new-supernodes, using 4 threads for builds, verbose=1, running only the docutils benchmark
$ python Tools/scripts/supernode_analysis.py iterate -v -i5 -j4 -b docutils
Beginning supernode generation process for 10 iterations max
Starting supernode generation 1 of 10
Generating statistics
Updating supernode metadata and building JIT
Generating supernodes from stats
Added 187 of 1224 possible supernodes that make up more than 0.1% of nodes and are viable
Updating supernode metadata and building JIT
Added 128 supernodes
Starting supernode generation 2 of 10
Generating statistics
Stat-ing python with 128 of 128 nodes
Updating supernode metadata and building JIT
...
```
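The derive-new-supernodes step is conceptually simple; a simplified sketch of what it might do (the pystats parsing and the real viability rules are stubbed out, and the 0.1% threshold mirrors the log output above):

```python
def derive_supernodes(pair_counts: dict[tuple[str, str], int],
                      threshold: float = 0.001) -> list[str]:
    """Turn pystats pair counts into supernodes.c-style definitions, keeping
    pairs that make up at least `threshold` of all counted pairs."""
    total = sum(pair_counts.values())
    lines = []
    for (first, second), count in sorted(pair_counts.items(), key=lambda kv: -kv[1]):
        if count / total < threshold:
            break                          # counts are sorted, so we can stop here
        if not is_viable(first, second):
            continue
        lines.append(f"super() = {first} + {second}")
    return lines

def is_viable(first: str, second: str) -> bool:
    # Placeholder rule only; the real check would reject nodes the JIT can't fuse.
    return "_JUMP_TO_TOP" not in (first, second)

print("\n".join(derive_supernodes({
    ("_LOAD_FAST_1", "_GUARD_BOTH_INT"): 900,
    ("_JUMP_TO_TOP", "_SET_IP"): 500,
    ("_RARE_A", "_RARE_B"): 1,
})))
```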
There are still some bugs floating around in superinstruction construction/usage (and possibly in Tier 2 itself?), so this script will, by default, detect errors during Python builds and during the pystats runs, bisect to find the troublesome superinstructions, and remove them from the run:

```
...
Stat-ing python with 28 of 28 nodes
Updating supernodes.c
Updating supernode metadata and building JIT
Stat FAILED, bisecting
Stat-ing python with 14 of 28 nodes
Updating supernodes.c
...
Identified bad node during stat: _START_EXECUTOR_PLUS__POP_TOP
Building Python with 27 nodes
...
```
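The bisection itself is ordinary delta debugging over the node list; roughly like this, with a stand-in build_and_stat callback and the assumption that a single node is to blame:

```python
def find_bad_node(nodes, build_and_stat):
    """Bisect to one supernode whose presence makes the build/pystats run fail.
    `build_and_stat(nodes)` is assumed to return True on success."""
    if build_and_stat(nodes):
        return None                     # everything passes; nothing to blame
    while len(nodes) > 1:
        half = len(nodes) // 2
        first, second = nodes[:half], nodes[half:]
        # Assumes a single bad node: recurse into whichever half still fails.
        nodes = first if not build_and_stat(first) else second
    return nodes[0]

# Toy stand-in: pretend any node set containing this one fails to build/stat.
bad = "_START_EXECUTOR_PLUS__POP_TOP"
print(find_bad_node(["_A", "_B", bad, "_C"], lambda ns: bad not in ns))
```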
There's much more to be done; I'm tracking granular todos in an issue on my fork, but some big areas of investigation: …
It was a pleasure to meet so many people from this thread at PyCon US in May; thanks for the conversations during the open spaces and sprints. I hope these experiments prove useful, and I will share more observations and results as they pop up.
FWIW, I gave another shot at superinstructions. With the newest JIT, it shows no speedup on my computer: https://github.com/Fidget-Spinner/cpython/pull/new/Fidget-Spinner:cpython:supernodes. Even on small microbenchmarks (e.g. iterative Fibonacci). This branch's superinstructions support up to 7 instructions (14 operands!). (Last working commit: d95dcdd55106f7f228ac8c84a979d52cfeb4578b)

I suspect most of the previous wins were from removing the zero-length jumps. Since the JIT is now emitting more efficient code, I don't think we need this anymore.
I tried a true "baseline" JIT: that is, turning tier 1 bytecode directly into JIT stencils where possible: https://github.com/Fidget-Spinner/cpython/pull/new/Fidget-Spinner:cpython:tier1_baseline. This should be a really strong case for superinstructions. However, there's almost no speedup there either on iterative Fibonacci.
Inspired by @brandtbucher's recorded talk from the CPython core sprint, and his work in the Copy & Patch JIT PR, I've worked up a version of that JIT that allows for "superinstructions": that is, pairs/triples/sequences of instructions that are pre-compiled into stencils, the same way single UOps are in the current PR.
The branch at JeffersGlass/cpython/tree/justin-supernodes shows this in action. If you build that branch with `./configure --enable-experimentaljit` and `make`, all of the opcode sequences listed in `Tools/jit/superinstructions.csv` will be built into stencils and made available to the optimizer to JIT-compile with.

I refer to the length of the longest sequence of UOps in a single superinstruction as the "depth" of the superinstruction set. Much of the complexity of this branch stems from the desire to allow the builder to input sequences of any depth, and simply accommodate them in the build process.
Key Changes

- `template.c` files that serve as the basis for creating the JIT stencils must be created. This is handled in `_template.py`, at build time.
- `jit.c` is constructed at build time by `_jic_c.py`, using `_jit_template.c` as a template.
- `jit.c` also includes a new function, `_JIT_INDEX`, which uses a nested switch statement generated from the superinstructions list to select the correct superinstruction (if any) from the upcoming UOps.
- A `jit_defines.h` file is also emitted at build time, with indices for the new superinstructions and some other utility data (`MAX_SUPERINST_ID`).
- The `_stencils.HoleValue` enum is created dynamically after the depth is known.

Places for Improvement
Better Superinstruction Choices
The version of `superinstructions.csv` is, at the moment, a smattering of short sequences that popped up during testing. It's not vetted, and most of those combinations may not even be significantly shorter than just JIT-ing their components individually.

Brandt suggested adding instrumentation to `--enable-pystats` builds to log adjacent op pairs for tier two, like is currently done for tier 1. That's a challenge I would be interested in taking on, time permitting.

Better Superinstruction Selection at JIT-Time
As you'll see in the built `jit.c:_JIT_INDEX()`, the way that the optimizer selects which op or superinstruction to emit is via a giant nested switch statement, which I'm counting on the compiler to "cleverly" turn into something more efficient and compact.

There's surely a better way to do that matching. The Xu/Kjolstad paper mentions a "tree-matching" technique, but I couldn't track it down quickly in either of their reference projects. Or perhaps a windowed lookup of some kind would work.
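For what it's worth, the nested switch is at least cheap to produce: a toy, depth-2-only version of such a generator might look like this (the SuperNode/_JIT_INDEX shape is borrowed from the jit_switch.c excerpt earlier in this thread, and the sequence data is made up):

```python
# Map (first, second) uop sequences to a made-up superinstruction index name.
SEQUENCES = {
    ("_LOAD_FAST_1", "_GUARD_BOTH_INT"): "_LOAD_FAST_1_PLUS__GUARD_BOTH_INT",
}

def emit_index_switch(sequences) -> str:
    """Emit the body of a nested switch that matches pairs of upcoming uops."""
    by_first: dict[str, list[tuple[str, str]]] = {}
    for (first, second), name in sequences.items():
        by_first.setdefault(first, []).append((second, name))
    lines = ["switch (uops[start_index + 0].opcode) {"]
    for first, rest in by_first.items():
        lines.append(f"    case {first}:")
        lines.append("        switch (uops[start_index + 1].opcode) {")
        for second, name in rest:
            lines.append(f"            case {second}:")
            lines.append(f"                return (SuperNode){{.index = {name}, .length = 2}};")
        lines.append("        }")
        lines.append("        break;")
    lines.append("}")
    return "\n".join(lines)

print(emit_index_switch(SEQUENCES))
```

A tree/trie matcher would replace the inner switches with a walk over a prefix tree of sequences, which is probably the cleaner route once depths grow past two or three.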
Benchmarking
I haven't done any benchmarking with the paltry 7 superinstructions currently used, since I don't actually expect it to be faster (yet?).
Cross-Compilation
This is only tested on x86_64 Linux, as that's what I have access to and have a build environment set up for. I'd be really curious whether it works elsewhere.
This was mostly an experiment for my own edification, and to become more familiar with the new JIT/UOp internals. I hope some of it is useful and interesting.
Thanks to Brandt for his welcoming energy in the Python discord, and for answering my questions.