
Implement CALL_FUNCTION specialization "family" for PEP 659 adaptive interpreter #54


Closed
markshannon opened this issue May 21, 2021 · 24 comments

@markshannon (Member)

There are a number of specializations of CALL_FUNCTION that make sense (each call shape is illustrated in the snippet after this list):

  • Calls to a Python function where the arguments and parameters match exactly.
  • Calls to a Python function with keyword arguments, or defaults, when the argument shuffle can be pre-computed.
  • Calls to builtin functions.
  • Calls to specific builtin functions, notably len and isinstance.
  • Calls to type with a single argument.
  • Instantiation of a "normal" class, that is a class that does not redefine __new__ and whose metaclass is type.
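
For concreteness, a hypothetical snippet in which each call would be a candidate for one of the specializations above (illustrative only):

    def f(a, b):
        return a + b

    class C:                 # a "normal" class: no __new__, metaclass is type
        pass

    f(1, 2)                  # Python function; arguments match parameters exactly
    f(1, b=2)                # keyword argument; the shuffle is fixed per call site
    abs(-5)                  # builtin function
    len([1, 2, 3])           # specific builtin: len
    isinstance(f, object)    # specific builtin: isinstance
    type(f)                  # type with a single argument
    C()                      # instantiation of a normal class
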
@Fidget-Spinner (Collaborator)

Fidget-Spinner commented Jun 26, 2021

  • Calls to builtin functions.
  • Calls to specific builtin functions, notably len and isinstance.

Hi Mark, I tried to cover these three (builtin calls generally, plus len and isinstance) with a single CALL_CFUNCTION opcode. It should generally speed up calls to any C function.
Diff here: python/cpython@main...Fidget-Spinner:call_function_specialize

I wonder if my approach is the right one. Would you like me to create a PR against CPython?

Edit:
5-15% less call overhead on pyperf microbenchmarks, with system tune and LTO+PGO: https://gist.github.com/Fidget-Spinner/b2f05e0f6b8851d41e09412f5189ce8f (measured with this benchmarking script: https://github.com/Fidget-Spinner/tools/blob/time_builtin_calls/scripts/time_builtin_calls.py). A sample of what gets optimized at Python startup is here: https://gist.github.com/Fidget-Spinner/b2f05e0f6b8851d41e09412f5189ce8f#gistcomment-3794651
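
For reference, a minimal pyperf-based microbenchmark in the same spirit as the linked script (the specific benchmarks here are made up, not taken from that script):

    # benchmark_calls.py -- run with: python benchmark_calls.py
    import pyperf

    runner = pyperf.Runner()
    runner.timeit("len(list)", stmt="len(x)", setup="x = [1, 2, 3]")
    runner.timeit("isinstance(int)", stmt="isinstance(1, int)")
    runner.timeit("abs(int)", stmt="abs(-5)")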

@markshannon (Member, Author)

My experimental implementation of CALL_FUNCTION specialization had the following specializations:

  • CALL_FUNCTION_PY_POSITIONAL_ONLY (Calls to Python functions with all arguments provided positionally)
  • CALL_FUNCTION_PY_WITH_ARGMAP (Calls to Python functions with a few keyword arguments and/or defaults; sketched after this comment)
  • CALL_FUNCTION_ALLOCATE_AND_INIT (Calls to Python classes that don't implement __new__)
  • CALL_FUNCTION_BUILTIN_0 (Implemented by bpo-44525: Specialize CALL_FUNCTION for C function calls python/cpython#26934)
  • CALL_FUNCTION_BUILTIN_NOARGS
  • CALL_FUNCTION_BUILTIN_FASTCALL

I didn't do a very thorough analysis of the benefits of the above specializations.
@Fidget-Spinner's tests suggest that CALL_FUNCTION_BUILTIN_FASTCALL isn't worth it.
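
To make the argmap idea concrete, here is a pure-Python sketch of precomputing the keyword-to-slot shuffle once per call site (all names and the data layout are invented for illustration; the real version would operate on code objects in C):

    import inspect

    def make_argmap(func, kwnames):
        # Resolved once, at specialization time: which parameter slot each
        # keyword argument lands in.
        params = list(inspect.signature(func).parameters)
        return [params.index(name) for name in kwnames]

    def call_with_argmap(func, args, kwvalues, argmap, nparams):
        # At call time, the shuffle is just a few indexed stores.
        slots = [None] * nparams
        slots[:len(args)] = args
        for value, slot in zip(kwvalues, argmap):
            slots[slot] = value
        return func(*slots)

    def f(a, b, c):
        return (a, b, c)

    argmap = make_argmap(f, ("c", "b"))                  # precomputed per call site
    print(call_with_argmap(f, (1,), (3, 2), argmap, 3))  # (1, 2, 3)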

@Fidget-Spinner (Collaborator)

The very first iteration of my PR also had the equivalent of CALL_FUNCTION_BUILTIN_NOARGS, and it showed no change in microbenchmarks either. Hopefully this helps narrow down what we need to specialize :).

@markshannon (Member, Author)

Another thing to consider is specializing for calls to particular builtin objects, e.g. type, isinstance, and len.

@Fidget-Spinner (Collaborator)

Mark, I've recently taken a stab at simple Python functions. Could you share your thoughts on how you intend to optimize this scenario?

Calls to a Python function where the arguments and parameters match exactly.

I see some possible micro-optimizations in initialize_locals, but I don't see where else to gain. Your help would be much appreciated.

@Fidget-Spinner (Collaborator)

Some updates: simple Python function specialization (CALL_FUNCTION_PY_SIMPLE, aka CALL_FUNCTION_PY_POSITIONAL_ONLY) yielded a 10% speedup in microbenchmarks on Windows with PGO [1]. I'm not confident this will move pyperformance, but I'll benchmark anyway. My personal rule of thumb is that we need a >20% speedup on microbenchmarks (depending on how common the opcode is) to affect pyperformance.

Additionally, many of the specializations for CALL_FUNCTION can be easily refactored and reused for CALL_METHOD. So I may try that and see if we can move pyperformance.

[1]

"python.exe"  -m timeit -s "x = lambda a:a" "x(1)"

Main:
5000000 loops, best of 5: 56.1 nsec per loop

Specialized:
5000000 loops, best of 5: 50.9 nsec per loop

Very rough WIP commit (I'm probably not optimizing very well): Fidget-Spinner/cpython@d8f3330

@markshannon (Member, Author)

My somewhat informal analysis of specializing CALL_FUNCTION is that >90% (probably >95%) of calls are one of:

  • Python functions
  • Builtin functions
  • Classes (both instantiations and calls to type and str)

with the first two dominating.

Specializing Python functions without builtin functions, or vice versa, will pay a relatively large cost for CALL_FUNCTION_ADAPTIVE. Specializing both should show a speedup.

@Fidget-Spinner I'm going to implement CALL_FUNCTION_PY_SIMPLE as there are some nice optimizations we can make by overlapping frames. Hopefully it will show a speed up, and then we can merge python/cpython#26934.
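
A toy model of the overlapping-frames idea, on the assumption that it means the arguments the caller pushed become the first local slots of the callee's frame (everything here is invented for illustration; the real layout is C-level):

    # One shared value stack; the callee's local slots alias the region
    # where the caller already pushed the arguments, so entry copies nothing.
    stack = []

    def enter_call(nargs):
        # The callee's locals start where the caller's arguments sit.
        return {"locals_base": len(stack) - nargs}

    stack.extend([10, 20])                     # caller pushes two arguments
    frame = enter_call(2)
    base = frame["locals_base"]
    assert stack[base:base + 2] == [10, 20]    # args are already the callee's locals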

@Fidget-Spinner (Collaborator)

Fidget-Spinner commented Jul 30, 2021

@Fidget-Spinner I'm going to implement CALL_FUNCTION_PY_SIMPLE as there are some nice optimizations we can make by overlapping frames. Hopefully it will show a speed up, and then we can merge python/cpython#26934.

@markshannon alright. Really looking forward to that :). Btw, I noticed that quite a few CALL_FUNCTION calls in pyperformance are to PyMethodObjects too. Simple method calls can probably reuse code from CALL_FUNCTION_PY_SIMPLE.

@markshannon (Member, Author)

Here are the numbers from running the benchmark suite. (May include some of the machinery, not just the benchmarks.)

2 billion executions of CALL_FUNCTION or its quickened variants.

32.5M specialization attempts
Breakdown:

Type              Attempts  Percentage
Builtin function  20.1M     61.9%
Class             5.95M     18.4%
Python function   4.8M      14.9%
Bound method      1.2M      3.7%
Other             0.3M      1.1%

This explains why specializing Python calls doesn't have as much effect as we might expect.

@markshannon (Member, Author)

Further breaking down the builtin function percentages:

Flags                               Percentage
METH_O                              30.2%
METH_FASTCALL                       26.6%
METH_FASTCALL | METH_KEYWORDS       2.3%
METH_VARARGS (incl. METH_KEYWORDS)  2.0%
METH_NOARGS                         0.8%
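
For orientation, some familiar builtins that use these calling conventions (the flag lives on the C-level method definition, not on the Python object; the assignments below reflect recent CPython and can vary between versions):

    len([1, 2])                    # METH_O: exactly one positional argument
    isinstance(1, int)             # METH_FASTCALL: positionals passed as a C array
    sorted([3, 1], reverse=True)   # METH_FASTCALL | METH_KEYWORDS
    locals()                       # METH_NOARGS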

@gvanrossum (Collaborator)

Very cool to know. Next question: what's the breakdown per benchmark?

@Fidget-Spinner (Collaborator)

Fidget-Spinner commented Aug 3, 2021

2 billion CALL_FUNCTION or quickened variants.
32.5M specialization attempts

Thanks for crunching the numbers, Mark! Just to clarify: those are specialization attempts, not runtime execution counts, right?

I suspect that while the number of specialization sites for Python functions and bound methods is low, the frequency of instruction execution is high. Specialization attempts are sometimes deceptive -- e.g. my experiments with METH_FASTCALL showed that at specialization time ~25% of calls had 2 or 3 arguments, but at instruction execution time that dropped to ~10% (python/cpython#26934 (comment))

@markshannon (Member, Author)

The 2 billion is the execution count. The rest are specialization attempts; multiply by ~64 to get an estimate of execution counts (32.5M × 64 ≈ 2.1 billion, consistent with the total above).

I don't see how the specialization numbers would be that different, or even differ by more than a small fraction.
Without any actual specialization, the specialization attempts are just samples of execution.

@Fidget-Spinner (Collaborator)

Fidget-Spinner commented Aug 4, 2021

Thanks for clearing up my misunderstanding, Mark!

I don't have runtime specialization data for the following, but it should help identify which benchmarks to expect some speedup in. From the dxpairs data in the faster-cpython/tools repo:

Top benchmarks for CALL_FUNCTION (truncated <5%):

 bm_mako.json                      616,984   9.56%
 bm_django_template.json         1,016,938   6.09%
 mypy.json                      68,488,372   6.06%
 bm_regex_effbot.json               50,195   5.69%
 bm_chameleon.json                 505,263   5.69%
 bm_xml_etree.json               4,702,957   5.60%

CALL_METHOD (truncated <5%):

bm_mako.json                      468,476   7.26%
bm_deltablue.json                 186,465   6.69%
bm_richards.json                1,465,382   5.25%

Sanity check with BINARY_SUBSCR, since that was a successful specialization:

bm_crypto_pyaes.json            3,723,498  12.14%
bm_nqueens.json                 3,553,021   9.18%
bm_fannkuch.json               17,918,473   8.16%
bm_scimark.json                23,790,519   6.90%
bm_hexiom.json                    274,217   5.75%
bm_nbody.json                   4,509,873   5.59%

The BINARY_SUBSCR stats roughly corroborate the benchmarks that sped up.

I'll be very shocked if specializing CALL_FUNCTION doesn't leave a dent in the benchmarks, especially considering that calls are so much slower than BINARY_SUBSCR, which showed benchmark speedups with similar percentages. Mark, if you don't mind, could you please point me to your branch for CALL_FUNCTION_PY_SIMPLE with overlapping frames? I'd like to try to collect more info.

@markshannon (Member, Author)

First, to summarize what I now believe the benchmarks and other evidence is telling us:

  • The vectorcall protocol for calling C functions is fast. Any specialization that adds overhead to that is likely to make things worse.
  • Specializing for specific cases (exactly one argument to METH_O, and maybe exactly two arguments to METH_FASTCALL) will produce a small speedup for those calls.
  • The ADAPTIVE form is not free, so unless we get high hit ratios, the additional cost of the ADAPTIVE form will eat any gains made by specialization.

While that might suggest that it isn't worth specializing CALL_FUNCTION, I still think it is, but we will need to slice it a bit differently (see the dispatch sketch after this comment):

  1. Specialize for a few specific callables, e.g. type and len, that are internally simple. Removing the call overhead and inlining should give good speedups in those cases.
  2. Specialize for Python functions, because we will be unable to use vectorcall if we want to avoid making C calls for Python calls.
  3. Specialize for the one or two cases of builtin functions that show some speedup (METH_O and maybe METH_FASTCALL with 2 arguments)
  4. Specialize for vectorcall, just to avoid the ADAPTIVE overhead for most of the remaining cases.

There is a downside to specializing for vectorcall: we lose the adaptive nature of the specializing interpreter. Once we have specialized on vectorcall, we are unlikely to escape that state, because all the callables we want to specialize for implement vectorcall.

Because (2) above depends on removing C calls when making Python calls, we may need to wait until that is implemented before we get any worthwhile speedups for CALL_FUNCTION.
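
A pure-Python sketch of the dispatch this slicing implies (the opcode names are from the discussion above or invented; the real logic is C code in the specializer):

    import types

    def specialize_call(callable_, nargs):
        # (1) a few specific callables that are internally simple
        if callable_ is len and nargs == 1:
            return "CALL_FUNCTION_LEN"
        if callable_ is isinstance and nargs == 2:
            return "CALL_FUNCTION_ISINSTANCE"
        # (2) Python functions, so Python-to-Python calls can avoid C recursion
        if isinstance(callable_, types.FunctionType):
            return "CALL_FUNCTION_PY_SIMPLE"
        # (3) the builtin cases that measurably win
        if isinstance(callable_, types.BuiltinFunctionType) and nargs == 1:
            return "CALL_FUNCTION_BUILTIN_O"
        # (4) everything else goes through a generic vectorcall specialization
        return "CALL_FUNCTION_VECTORCALL"

    print(specialize_call(len, 1))   # CALL_FUNCTION_LEN
    print(specialize_call(abs, 1))   # CALL_FUNCTION_BUILTIN_O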

@markshannon (Member, Author)

Latest data running pyperformance, using https://github.com/faster-cpython/cpython/tree/specialize-call-function-stats

2 billion executions of CALL_FUNCTION or its quickened variants.

31.7M specialization attempts
Breakdown:

Callee                                        Percentage
len                                           12.4%
isinstance                                    15.6%
Other C function, METH_O                      18.3%
C function, METH_FASTCALL (2 arguments)       6.5%
C function, METH_FASTCALL (other arguments)   5.1%
C function (all other forms)                  4.7%
Python function (simple, matching arguments)  13.4%
Python function (other)                       1.2%
type (one argument)                           2.5%
Other classes                                 15.4%
Bound method                                  3.8%
Other                                         1.1%

@Fidget-Spinner (Collaborator)

I've somewhat given up on this :(. I updated the PR with specialized bytecode just for len and isinstance, and even that showed no speedup in pyperformance:

https://gist.github.com/Fidget-Spinner/07d6a123102c9ff8819f52bb8f9b57a6

@gvanrossum (Collaborator)

Never mind the benchmarks, do you have statistics about what fraction of CALL_FUNCTION opcodes are actually specialized? As Mark explained, if too many calls go through the ADAPTIVE variant and end up being deoptimized, that's expensive. This may explain why just specializing len and isinstance doesn't do us much good.

In any case, @pablogsal is working on optimizing specifically Python-to-Python calls. I don't have a link to his code yet, and I don't know exactly what his approach is, but he said he had it working for a simple recursive factorial() implementation and reported that it gave a great speedup there (1.7x according to my notes!).

@Fidget-Spinner (Collaborator)

Fidget-Spinner commented Sep 22, 2021

Never mind the benchmarks, do you have statistics about what fraction of CALL_FUNCTION opcodes are actually specialized?

According to Mark's stats above for pyperformance, more than 50% should be specialized. Running a workload of python -m test test_typing test_re test_dis test_zlib gives me 40% (unless I misinterpreted the stats, it's 985215 / (710360 + 724450 + 985215)). Very few deoptimizations for function calls.

    call_function.specialization_success : 810
    call_function.specialization_failure : 11976
    call_function.hit : 985215
    call_function.deferred : 710360
    call_function.miss : 9
    call_function.deopt : 3
    call_function.unquickened : 724450
    call_function.specialization_failure_kinds[10] : 21
    call_function.specialization_failure_kinds[11] : 3119  # Bound method 
    call_function.specialization_failure_kinds[12] : 170
    call_function.specialization_failure_kinds[13] : 1475  # PyCFunction with keywords
    call_function.specialization_failure_kinds[14] : 176 
    call_function.specialization_failure_kinds[15] : 26
    call_function.specialization_failure_kinds[16] : 0
    call_function.specialization_failure_kinds[17] : 3131  # Python function
    call_function.specialization_failure_kinds[18] : 3837  # Immutable class call, eg X()
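
For reference, the ~40% figure comes straight from those counters:

    hit, deferred, unquickened = 985_215, 710_360, 724_450
    print(hit / (hit + deferred + unquickened))   # ~0.407, i.e. ~40%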

My own experiments suggest the adaptive opcode overhead is only significant for things that already have low calling overhead, e.g. vectorcall C functions without keywords. Normal Python functions and classes don't show any measurable slowdown even in microbenchmarks because they're comparatively very expensive.

In any case, Pablo is working on optimizing specifically Python-to-Python calls.

Yep, he has a PR up at python/cpython#28488. I find his approach really elegant (wish I'd thought of it :P). When CALL_FUNCTION detects a Python function, it doesn't call _PyEval_Vector and eventually _PyEval_EvalFrameDefault. Instead, it sets up a new interpreter frame inline and then jumps to the start of _PyEval_EvalFrameDefault.
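
A toy model of that control flow (everything here is invented for illustration; the real change is C code inside _PyEval_EvalFrameDefault): instead of recursing into a fresh C-level eval call, the loop links a new frame and keeps dispatching:

    class Frame:
        def __init__(self, code, parent=None):
            self.code = code          # list of (opcode, operand) pairs
            self.pc = 0
            self.stack = []
            self.parent = parent

    def run(code):
        frame = Frame(code)
        while True:                               # the single dispatch loop
            op, arg = frame.code[frame.pc]
            frame.pc += 1
            if op == "PUSH":
                frame.stack.append(arg)
            elif op == "CALL":                    # arg is the callee's code
                frame = Frame(arg, parent=frame)  # no host-language recursion
            elif op == "RETURN":
                value = frame.stack.pop()
                if frame.parent is None:
                    return value
                frame = frame.parent              # resume the caller in-loop
                frame.stack.append(value)

    callee = [("PUSH", 42), ("RETURN", None)]
    caller = [("CALL", callee), ("RETURN", None)]
    print(run(caller))                            # 42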

@pablogsal (Collaborator)

pablogsal commented Sep 22, 2021

but he said he had it working for a simple recursive factorial() implementation and reported that it gave great speedup for that (1.7x according to my notes!).

Although I am very excited about it, as a disclaimer: that was a quick micro-benchmark of the initial prototype (which still segfaulted with exceptions 🙂). I am running the full benchmark suite with all optimizations today. Although I don't expect that much, I hope we get some good numbers 🤞

I find his approach really elegant

Well, it is elegant when it works, but you cannot imagine the number of nightmarish bugs we had to deal with along the way. Also, the idea was originally suggested by @markshannon 😉 (several times, the first in https://www.python.org/dev/peps/pep-0651), and we have been working together on designing this version of the code (I did the implementation), as the approach matters a lot.

@gvanrossum (Collaborator)

1.7x according to my notes

With the code fixed, and using pyperformance with PGO-LTO, Pablo reported about a 3% speed increase. So not hugely exciting, but definitely significant (most of the changes we've made so far gain 1-3% in speed, with most around 1%).

@markshannon (Member, Author)

python/cpython#28488 actually caused a tiny slowdown.
https://gist.github.com/markshannon/e278f135dca55611ca982e42780bb4fd

There's quite a lot of cleaning up to do. So I'm not worried by the temporary slowdown.

@pablogsal (Collaborator)

pablogsal commented Oct 11, 2021

python/cpython#28488 actually caused a tiny slowdown.
https://gist.github.com/markshannon/e278f135dca55611ca982e42780bb4fd

Hummm... I am confused; this used to be much faster: python/cpython#28488 (comment) :(

Also, there is something weird there, because your benchmark shows pidigits to be a lot slower:

pidigits | 320 ms | 357 ms: 1.12x slower

The benchmarks I ran were on the first version that passed the test suite, so maybe we did something in the review that made it slower (there were also some refleaks, and we changed how we handle the stealing of arguments, so this may have impacted the times).


Edit: I made another benchmark run and this is what I get:

https://gist.github.com/pablogsal/34b542cc7e8366bdcaa2c650c0542895

which is slower than the first version of the PR, but it doesn't have the extreme cases that show up in your gist (pidigits and regex_effbot).

@markshannon markshannon self-assigned this Nov 8, 2021
@gramster gramster moved this to In Progress in Fancy CPython Board Jan 10, 2022
@markshannon markshannon moved this from In Progress to In Review in Fancy CPython Board Jan 25, 2022
@markshannon (Member, Author)

python/cpython#30855 allows us to specialize almost all calls, so we can then consider this "implemented".
It isn't finished, but with the possible exception of LOAD_GLOBAL, none of the specializations are.
The bulk of the work is done, allowing small incremental improvements from here on.

@markshannon markshannon moved this from In Review to Done in Fancy CPython Board Feb 10, 2022