
Implement CALL_FUNCTION specialization "family" for PEP 659 adaptive interpreter #54


Closed
markshannon opened this issue May 21, 2021 · 24 comments

@markshannon (Member)

There are a number of specializations of CALL_FUNCTION that make sense (each call shape is illustrated in the snippet after this list):

  • Calls to a Python function where the arguments and parameters match exactly.
  • Calls to a Python function with keyword arguments, or defaults, when the argument shuffle can be pre-computed.
  • Calls to builtin functions.
  • Calls to specific builtin functions, notably len and isinstance.
  • Calls to type with a single argument.
  • Instantiation of a "normal" class, that is a class that does not redefine __new__ and whose metaclass is type.
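
For concreteness, a hypothetical snippet in which each call would be a candidate for one of the specializations above (illustrative only):

    def f(a, b):
        return a + b

    class C:                 # a "normal" class: no __new__, metaclass is type
        pass

    f(1, 2)                  # Python function; arguments match parameters exactly
    f(1, b=2)                # keyword argument; the shuffle is fixed per call site
    abs(-5)                  # builtin function
    len([1, 2, 3])           # specific builtin: len
    isinstance(f, object)    # specific builtin: isinstance
    type(f)                  # type with a single argument
    C()                      # instantiation of a normal class
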
@Fidget-Spinner (Collaborator)

Fidget-Spinner commented Jun 26, 2021

  • Calls to builtin functions.
  • Calls to specific builtin functions, notably len and isinstance.

Hi Mark, I tried to cover these three (builtin calls generally, plus len and isinstance) with a single CALL_CFUNCTION opcode. It should generally speed up calls to any C function.
Diff here: python/cpython@main...Fidget-Spinner:call_function_specialize

I wonder if my approach is the right one. Would you like me to create a PR against CPython?

Edit:
5-15% less call overhead on pyperf microbenchmarks, with system tune and LTO+PGO: https://gist.github.com/Fidget-Spinner/b2f05e0f6b8851d41e09412f5189ce8f (measured with this benchmarking script: https://github.com/Fidget-Spinner/tools/blob/time_builtin_calls/scripts/time_builtin_calls.py). A sample of what gets optimized at Python startup is here: https://gist.github.com/Fidget-Spinner/b2f05e0f6b8851d41e09412f5189ce8f#gistcomment-3794651
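
For reference, a minimal pyperf-based microbenchmark in the same spirit as the linked script (the specific benchmarks here are made up, not taken from that script):

    # benchmark_calls.py -- run with: python benchmark_calls.py
    import pyperf

    runner = pyperf.Runner()
    runner.timeit("len(list)", stmt="len(x)", setup="x = [1, 2, 3]")
    runner.timeit("isinstance(int)", stmt="isinstance(1, int)")
    runner.timeit("abs(int)", stmt="abs(-5)")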

@markshannon (Member, Author)

My experimental implementation of CALL_FUNCTION specialization had the following specializations:

  • CALL_FUNCTION_PY_POSITIONAL_ONLY (Calls to Python functions with all arguments provided positionally)
  • CALL_FUNCTION_PY_WITH_ARGMAP (Calls to Python functions with a few keyword arguments and/or defaults; sketched after this comment)
  • CALL_FUNCTION_ALLOCATE_AND_INIT (Calls to Python classes that don't implement __new__)
  • CALL_FUNCTION_BUILTIN_0 (Implemented by bpo-44525: Specialize CALL_FUNCTION for C function calls python/cpython#26934)
  • CALL_FUNCTION_BUILTIN_NOARGS
  • CALL_FUNCTION_BUILTIN_FASTCALL

I didn't do a very thorough analysis of the benefits of the above specializations.
@Fidget-Spinner's tests suggest that CALL_FUNCTION_BUILTIN_FASTCALL isn't worth it.
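
To make the argmap idea concrete, here is a pure-Python sketch of precomputing the keyword-to-slot shuffle once per call site (all names and the data layout are invented for illustration; the real version would operate on code objects in C):

    import inspect

    def make_argmap(func, kwnames):
        # Resolved once, at specialization time: which parameter slot each
        # keyword argument lands in.
        params = list(inspect.signature(func).parameters)
        return [params.index(name) for name in kwnames]

    def call_with_argmap(func, args, kwvalues, argmap, nparams):
        # At call time, the shuffle is just a few indexed stores.
        slots = [None] * nparams
        slots[:len(args)] = args
        for value, slot in zip(kwvalues, argmap):
            slots[slot] = value
        return func(*slots)

    def f(a, b, c):
        return (a, b, c)

    argmap = make_argmap(f, ("c", "b"))                  # precomputed per call site
    print(call_with_argmap(f, (1,), (3, 2), argmap, 3))  # (1, 2, 3)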

@Fidget-Spinner (Collaborator)

The very first iteration of my PR also had the equivalent of CALL_FUNCTION_BUILTIN_NOARGS, and it showed no change in microbenchmarks either. Hopefully this helps narrow down what we need to specialize :).

@markshannon (Member, Author)

Another thing to consider is specializing for calls to particular builtin objects, e.g. type, isinstance, and len.

@Fidget-Spinner (Collaborator)

Mark, I've recently taken a stab at simple Python functions. Could you share your thoughts on how you intend to optimize this scenario?

Calls to a Python function where the arguments and parameters match exactly.

I see some possible micro-optimizations in initialize_locals, but I don't see where else to gain. Your help would be much appreciated.

@Fidget-Spinner (Collaborator)

Some updates: simple Python function specialization (CALL_FUNCTION_PY_SIMPLE, aka CALL_FUNCTION_PY_POSITIONAL_ONLY) yielded a 10% speedup in microbenchmarks on Windows with PGO [1]. I'm not confident this will move pyperformance, but I'll benchmark anyway. My personal rule of thumb is that we need a >20% speedup on microbenchmarks (depending on how common the opcode is) to affect pyperformance.

Additionally, many of the specializations for CALL_FUNCTION can be easily refactored and reused for CALL_METHOD. So I may try that and see if we can move pyperformance.

[1]

"python.exe"  -m timeit -s "x = lambda a:a" "x(1)"

Main:
5000000 loops, best of 5: 56.1 nsec per loop

Specialized:
5000000 loops, best of 5: 50.9 nsec per loop

Very rough WIP commit (I'm probably not optimizing very well): Fidget-Spinner/cpython@d8f3330

@markshannon (Member, Author)

My somewhat informal analysis of specializing CALL_FUNCTION is that >90% (probably >95%) of calls are one of:

  • Python functions
  • Builtin functions
  • Classes (both instantiations and calls to type and str)

with the first two dominating.

Specializing Python functions without builtin functions, or vice versa, will pay a relatively large cost for CALL_FUNCTION_ADAPTIVE. Specializing both should show a speedup.

@Fidget-Spinner I'm going to implement CALL_FUNCTION_PY_SIMPLE as there are some nice optimizations we can make by overlapping frames. Hopefully it will show a speed up, and then we can merge python/cpython#26934.
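
A toy model of the overlapping-frames idea, on the assumption that it means the arguments the caller pushed become the first local slots of the callee's frame (everything here is invented for illustration; the real layout is C-level):

    # One shared value stack; the callee's local slots alias the region
    # where the caller already pushed the arguments, so entry copies nothing.
    stack = []

    def enter_call(nargs):
        # The callee's locals start where the caller's arguments sit.
        return {"locals_base": len(stack) - nargs}

    stack.extend([10, 20])                     # caller pushes two arguments
    frame = enter_call(2)
    base = frame["locals_base"]
    assert stack[base:base + 2] == [10, 20]    # args are already the callee's locals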

@Fidget-Spinner (Collaborator)

Fidget-Spinner commented Jul 30, 2021

@Fidget-Spinner I'm going to implement CALL_FUNCTION_PY_SIMPLE as there are some nice optimizations we can make by overlapping frames. Hopefully it will show a speed up, and then we can merge python/cpython#26934.

@markshannon alright. Really looking forward to that :). Btw, I noticed that quite a few CALL_FUNCTION calls in pyperformance are to PyMethodObjects too. Simple method calls can probably reuse code from CALL_FUNCTION_PY_SIMPLE.

@markshannon (Member, Author)

Here are the numbers from running the benchmark suite. (May include some of the machinery, not just the benchmarks.)

2 billion executions of CALL_FUNCTION or its quickened variants.

32.5M specialization attempts
Breakdown:

Type              Attempts  Percentage
Builtin function  20.1M     61.9%
Class             5.95M     18.4%
Python function   4.8M      14.9%
Bound method      1.2M      3.7%
Other             0.3M      1.1%

This explains why specializing Python calls doesn't have as much effect as we might expect.

@markshannon (Member, Author)

Further breaking down the builtin function percentages:

Flags                               Percentage
METH_O                              30.2%
METH_FASTCALL                       26.6%
METH_FASTCALL | METH_KEYWORDS       2.3%
METH_VARARGS (incl. METH_KEYWORDS)  2.0%
METH_NOARGS                         0.8%
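
For orientation, some familiar builtins that use these calling conventions (the flag lives on the C-level method definition, not on the Python object; the assignments below reflect recent CPython and can vary between versions):

    len([1, 2])                    # METH_O: exactly one positional argument
    isinstance(1, int)             # METH_FASTCALL: positionals passed as a C array
    sorted([3, 1], reverse=True)   # METH_FASTCALL | METH_KEYWORDS
    locals()                       # METH_NOARGS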

@gvanrossum (Collaborator)

Very cool to know. Next question: what's the breakdown per benchmark?

@Fidget-Spinner (Collaborator)

Fidget-Spinner commented Aug 3, 2021

2 billion CALL_FUNCTION or quickened variants.
32.5M specialization attempts

Thanks for crunching the numbers, Mark! Just to clarify: those are specialization attempts, not runtime execution counts, right?

I suspect that while the number of specialization sites for Python functions and bound methods is low, the frequency of instruction execution is high. Specialization attempts are sometimes deceptive -- e.g. my experiments with METH_FASTCALL showed that at specialization time ~25% of calls had 2 or 3 arguments, but at instruction execution time that dropped to ~10% (python/cpython#26934 (comment))

@markshannon (Member, Author)

The 2 billion is the execution count. The rest are specialization attempts; multiply by ~64 to get an estimate of execution counts (32.5M × 64 ≈ 2.1 billion, consistent with the total above).

I don't see how the specialization numbers would be that different, or even differ by more than a small fraction.
Without any actual specialization, the specialization attempts are just samples of execution.

@Fidget-Spinner (Collaborator)

Fidget-Spinner commented Aug 4, 2021

Thanks for clearing up my misunderstanding, Mark!

I don't have runtime specialization data for the following, but it should help identify which benchmarks to expect some speedup in. From the dxpairs data in the faster-cpython/tools repo:

Top benchmarks for CALL_FUNCTION (truncated <5%):

 bm_mako.json                      616,984   9.56%
 bm_django_template.json         1,016,938   6.09%
 mypy.json                      68,488,372   6.06%
 bm_regex_effbot.json               50,195   5.69%
 bm_chameleon.json                 505,263   5.69%
 bm_xml_etree.json               4,702,957   5.60%

CALL_METHOD (truncated <5%):

bm_mako.json                      468,476   7.26%
bm_deltablue.json                 186,465   6.69%
bm_richards.json                1,465,382   5.25%

Sanity check with BINARY_SUBSCR, since that was a successful specialization:

bm_crypto_pyaes.json            3,723,498  12.14%
bm_nqueens.json                 3,553,021   9.18%
bm_fannkuch.json               17,918,473   8.16%
bm_scimark.json                23,790,519   6.90%
bm_hexiom.json                    274,217   5.75%
bm_nbody.json                   4,509,873   5.59%

The BINARY_SUBSCR stats roughly corroborate the benchmarks that sped up.

I'll be very shocked if specializing CALL_FUNCTION doesn't leave a dent in the benchmarks, especially considering that calls are so much slower than BINARY_SUBSCR, which showed benchmark speedups with similar percentages. Mark, if you don't mind, could you please point me to your branch for CALL_FUNCTION_PY_SIMPLE with overlapping frames? I'd like to try to collect more info.

@markshannon (Member, Author)

First, to summarize what I now believe the benchmarks and other evidence is telling us:

  • The vectorcall protocol for calling C functions is fast. Any specialization that adds overhead to that is likely to make things worse.
  • Specializing for specific cases (exactly one argument to METH_O, and maybe exactly two arguments to METH_FASTCALL) will produce a small speedup for those calls.
  • The ADAPTIVE form is not free, so unless we get high hit ratios, the additional cost of the ADAPTIVE form will eat any gains made by specialization.

While that might suggest that it isn't worth specializing CALL_FUNCTION, I still think it is, but we will need to slice it a bit differently (see the dispatch sketch after this comment):

  1. Specialize for a few specific callables, e.g. type and len, that are internally simple. Removing the call overhead and inlining should give good speedups in those cases.
  2. Specialize for Python functions, because we will be unable to use vectorcall if we want to avoid making C calls for Python calls.
  3. Specialize for the one or two cases of builtin functions that show some speedup (METH_O and maybe METH_FASTCALL with 2 arguments)
  4. Specialize for vectorcall, just to avoid the ADAPTIVE overhead for most of the remaining cases.

There is a downside to specializing for vectorcall: we lose the adaptive nature of the specializing interpreter. Once we have specialized on vectorcall, we are unlikely to escape that state, because all the callables we want to specialize for implement vectorcall.

Because (2) above depends on removing C calls when making Python calls, we may need to wait until that is implemented before we get any worthwhile speedups for CALL_FUNCTION.
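
A pure-Python sketch of the dispatch this slicing implies (the opcode names are from the discussion above or invented; the real logic is C code in the specializer):

    import types

    def specialize_call(callable_, nargs):
        # (1) a few specific callables that are internally simple
        if callable_ is len and nargs == 1:
            return "CALL_FUNCTION_LEN"
        if callable_ is isinstance and nargs == 2:
            return "CALL_FUNCTION_ISINSTANCE"
        # (2) Python functions, so Python-to-Python calls can avoid C recursion
        if isinstance(callable_, types.FunctionType):
            return "CALL_FUNCTION_PY_SIMPLE"
        # (3) the builtin cases that measurably win
        if isinstance(callable_, types.BuiltinFunctionType) and nargs == 1:
            return "CALL_FUNCTION_BUILTIN_O"
        # (4) everything else goes through a generic vectorcall specialization
        return "CALL_FUNCTION_VECTORCALL"

    print(specialize_call(len, 1))   # CALL_FUNCTION_LEN
    print(specialize_call(abs, 1))   # CALL_FUNCTION_BUILTIN_O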

@markshannon (Member, Author)

Latest data running pyperformance, using https://github.com/faster-cpython/cpython/tree/specialize-call-function-stats

2 billion executions of CALL_FUNCTION or its quickened variants.

31.7M specialization attempts
Breakdown:

Callee                                        Percentage
len                                           12.4%
isinstance                                    15.6%
Other C function, METH_O                      18.3%
C function, METH_FASTCALL (2 arguments)       6.5%
C function, METH_FASTCALL (other arguments)   5.1%
C function (all other forms)                  4.7%
Python function (simple, matching arguments)  13.4%
Python function (other)                       1.2%
type (one argument)                           2.5%
Other classes                                 15.4%
Bound method                                  3.8%
Other                                         1.1%

@Fidget-Spinner (Collaborator)

I've somewhat given up on this :(. I updated the PR with specialized bytecode just for len and isinstance, and even that showed no speedup in pyperformance:

https://gist.github.com/Fidget-Spinner/07d6a123102c9ff8819f52bb8f9b57a6

@gvanrossum (Collaborator)

Never mind the benchmarks, do you have statistics about what fraction of CALL_FUNCTION opcodes are actually specialized? As Mark explained, if too many calls go through the ADAPTIVE variant and end up being deoptimized, that's expensive. This may explain why just specializing len and isinstance doesn't do us much good.

In any case, @pablogsal is working on optimizing specifically Python-to-Python calls. I don't have a link to his code yet, and I don't know exactly what his approach is, but he said he had it working for a simple recursive factorial() implementation and reported that it gave a great speedup there (1.7x according to my notes!).

@Fidget-Spinner (Collaborator)

Fidget-Spinner commented Sep 22, 2021

Never mind the benchmarks, do you have statistics about what fraction of CALL_FUNCTION opcodes are actually specialized?

According to Mark's stats above for pyperformance, more than 50% should be specialized. Running a workload of python -m test test_typing test_re test_dis test_zlib gives me 40% (unless I misinterpreted the stats, it's 985215 / (710360 + 724450 + 985215)). Very few deoptimizations for function calls.

    call_function.specialization_success : 810
    call_function.specialization_failure : 11976
    call_function.hit : 985215
    call_function.deferred : 710360
    call_function.miss : 9
    call_function.deopt : 3
    call_function.unquickened : 724450
    call_function.specialization_failure_kinds[10] : 21
    call_function.specialization_failure_kinds[11] : 3119  # Bound method 
    call_function.specialization_failure_kinds[12] : 170
    call_function.specialization_failure_kinds[13] : 1475  # PyCFunction with keywords
    call_function.specialization_failure_kinds[14] : 176 
    call_function.specialization_failure_kinds[15] : 26
    call_function.specialization_failure_kinds[16] : 0
    call_function.specialization_failure_kinds[17] : 3131  # Python function
    call_function.specialization_failure_kinds[18] : 3837  # Immutable class call, eg X()
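
For reference, the ~40% figure comes straight from those counters:

    hit, deferred, unquickened = 985_215, 710_360, 724_450
    print(hit / (hit + deferred + unquickened))   # ~0.407, i.e. ~40%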

My own experiments suggest the adaptive opcode overhead is only significant for things that already have low calling overhead, e.g. vectorcall C functions without keywords. Normal Python functions and classes don't show any measurable slowdown even in microbenchmarks because they're comparatively very expensive.

In any case, Pablo is working on optimizing specifically Python-to-Python calls.

Yep, he has a PR up at python/cpython#28488. I find his approach really elegant (wish I'd thought of it :P). When CALL_FUNCTION detects a Python function, it doesn't call _PyEval_Vector and eventually _PyEval_EvalFrameDefault. Instead, it sets up a new interpreter frame inline and then jumps to the start of _PyEval_EvalFrameDefault.
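
A toy model of that control flow (everything here is invented for illustration; the real change is C code inside _PyEval_EvalFrameDefault): instead of recursing into a fresh C-level eval call, the loop links a new frame and keeps dispatching:

    class Frame:
        def __init__(self, code, parent=None):
            self.code = code          # list of (opcode, operand) pairs
            self.pc = 0
            self.stack = []
            self.parent = parent

    def run(code):
        frame = Frame(code)
        while True:                               # the single dispatch loop
            op, arg = frame.code[frame.pc]
            frame.pc += 1
            if op == "PUSH":
                frame.stack.append(arg)
            elif op == "CALL":                    # arg is the callee's code
                frame = Frame(arg, parent=frame)  # no host-language recursion
            elif op == "RETURN":
                value = frame.stack.pop()
                if frame.parent is None:
                    return value
                frame = frame.parent              # resume the caller in-loop
                frame.stack.append(value)

    callee = [("PUSH", 42), ("RETURN", None)]
    caller = [("CALL", callee), ("RETURN", None)]
    print(run(caller))                            # 42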

@pablogsal (Collaborator)

pablogsal commented Sep 22, 2021

but he said he had it working for a simple recursive factorial() implementation and reported that it gave great speedup for that (1.7x according to my notes!).

Although I am very excited about it, as a disclaimer: that was a quick micro-benchmark of the initial prototype (which still segfaulted with exceptions 🙂). I am running the full benchmark suite with all optimizations today. Although I don't expect that much, I hope we get some good numbers 🤞

I find his approach really elegant

Well, it is elegant when it works, but you cannot imagine the number of nightmarish bugs we had to deal with along the way. Also, the idea was originally suggested by @markshannon 😉 (several times, the first in https://www.python.org/dev/peps/pep-0651), and we have been working together on designing this version of the code (I did the implementation), as the approach matters a lot.

@gvanrossum (Collaborator)

1.7x according to my notes

With the code fixed, and using pyperformance with PGO-LTO, Pablo reported about a 3% speed increase. So not hugely exciting, but definitely significant (most of the changes we've made so far gain 1-3% in speed, with most around 1%).

@markshannon (Member, Author)

python/cpython#28488 actually caused a tiny slowdown.
https://gist.github.com/markshannon/e278f135dca55611ca982e42780bb4fd

There's quite a lot of cleaning up to do. So I'm not worried by the temporary slowdown.

@pablogsal (Collaborator)

pablogsal commented Oct 11, 2021

python/cpython#28488 actually caused a tiny slowdown.
https://gist.github.com/markshannon/e278f135dca55611ca982e42780bb4fd

Hummm... I am confused; this used to be much faster: python/cpython#28488 (comment) :(

Also, there is something weird there, because your benchmark shows pidigits to be a lot slower:

pidigits | 320 ms | 357 ms: 1.12x slower

The benchmarks I ran were on the first version that passed the test suite, so maybe we did something in the review that made it slower (there were also some refleaks, and we changed how we handle the stealing of arguments, so this may have impacted the times).


Edit: I made another benchmark run and this is what I get:

https://gist.github.com/pablogsal/34b542cc7e8366bdcaa2c650c0542895

which is slower than the first version of the PR, but it doesn't have the extreme cases that show up in your gist (pidigits and regex_effbot).

@markshannon markshannon self-assigned this Nov 8, 2021
@gramster gramster moved this to In Progress in Fancy CPython Board Jan 10, 2022
@markshannon markshannon moved this from In Progress to In Review in Fancy CPython Board Jan 25, 2022
@markshannon (Member, Author)

python/cpython#30855 allows us to specialize almost all calls, so we can then consider this "implemented".
It isn't finished, but with the possible exception of LOAD_GLOBAL, none of the specializations are.
The bulk of the work is done, allowing small incremental improvements from here on.

@markshannon markshannon moved this from In Review to Done in Fancy CPython Board Feb 10, 2022