Implement CALL_FUNCTION specialization "family" for PEP 659 adaptive interpreter #54
There are a number of specializations of `CALL_FUNCTION` that make sense:

- Calls to `len` and `isinstance`.
- Calls to `type` with a single argument.
- Calls to classes that don't override `__new__` and whose metaclass is `type`.
Hi Mark, I tried to cover these three with a single specialization. I wonder if my approach is the right one? Would you like me to create a PR against CPython? Edit: (…)
My experimental implementation of specialization of `CALL_FUNCTION` (…)
I didn't do a very thorough analysis of the benefits of the above specializations.
The very first iteration of my PR also had the equivalent of (…)
Another thing to consider is specializing for calls to particular builtin objects, e.g. `len` and `isinstance`.
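A hedged sketch of that idea for `len`, assuming a `cache->builtin_len` pointer captured at specialization time (again, all instruction and macro names are hypothetical):

```c
case TARGET(CALL_FUNCTION_LEN): {
    PyObject *arg = TOP();
    PyObject *callable = SECOND();
    /* Guard: is the callable still the len builtin we cached? */
    DEOPT_IF(callable != cache->builtin_len, CALL_FUNCTION);
    Py_ssize_t len = PyObject_Length(arg);
    if (len < 0) {
        goto error;        /* propagate the exception */
    }
    PyObject *res = PyLong_FromSsize_t(len);
    if (res == NULL) {
        goto error;
    }
    Py_DECREF(arg);
    Py_DECREF(callable);
    STACK_SHRINK(1);
    SET_TOP(res);
    DISPATCH();
}
```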
Mark, I've recently taken a stab at specializing simple Python functions. Can I hear your thoughts on how you intend to optimize this scenario?
I see some micro-optimizations possible in (…)
Some updates: simple Python function specialization (…). Additionally, many of the specializations for (…) [1]

[1] Very rough WIP commit; I'm probably not optimizing very well: Fidget-Spinner/cpython@d8f3330
My somewhat informal analysis of specializing (…)
Specializing Python functions without builtin functions, or vice versa, will pay a relatively large cost for (…). @Fidget-Spinner I'm going to implement (…)
@markshannon alright. Really looking forward to that :). Btw, I noticed that quite a few (…)
Here are the numbers from running the benchmark suite (may include some of the machinery, not just the benchmarks): 2 billion `CALL_FUNCTION` executions, 32.5M specialization attempts. (…)
Which explains why specializing Python calls doesn't have as much effect as we might expect.
Further breaking down the builtin function percentages: (…)
Very cool to know. Next question: what's the breakdown per benchmark?
Thanks for crunching the numbers, Mark! Just to clarify: those are specialization attempts, not runtime execution counts, right? I suspect that while the number of specialization sites for Python functions and bound methods is low, the frequency of instruction execution is high. Specialization attempts are sometimes deceptive -- e.g. my experiments with METH_FASTCALL showed that at specialization time ~25% had 2 or 3 arguments, but at instruction execution time that dropped to ~10% (python/cpython#26934 (comment))
The 2 billion is the execution count. The rest are specialization attempts; multiply by ~64 to get an estimate of execution counts. I don't see how the specialization numbers would be that different, or even differ by more than a small fraction.
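(Sanity-checking that estimate: 32.5M specialization attempts × ~64 executions per attempt ≈ 2.1 billion, which lines up with the 2 billion total execution count above.)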
Thanks for clearing up my misunderstandings, Mark! I don't have runtime specialization data for the following, but it should be useful for knowing which benchmarks to expect some speedup in. From the dxpairs data in the faster-cpython/tools repo, the top benchmarks for (…)
Sanity check with (…)
The (…). I'll be very shocked if specializing (…)
First, to summarize what I now believe the benchmarks and other evidence are telling us: (…)

While that might suggest that it isn't worth specializing (…)
There is a downside to specializing for vectorcall: we lose the adaptive nature of the specializing interpreter. Once we have specialized on vectorcall, we are unlikely to escape that state, because all the callables we want to specialize for implement vectorcall. Because (3) above depends on removing C calls when making Python calls, we may need to wait until that is implemented before we get any worthwhile speedups for (…)
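For reference, a hedged sketch of what such a catch-all vectorcall specialization might look like; `PyVectorcall_Function` is the real C-API accessor for an object's vectorcall slot, but the instruction and macros are hypothetical:

```c
case TARGET(CALL_FUNCTION_VECTORCALL): {
    PyObject **args = stack_pointer - oparg;   /* callable sits below the args */
    PyObject *callable = args[-1];
    vectorcallfunc vc = PyVectorcall_Function(callable);
    /* Guard: fall back to the adaptive instruction if this callable
       doesn't support the vectorcall protocol. */
    DEOPT_IF(vc == NULL, CALL_FUNCTION);
    PyObject *res = vc(callable, args, (size_t)oparg, NULL);
    for (Py_ssize_t i = 0; i < oparg; i++) {
        Py_DECREF(args[i]);
    }
    Py_DECREF(callable);
    stack_pointer = args - 1;   /* pop the callable and all arguments */
    if (res == NULL) {
        goto error;
    }
    PUSH(res);
    DISPATCH();
}
```

Note that the guard passes for nearly every callable of interest, which is exactly the "unlikely to escape that state" problem: the instruction stays specialized but generic, and the site never re-adapts to a more specific form.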
Latest data running pyperformance, using https://github.com/faster-cpython/cpython/tree/specialize-call-function-stats: 2 billion executions, 31.7M specialization attempts. (…)
I've somewhat given up on this :(. I updated the PR with specialized bytecode just for `len` and `isinstance`. https://gist.github.com/Fidget-Spinner/07d6a123102c9ff8819f52bb8f9b57a6
Never mind the benchmarks, do you have statistics on what fraction of `CALL_FUNCTION` opcodes are actually specialized? As Mark explained, if too many calls go through the ADAPTIVE variant and end up being deoptimized, that's expensive. This may explain why just specializing `len` and `isinstance` doesn't do us much good. In any case, @pablogsal is working on optimizing specifically Python-to-Python calls. I don't have a link to his code yet, and I don't know exactly what his approach is, but he said he had it working for a simple recursive factorial() implementation and reported that it gave a great speedup for that (1.7x according to my notes!).
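For context, a loose sketch of the adaptive cycle being described (all names here, such as `SpecializationCache`, `specialize_call_function`, and the dispatch macros, are hypothetical stand-ins, not CPython's actual code): every execution of the adaptive form pays a counter check plus the generic call, and a specialized instruction whose guard fails deoptimizes right back into this path.

```c
case TARGET(CALL_FUNCTION_ADAPTIVE): {
    SpecializationCache *cache = get_cache(next_instr);
    if (cache->counter == 0) {
        /* Counter expired: try to rewrite this call site into a
           specialized instruction.  On failure the specializer
           resets the counter to a backoff value, so a site that
           keeps failing keeps paying for the attempts. */
        specialize_call_function(next_instr, oparg, cache);
        DISPATCH_SAME_INSTR();   /* re-execute, possibly specialized now */
    }
    cache->counter--;
    goto do_generic_call;        /* miss: ordinary CALL_FUNCTION path */
}
```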
According to Mark's stats above for pyperformance, more than 50% should be specialized. Running a workload of (…)
My own experiments suggest the adaptive opcode overhead is only significant for things with already low calling overhead, e.g. vectorcall C functions without keywords. Normal Python functions and classes don't show any measurable slowdown even in microbenchmarks because they're comparatively very expensive to call.
Yeap, he has a PR up at python/cpython#28488. I find his approach really elegant (wish I'd thought of it :P). When (…)
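The core idea, as I understand the PR (a hedged pseudocode sketch, not the PR's actual code): when the callable is a Python function, push a new interpreter frame and jump back to the top of the existing eval loop instead of recursing into it in C, so Python-to-Python calls stop consuming C stack frames. `push_frame`, `start_frame`, and `resume_frame` are illustrative names:

```c
case TARGET(CALL_FUNCTION): {
    PyObject **args = stack_pointer - oparg;
    PyObject *callable = args[-1];
    if (PyFunction_Check(callable)) {
        /* Build an interpreter frame for the callee and make it
           current; the eval loop keeps running in this C frame. */
        InterpreterFrame *new_frame = push_frame(callable, args, oparg);
        new_frame->previous = frame;   /* linked for RETURN_VALUE */
        frame = new_frame;
        goto start_frame;              /* no C-level recursion */
    }
    goto do_c_call;   /* non-Python callables: ordinary C call */
}

case TARGET(RETURN_VALUE): {
    PyObject *retval = POP();
    if (frame->previous != NULL) {
        frame = frame->previous;       /* resume the caller in place */
        PUSH(retval);
        goto resume_frame;
    }
    return retval;                     /* outermost frame: leave the loop */
}
```

Besides saving the C call overhead, this keeps deep Python recursion from exhausting the C stack.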
Although I am very excited about it, as a disclaimer: that was a quick micro-benchmark of the initial prototype (which still segfaulted with exceptions 🙂). I am running the full benchmark suite with all optimizations today. Although I don't expect that much, I hope we get some good numbers 🤞
Well, it is elegant when it works, but you cannot imagine the number of nightmarish bugs we had to deal with in the meantime. Also, the idea for this was originally suggested by @markshannon 😉 (several times, the first one in https://www.python.org/dev/peps/pep-0651), and we have been working together on designing this version of the code (I did the implementation), as the approach matters a lot.
With the code fixed, and using PyPerformance with PGO-LTO, Pablo reported about a 3% speed increase. So not quite exciting, but definitely significant (most of the changes we've made so far gain 1-3% in speed, with most around 1%).
python/cpython#28488 actually caused a tiny slowdown. There's quite a lot of cleaning up to do, so I'm not worried by the temporary slowdown.
Hmmm... I am confused, this used to be much faster: python/cpython#28488 (comment) :( Also, there is something weird there, because your benchmark shows (…)
The benchmarks I ran were of the first version that passed the test suite, so maybe we did something in the review that made it slower (there were also some refleaks, and we changed how we handle the stealing of arguments, so this may have impacted the times). Edit: I made another benchmark run and this is what I get: https://gist.github.com/pablogsal/34b542cc7e8366bdcaa2c650c0542895, which is slower than the first version of the PR, but it doesn't have the extreme cases that you have in the other gists (…)
python/cpython#30855 allows us to specialize almost all calls, so we can then consider this "implemented". |