Adaptive, specializing interpreter #28
Comments
How do the existing inline cache data structures compare to your implementation options? Also, for the LOAD_GLOBAL example specifically, IIRC Dino's Shadowcode (#3) has some trick where a single version check validates the version of both dicts (globals and builtins).
I'd describe the current cache as:
AFAICT it checks both versions:
This is a version without dictionary watchers; if you look at the version in Cinder, we actually do LOAD_GLOBAL with a single load with an indirection and no version checks: https://github.com/facebookincubator/cinder/blob/cinder/3.8/Python/ceval.c#L4824
To be pedantic,
Once python/cpython#26264 is merged we should implement the following optimization "families" (in approximate order of importance):
See https://bugs.python.org/issue38278 regarding LOAD_METHOD.
The framework is all implemented. I think it is better to track individual families in their own issues.
This explains the adaptive, specializing interpreter mentioned as the core plan of stage 1 in https://github.com/markshannon/faster-cpython/blob/master/plan.md
This requires and builds on the quickening interpreter #27.
Overview
We exploit the capability of the quickening interpreter to support self-modifying code to add specialization "families" of bytecodes for those bytecodes which would gain most from specialization. The bytecodes within a "family" can replace themselves with other members of the family at runtime. This allows us to perform quite aggressive speculative specialization, since de-optimization is cheap.
For each "plain" bytecode that would benefit from specialization we add an "adaptive" bytecode and several specialized forms.
During the quickening phase the "plain" bytecodes are replaced with the "adaptive" bytecode. At runtime the "adaptive" bytecode maintains a counter and when that counter reaches zero it attempts to specialize itself. Otherwise it jumps to the "plain" bytecode to perform the operation. Should specialization fail, the counter is set to some appropriate number, say 20.
The adaptive forms are coded roughly as:
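Roughly, in the style of a `ceval.c` case and using the `LOAD_GLOBAL` family from the example below (`DataEntry`, `GET_CACHE`, `specialize_load_global`, `BACKOFF`, and the dispatch labels are illustrative names, not the actual implementation):

```c
/* Illustrative sketch only: DataEntry, GET_CACHE, specialize_load_global,
   BACKOFF and the dispatch labels are hypothetical names. */
case LOAD_GLOBAL_ADAPTIVE: {
    DataEntry *cache = GET_CACHE(next_instr);
    if (cache->counter == 0) {
        /* Counter reached zero: try to rewrite this instruction in place
           as one of the specialized members of the family. */
        if (specialize_load_global(next_instr, frame, cache) < 0) {
            /* Specialization failed: back off before trying again. */
            cache->counter = BACKOFF;   /* some appropriate number, say 20 */
        }
        /* Re-dispatch the same instruction, which may now be specialized. */
        goto dispatch_same_instruction;
    }
    cache->counter--;
    /* Not specializing yet: perform the operation as the plain bytecode does. */
    goto do_LOAD_GLOBAL;
}
```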
The specialized forms are coded roughly as:
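Again only an illustrative sketch, using the `LOAD_GLOBAL_MODULE` form from the example below; the cache fields, `SET_OPCODE`, and the cached value are assumptions about what the cache entry holds:

```c
/* Illustrative sketch only: cache fields and helper macros are hypothetical. */
case LOAD_GLOBAL_MODULE: {
    DataEntry *cache = GET_CACHE(next_instr);
    PyDictObject *globals = (PyDictObject *)GLOBALS();
    if (globals->ma_version_tag != cache->module_version) {
        /* Guard failed: de-optimize by rewriting this instruction back to
           the adaptive form, then re-execute it.  De-optimization is cheap:
           it is just an in-place opcode write. */
        SET_OPCODE(next_instr, LOAD_GLOBAL_ADAPTIVE);
        goto dispatch_same_instruction;
    }
    /* Guard passed: push the value recorded at specialization time. */
    PyObject *value = cache->value;
    Py_INCREF(value);
    PUSH(value);
    DISPATCH();
}
```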
Equivalence to specializing, optimizing VM where the region of optimization is one bytecode
Specializing, optimizing VMs work by selecting regions of code to optimize and compile.
V8 optimizes whole functions. PyPy and LuaJIT optimize loopy traces. HHVM optimizes tracelets, then stitches multiple tracelets together.
By specializing just one bytecode at a time we do not need JIT compilation; we can compile the specializations ahead of time by implementing them in C. Obviously this will not perform nearly as well as optimizing larger regions, but we can still achieve worthwhile speedups.
Example
Using `LOAD_GLOBAL` as an example, we add three new bytecodes: `LOAD_GLOBAL_ADAPTIVE`, `LOAD_GLOBAL_MODULE`, `LOAD_GLOBAL_BUILTIN`.

`LOAD_GLOBAL_ADAPTIVE` is implemented as above. `LOAD_GLOBAL_MODULE` is a specialization for when the value is stored in the module's dictionary. `LOAD_GLOBAL_BUILTIN` is a specialization for when the value is stored in the builtins' dictionary.

Specialization of `LOAD_GLOBAL_ADAPTIVE` determines whether the value is in the module's dictionary or the builtins' dictionary, records the expected version numbers and the expected value, then converts the bytecode to `LOAD_GLOBAL_MODULE` or `LOAD_GLOBAL_BUILTIN`.

Note:
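As a sketch of the specialization step described above (the function name, signature, cache fields, and `SET_OPCODE` are illustrative, not the actual code, and error handling is abbreviated):

```c
/* Illustrative sketch of specializing LOAD_GLOBAL_ADAPTIVE in place. */
static int
specialize_load_global(_Py_CODEUNIT *instr, PyObject *name,
                       PyDictObject *globals, PyDictObject *builtins,
                       DataEntry *cache)
{
    PyObject *value = PyDict_GetItemWithError((PyObject *)globals, name);
    if (value != NULL) {
        /* Found in the module's dict: record its version and the value,
           then rewrite the instruction. */
        cache->module_version = globals->ma_version_tag;
        cache->value = value;
        SET_OPCODE(instr, LOAD_GLOBAL_MODULE);
        return 0;
    }
    if (PyErr_Occurred()) {
        return -1;
    }
    value = PyDict_GetItemWithError((PyObject *)builtins, name);
    if (value != NULL) {
        /* Found in builtins: record both versions, since a later assignment
           to the module's dict must also invalidate this cache entry. */
        cache->module_version = globals->ma_version_tag;
        cache->builtin_version = builtins->ma_version_tag;
        cache->value = value;
        SET_OPCODE(instr, LOAD_GLOBAL_BUILTIN);
        return 0;
    }
    /* Name not found (or an error occurred): report failure so that the
       adaptive form backs off. */
    return -1;
}
```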
Implementation
Alongside the array of quickened instructions we need a supporting data structure holding counts, expected values and versions,
original operands, and cached values.
During quickening, memory for the specialized instructions must be allocated in such a way that it can be accessed quickly during execution.
Possible implementations
Entries:
Indexing:
Layout:
Preferred implementation
The best approach (fastest in general, though not the simplest) is to allow variable-sized entries, indexed by a function of offset and oparg, and to use back-to-back arrays.
Getting the data for a bytecode would look something like:
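One plausible reading of that computation, assuming the data array sits back-to-back with the quickened instructions and the instruction's oparg carries the correction from a `nexti/2` baseline (both assumptions, not the actual formula):

```c
/* Sketch only: first_data_item points at the data array laid out
   back-to-back with the quickened instruction array. */
DataItem *data = &first_data_item[(nexti >> 1) + oparg];
```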
where `nexti` is the index of the next instruction, assuming that, on average, each instruction needs half a data item (approximately 25% of instructions need data, and those that do need 2 items on average).

Indexing 3 should be faster than indexing 2 when combined with back-to-back arrays and variable-sized entries, because it reduces register use and is more compact. The additional ALU operation(s) required shouldn't matter too much.
On a 64-bit machine, a `DataItem` would be 8 bytes.
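As an illustration of what such an item might hold (the field names and widths are assumptions, not a specification), an 8-byte `DataItem` could be a union along these lines, with a variable-sized entry built from one or more consecutive items:

```c
#include <Python.h>
#include <stdint.h>

/* Sketch only: one possible shape for an 8-byte DataItem on a 64-bit machine. */
typedef union {
    struct {
        uint32_t counter;          /* adaptive form: countdown to the next
                                      specialization attempt */
        uint32_t original_oparg;   /* operand of the original plain bytecode */
    } adaptive;
    uint64_t version;              /* an expected dict/type version tag */
    PyObject *object;              /* a cached value, e.g. the expected global */
} DataItem;
```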