
Adaptive, specializing interpreter #28

Closed

markshannon opened this issue Mar 24, 2021 · 8 comments

@markshannon (Member)

This explains the adaptive, specializing interpreter described as the core of stage 1 in https://github.com/markshannon/faster-cpython/blob/master/plan.md
This requires and builds on the quickening interpreter #27.

Overview

We exploit the quickening interpreter's support for self-modifying code to add specialization "families" of bytecodes for those bytecodes that would gain the most from specialization. The bytecodes within a "family" can replace themselves with other members of the family at runtime. This allows us to perform quite aggressive speculative specialization, since de-optimization is cheap.

For each "plain" bytecode that would benefit from specialization, we add an "adaptive" bytecode and several specialized forms.
During the quickening phase, the "plain" bytecodes are replaced with the "adaptive" bytecode. At runtime, the "adaptive" bytecode maintains a counter; when that counter reaches zero, it attempts to specialize itself. Otherwise, it jumps to the "plain" bytecode to perform the operation. Should specialization fail, the counter is set to some appropriate number, say 20.

The adaptive forms are coded roughly as:

if counter == 0:
    nexti -= 1  # Dispatch will jump to the specialized version
    specialize(instructions[nexti])
else:
    counter -= 1
    goto PLAIN_BYTECODE

The specialized forms are coded roughly as:

if predicate(obj):
    saturating_increment(counter)
    do_specialized_operation()
else:
    saturating_decrement(counter)
    if counter == 0:
        replace_self_with_adaptive_form()
    goto PLAIN_BYTECODE
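
To make the rewriting concrete, here is a self-contained toy interpreter (purely illustrative, not CPython code; the ADD family, the operand stack, and the always-succeeding guard are invented for the example):

    #include <stdio.h>

    enum { PUSH_CONST, ADD, ADD_ADAPTIVE, ADD_INT, HALT };

    int main(void) {
        /* "Quickened" code: the plain ADD has already been replaced
           by its adaptive form. */
        unsigned char code[] = { PUSH_CONST, 2, PUSH_CONST, 3, ADD_ADAPTIVE, HALT };
        int stack[8], sp = 0, pc = 0, counter = 0;

        for (;;) {
            switch (code[pc]) {
            case PUSH_CONST:
                stack[sp++] = code[pc + 1];
                pc += 2;
                break;
            case ADD_ADAPTIVE:
                if (counter == 0) {
                    /* A real implementation would inspect the operands here;
                       this toy always succeeds and rewrites itself. */
                    code[pc] = ADD_INT;
                    continue;   /* re-dispatch; now hits the specialized form */
                }
                counter--;
                /* fall through to the generic implementation */
            case ADD:
                stack[sp - 2] += stack[sp - 1];
                sp--;
                pc += 1;
                break;
            case ADD_INT:
                /* A real guard would check the operand types and, on failure,
                   decrement a counter and fall back to the ADD implementation. */
                stack[sp - 2] += stack[sp - 1];
                sp--;
                pc += 1;
                break;
            case HALT:
                printf("result: %d\n", stack[sp - 1]);  /* prints 5 */
                return 0;
            }
        }
    }

The key point is that de-optimization is just another store to code[pc], which is why speculation can afford to be aggressive.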

Equivalence to a specializing, optimizing VM where the region of optimization is one bytecode

Specializing, optimizing VMs work by selecting regions of code to optimize and compile.
V8 optimizes whole functions. PyPy and LuaJIT optimize loopy traces. HHVM optimizes tracelets, then multiple tracelets stitched together.

By specializing just one bytecode at a time, we do not need JIT compilation: the specializations can be compiled ahead of time by implementing them in C. Obviously this will not perform as well as optimizing larger regions, but we can still achieve worthwhile speedups.

Example

Using LOAD_GLOBAL as an example, we add three new bytecodes: LOAD_GLOBAL_ADAPTIVE, LOAD_GLOBAL_MODULE, LOAD_GLOBAL_BUILTIN.

LOAD_GLOBAL_ADAPTIVE is implemented as above.

LOAD_GLOBAL_MODULE is a specialization for when the value is stored in the module's dictionary.

    if (UNLIKELY(globals->version != EXPECTED_VERSION)) {
        goto maybe_deopt_load_global_module; // Handles counter, etc.
    }
    saturating_increment(counter);
    INCREF(cached_value);
    PUSH(cached_value);

LOAD_GLOBAL_BUILTIN is a specialization for when the value is stored in the builtins' dictionary.

    if (UNLIKELY(globals->version != EXPECTED_VERSION1)) {
        goto maybe_deopt_load_global_builtin; // Handles counter, etc.
    }
    if (UNLIKELY(builtins->version != EXPECTED_VERSION2)) {
        goto maybe_deopt_load_global_builtin;
    }
    saturating_increment(counter);
    INCREF(cached_value);
    PUSH(cached_value);

Specialization of LOAD_GLOBAL_ADAPTIVE determines whether the value is in the module's dictionary or the builtins' dictionary, records the expected version number(s) and the expected value, then converts the bytecode to LOAD_GLOBAL_MODULE or LOAD_GLOBAL_BUILTIN.
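
For illustration, the specialization step might look roughly like this (a sketch: the Instruction and DataItem types, the DICT_VERSION accessor, and the cache layout are assumptions, not the actual CPython structures):

    static int
    specialize_load_global(PyObject *globals, PyObject *builtins,
                           PyObject *name, Instruction *instr, DataItem *cache)
    {
        PyObject *value = PyDict_GetItemWithError(globals, name);
        if (value != NULL) {
            cache[0].version = DICT_VERSION(globals);  /* hypothetical accessor */
            cache[1].cached_value = value;
            instr->opcode = LOAD_GLOBAL_MODULE;
            return 0;
        }
        if (PyErr_Occurred()) {
            return -1;  /* a real error, not just a missing key */
        }
        value = PyDict_GetItemWithError(builtins, name);
        if (value != NULL) {
            cache[0].version = DICT_VERSION(globals);   /* EXPECTED_VERSION1 */
            cache[1].version = DICT_VERSION(builtins);  /* EXPECTED_VERSION2 */
            cache[2].cached_value = value;
            instr->opcode = LOAD_GLOBAL_BUILTIN;
            return 0;
        }
        return -1;  /* failed; the adaptive form resets its counter (say, to 20) */
    }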

Note:

The above uses dictionary versions. It is possible to use the dictionary keys to allow a specialization that works when a global variable is present, but that is out of scope for this issue.

Implementation

Alongside the array of quickened instructions we need a supporting data structure holding counts, expected values and versions, original operands, and cached values.
During quickening, memory for the specialized instructions must be allocated in such a way that it can be accessed quickly during execution.
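
As a concrete (assumed, not actual) shape for that data, each entry could be built from 8-byte items along these lines:

    #include <Python.h>   /* PyObject */
    #include <stdint.h>

    typedef union {
        struct {
            uint16_t counter;          /* adaptive/deopt counter         */
            uint8_t  original_opcode;  /* for restoring the plain form   */
            uint8_t  original_oparg;
        } adaptive;
        uint64_t  version;             /* expected dict/type version tag */
        PyObject *cached_value;        /* e.g. a cached global's value   */
    } DataItem;                        /* 8 bytes on a 64-bit machine    */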

Possible implementations

Entries:

  1. Fixed-size entries: all bytecodes use the same size of data structure. This is inflexible but simple.
  2. Variable-sized entries: flexible but more complex. Note that entries can vary in size across families, not within them; the size of the data structure is fixed for a given family of bytecodes.

Indexing:

  1. Index by external array. Maps instruction offset to offset in the data array. Slow, as it requires an extra indirection.
  2. Index stored as oparg. Fast, but limits the number of entries to 256, or fewer if variable-sized entries are used.
  3. Index as a function of instruction offset and oparg. A bit slower than 2, but allows an effectively unlimited number of entries.

Layout:

  1. Two arrays. Requires two registers for indexing implementation 2, but three registers for indexing implementation 3.
  2. Back-to-back arrays. Instructions go forwards in memory from a shared pointer, data goes backwards. Requires two registers for indexing implementation 3.

Preferred implementation

The best approach (fastest in general, though not the simplest) is to allow variable-sized entries, indexed by a function of offset and oparg, using back-to-back arrays.
Getting the data for a bytecode would look something like:

DataItem *data = &((DataItem *)instruction_base)[-(nexti>>1) - oparg];

where nexti is the index of the next instruction, assuming that, on average, each instruction needs half a data item (approximately 25% of instructions need data, and those need two items on average).

Indexing 3 should be faster than indexing 2 when combined with back-to-back arrays and variable-sized entries, because it reduces register use and is more compact. The additional ALU operation(s) required shouldn't matter too much.

On a 64-bit machine, a DataItem would be 8 bytes.
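
Setting up the back-to-back arrays could then look something like this (a sketch under the assumptions above, reusing the DataItem union sketched earlier and an invented 2-byte Instruction type):

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct {
        uint8_t opcode;
        uint8_t oparg;
    } Instruction;

    /* One allocation: data items grow backwards from instruction_base,
       quickened instructions grow forwards from it. */
    static Instruction *
    allocate_quickened(size_t n_instructions, size_t n_data_items)
    {
        size_t data_bytes = n_data_items * sizeof(DataItem);
        char *block = malloc(data_bytes + n_instructions * sizeof(Instruction));
        if (block == NULL) {
            return NULL;
        }
        return (Instruction *)(block + data_bytes);  /* instruction_base */
    }

Freeing would pass instruction_base minus data_bytes back to free(); during execution, the data is reached with the negative-index formula above, so no second base pointer is needed.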

@gvanrossum (Collaborator)

How do the existing inline cache data structures compare to your implementation options?

Also, for the LOAD_GLOBAL example specifically, IIRC Dino's Shadowcode (#3) has some trick where a single version check validates the version of both dicts (globals and builtins).

@markshannon (Member, Author) commented Mar 25, 2021

> How do the existing inline cache data structures compare to your implementation options?

I'd describe the current cache as:
Fixed sized entries; Index by external array; Two arrays (three if you include the array of indices).

> Also, for the LOAD_GLOBAL example specifically, IIRC Dino's Shadowcode (#3) has some trick where a single version check validates the version of both dicts (globals and builtins).

AFAICT it checks both versions:
https://github.com/DinoV/cpython/blob/9b93db97597a2142bd3488369f159ac3624a9d42/Python/ceval.c#L3745

@DinoV commented May 13, 2021

> AFAICT it checks both versions:
> https://github.com/DinoV/cpython/blob/9b93db97597a2142bd3488369f159ac3624a9d42/Python/ceval.c#L3745

This is a version without dictionary watchers. If you look at the version in Cinder, we actually do LOAD_GLOBAL with a single load with an indirection and no version checks: https://github.com/facebookincubator/cinder/blob/cinder/3.8/Python/ceval.c#L4824

@markshannon (Member, Author)

To be pedantic, PyObject *v = *global_cache[(unsigned int)oparg]; is two (dependent) loads.
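
Spelled out, with global_cache as the cache array from the linked code:

    PyObject **slot = global_cache[(unsigned int)oparg];  /* load 1: fetch the slot  */
    PyObject *v = *slot;                                  /* load 2: depends on load 1 */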

@markshannon (Member, Author)

Once python/cpython#26264 is merged, we should implement the following optimization "families" (in approximate order of importance):

  • LOAD_GLOBAL (needs porting from old opcache)
  • LOAD_ATTR (needs porting from old opcache)
  • STORE_ATTR
  • CALL_FUNCTION
  • LOAD_METHOD
  • CALL_METHOD
  • BINARY_SUBSCR
  • BINARY_ADD
  • BINARY_MULTIPLY

@markshannon (Member, Author)

LOAD_GLOBAL: #51
LOAD_ATTR: #52
STORE_ATTR: #53
CALL_FUNCTION: #54

@iritkatriel (Collaborator)

See https://bugs.python.org/issue38278 regarding LOAD_METHOD.

@markshannon (Member, Author)

The framework is all implemented. I think it is better to track individual families in their own issues.
