
Adaptive, specializing interpreter #28

Closed

markshannon opened this issue Mar 24, 2021 · 8 comments

@markshannon (Member)

This explains the adaptive, specializing interpreter described as the core of stage 1 in https://github.com/markshannon/faster-cpython/blob/master/plan.md
This requires and builds on the quickening interpreter #27.

Overview

We exploit the quickening interpreter's support for self-modifying code to add specialization "families" of bytecodes for those bytecodes that would gain the most from specialization. The bytecodes within a "family" can replace themselves with other members of the family at runtime. This allows us to perform quite aggressive speculative specialization, since de-optimization is cheap.

For each "plain" bytecode that would benefit from specialization, we add an "adaptive" bytecode and several specialized forms.
During the quickening phase, the "plain" bytecodes are replaced with the "adaptive" bytecode. At runtime, the "adaptive" bytecode maintains a counter; when that counter reaches zero, it attempts to specialize itself. Otherwise, it jumps to the "plain" bytecode to perform the operation. Should specialization fail, the counter is set to some appropriate number, say 20.

The adaptive forms are coded roughly as:

if counter == 0:
    nexti -= 1  # Dispatch will jump to the specialized version
    specialize(instructions[nexti])
else:
    counter -= 1
    goto PLAIN_BYTECODE

The specialized forms are coded roughly as:

if predicate(obj):
    saturating_increment(counter)
    do_specialized_operation()
else:
    saturating_decrement(counter)
    if counter == 0:
        replace_self_with_adaptive_form()
    goto PLAIN_BYTECODE
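
To make the rewriting concrete, here is a self-contained toy interpreter (purely illustrative, not CPython code; the ADD family, the operand stack, and the always-succeeding guard are invented for the example):

    #include <stdio.h>

    enum { PUSH_CONST, ADD, ADD_ADAPTIVE, ADD_INT, HALT };

    int main(void) {
        /* "Quickened" code: the plain ADD has already been replaced
           by its adaptive form. */
        unsigned char code[] = { PUSH_CONST, 2, PUSH_CONST, 3, ADD_ADAPTIVE, HALT };
        int stack[8], sp = 0, pc = 0, counter = 0;

        for (;;) {
            switch (code[pc]) {
            case PUSH_CONST:
                stack[sp++] = code[pc + 1];
                pc += 2;
                break;
            case ADD_ADAPTIVE:
                if (counter == 0) {
                    /* A real implementation would inspect the operands here;
                       this toy always succeeds and rewrites itself. */
                    code[pc] = ADD_INT;
                    continue;   /* re-dispatch; now hits the specialized form */
                }
                counter--;
                /* fall through to the generic implementation */
            case ADD:
                stack[sp - 2] += stack[sp - 1];
                sp--;
                pc += 1;
                break;
            case ADD_INT:
                /* A real guard would check the operand types and, on failure,
                   decrement a counter and fall back to the ADD implementation. */
                stack[sp - 2] += stack[sp - 1];
                sp--;
                pc += 1;
                break;
            case HALT:
                printf("result: %d\n", stack[sp - 1]);  /* prints 5 */
                return 0;
            }
        }
    }

The key point is that de-optimization is just another store to code[pc], which is why speculation can afford to be aggressive.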

Equivalence to a specializing, optimizing VM where the region of optimization is one bytecode

Specializing, optimizing VMs work by selecting regions of code to optimize and compile.
V8 optimizes whole functions. PyPy and LuaJIT optimize loopy traces. HHVM optimizes tracelets, then multiple tracelets stitched together.

By specializing just one bytecode at a time, we do not need JIT compilation: the specializations can be compiled ahead of time by implementing them in C. Obviously this will not perform as well as optimizing larger regions, but we can still achieve worthwhile speedups.

Example

Using LOAD_GLOBAL as an example, we add three new bytecodes: LOAD_GLOBAL_ADAPTIVE, LOAD_GLOBAL_MODULE, LOAD_GLOBAL_BUILTIN.

LOAD_GLOBAL_ADAPTIVE is implemented as above.

LOAD_GLOBAL_MODULE is a specialization for when the value is stored in the module's dictionary.

    if (UNLIKELY(globals->version != EXPECTED_VERSION)) {
        goto maybe_deopt_load_global_module; // Handles counter, etc.
    }
    saturating_increment(counter);
    INCREF(cached_value);
    PUSH(cached_value);

LOAD_GLOBAL_BUILTIN is a specialization for when the value is stored in the builtins' dictionary.

    if (UNLIKELY(globals->version != EXPECTED_VERSION1)) {
        goto maybe_deopt_load_global_builtin; // Handles counter, etc.
    }
    if (UNLIKELY(builtins->version != EXPECTED_VERSION2)) {
        goto maybe_deopt_load_global_builtin;
    }
    saturating_increment(counter);
    INCREF(cached_value);
    PUSH(cached_value);

Specialization of LOAD_GLOBAL_ADAPTIVE determines whether the value is in the module's dictionary or the builtins' dictionary, records the expected version number(s) and the expected value, then converts the bytecode to LOAD_GLOBAL_MODULE or LOAD_GLOBAL_BUILTIN.
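
For illustration, the specialization step might look roughly like this (a sketch: the Instruction and DataItem types, the DICT_VERSION accessor, and the cache layout are assumptions, not the actual CPython structures):

    static int
    specialize_load_global(PyObject *globals, PyObject *builtins,
                           PyObject *name, Instruction *instr, DataItem *cache)
    {
        PyObject *value = PyDict_GetItemWithError(globals, name);
        if (value != NULL) {
            cache[0].version = DICT_VERSION(globals);  /* hypothetical accessor */
            cache[1].cached_value = value;
            instr->opcode = LOAD_GLOBAL_MODULE;
            return 0;
        }
        if (PyErr_Occurred()) {
            return -1;  /* a real error, not just a missing key */
        }
        value = PyDict_GetItemWithError(builtins, name);
        if (value != NULL) {
            cache[0].version = DICT_VERSION(globals);   /* EXPECTED_VERSION1 */
            cache[1].version = DICT_VERSION(builtins);  /* EXPECTED_VERSION2 */
            cache[2].cached_value = value;
            instr->opcode = LOAD_GLOBAL_BUILTIN;
            return 0;
        }
        return -1;  /* failed; the adaptive form resets its counter (say, to 20) */
    }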

Note:

The above uses dictionary versions. It is possible to use the dictionary keys to allow a specialization that works when a global variable is present, but that is out of scope for this issue.

Implementation

Alongside the array of quickened instructions we need a supporting data structure holding counts, expected values and versions, original operands, and cached values.
During quickening, memory for the specialized instructions must be allocated in such a way that it can be accessed quickly during execution.
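
As a concrete (assumed, not actual) shape for that data, each entry could be built from 8-byte items along these lines:

    #include <Python.h>   /* PyObject */
    #include <stdint.h>

    typedef union {
        struct {
            uint16_t counter;          /* adaptive/deopt counter         */
            uint8_t  original_opcode;  /* for restoring the plain form   */
            uint8_t  original_oparg;
        } adaptive;
        uint64_t  version;             /* expected dict/type version tag */
        PyObject *cached_value;        /* e.g. a cached global's value   */
    } DataItem;                        /* 8 bytes on a 64-bit machine    */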

Possible implementations

Entries:

  1. Fixed-size entries: all bytecodes use the same size of data structure. This is inflexible but simple.
  2. Variable-sized entries: flexible but more complex. Note that entries can vary in size across families, not within them; the size of the data structure is fixed for a given family of bytecodes.

Indexing:

  1. Index by external array. Maps instruction offset to offset in the data array. Slow, as it requires an extra indirection.
  2. Index stored as oparg. Fast, but limits the number of entries to 256, or fewer if variable-sized entries are used.
  3. Index as a function of instruction offset and oparg. A bit slower than 2, but allows an effectively unlimited number of entries.

Layout:

  1. Two arrays. Requires two registers for indexing implementation 2, but three registers for indexing implementation 3.
  2. Back-to-back arrays. Instructions go forwards in memory from a shared pointer, data goes backwards. Requires two registers for indexing implementation 3.

Preferred implementation

The best approach (fastest in general, though not the simplest) is to allow variable-sized entries, indexed by a function of offset and oparg, using back-to-back arrays.
Getting the data for a bytecode would look something like:

DataItem *data = &((DataItem *)instruction_base)[-(nexti>>1) - oparg];

where nexti is the index of the next instruction, assuming that, on average, each instruction needs half a data item (approximately 25% of instructions need data, and those need two items on average).

Indexing 3 should be faster than indexing 2 when combined with back-to-back arrays and variable-sized entries, because it reduces register use and is more compact. The additional ALU operation(s) required shouldn't matter too much.

On a 64-bit machine, a DataItem would be 8 bytes.
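
Setting up the back-to-back arrays could then look something like this (a sketch under the assumptions above, reusing the DataItem union sketched earlier and an invented 2-byte Instruction type):

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct {
        uint8_t opcode;
        uint8_t oparg;
    } Instruction;

    /* One allocation: data items grow backwards from instruction_base,
       quickened instructions grow forwards from it. */
    static Instruction *
    allocate_quickened(size_t n_instructions, size_t n_data_items)
    {
        size_t data_bytes = n_data_items * sizeof(DataItem);
        char *block = malloc(data_bytes + n_instructions * sizeof(Instruction));
        if (block == NULL) {
            return NULL;
        }
        return (Instruction *)(block + data_bytes);  /* instruction_base */
    }

Freeing would pass instruction_base minus data_bytes back to free(); during execution, the data is reached with the negative-index formula above, so no second base pointer is needed.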

@gvanrossum (Collaborator)

How do the existing inline cache data structures compare to your implementation options?

Also, for the LOAD_GLOBAL example specifically, IIRC Dino's Shadowcode (#3) has some trick where a single version check validates the version of both dicts (globals and builtins).

@markshannon (Member, Author) commented Mar 25, 2021

> How do the existing inline cache data structures compare to your implementation options?

I'd describe the current cache as:
Fixed sized entries; Index by external array; Two arrays (three if you include the array of indices).

> Also, for the LOAD_GLOBAL example specifically, IIRC Dino's Shadowcode (#3) has some trick where a single version check validates the version of both dicts (globals and builtins).

AFAICT it checks both versions:
https://github.com/DinoV/cpython/blob/9b93db97597a2142bd3488369f159ac3624a9d42/Python/ceval.c#L3745

@DinoV commented May 13, 2021

> AFAICT it checks both versions:
> https://github.com/DinoV/cpython/blob/9b93db97597a2142bd3488369f159ac3624a9d42/Python/ceval.c#L3745

This is a version without dictionary watchers. If you look at the version in Cinder, we actually do LOAD_GLOBAL with a single load with an indirection and no version checks: https://github.com/facebookincubator/cinder/blob/cinder/3.8/Python/ceval.c#L4824

@markshannon (Member, Author)

To be pedantic, PyObject *v = *global_cache[(unsigned int)oparg]; is two (dependent) loads.
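
Spelled out, with global_cache as the cache array from the linked code:

    PyObject **slot = global_cache[(unsigned int)oparg];  /* load 1: fetch the slot  */
    PyObject *v = *slot;                                  /* load 2: depends on load 1 */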

@markshannon (Member, Author)

Once python/cpython#26264 is merged, we should implement the following optimization "families" (in approximate order of importance):

  • LOAD_GLOBAL (needs porting from old opcache)
  • LOAD_ATTR (needs porting from old opcache)
  • STORE_ATTR
  • CALL_FUNCTION
  • LOAD_METHOD
  • CALL_METHOD
  • BINARY_SUBSCR
  • BINARY_ADD
  • BINARY_MULTIPLY

@markshannon (Member, Author)

LOAD_GLOBAL: #51
LOAD_ATTR: #52
STORE_ATTR: #53
CALL_FUNCTION: #54

@iritkatriel (Collaborator)

See https://bugs.python.org/issue38278 regarding LOAD_METHOD.

@markshannon (Member, Author)

The framework is all implemented. I think it is better to track individual families in their own issues.
