Move the data and control (block) stacks on the thread and remove frame objects. #31


Closed
markshannon opened this issue Mar 25, 2021 · 8 comments


@markshannon (Member)

Python, like almost all other languages, operates on a stack of call frames.
Most languages use a contiguous stack for calls because it is much more efficient.
Obviously, C and all languages that are pre-compiled use the OS/hardware-provided stack.
But even interpreted languages use a contiguous stack; Java and Lua are obvious examples.

Python should do the same. Allocating frames on the heap, "zombie" frames, and the excessive copying of arguments are all slow and unnecessary.

Implementation

Each thread needs two stacks: a data stack and a control stack. These can be contiguous or chunked. A chunked stack gives us most of the performance advantages of a contiguous one and is more flexible, but is more complex to implement.

The data stack is a (conceptually) infinite stack of PyObject *s (or PyValues, with tagging).
The control stack is a (conceptually) infinite stack of ControlBlocks.

To efficiently implement overflow checks, stacks should be power-of-2 sized and aligned on their size.

For performance, it is probably a good idea to arrange that the control stack cannot overflow unless the data stack overflows first; that way there is no need to check the control stack for overflow at all. This can be done by choosing some ratio R and ensuring that, for all code objects, (locals + 3) >= (1 + block_stack_size) * R (inflating the number of locals if necessary), and that the data stack has no more than R times as many entries as the control stack. 4 is probably a good value for R.

ControlBlocks

A control block will mostly be one of:

  • Return block, contains IP of caller.
  • Try block, contains handler and stack-depth.

Additional types of ControlBlocks can be used for transfers in and out of generators, exits from the interpreter at the C level, and other flow-control tasks.

Frame layout on data stack

Each frame consists of local variables (including cells) followed by globals, builtins, and locals, followed by the evaluation stack.

Python-to-python calls

Assuming that calls are effectively specialized (by #28), making a call will require the following operations:

  1. Make the frame-pointer point to the first argument.
  2. Shuffle the arguments into the new locals and fill in the defaults. For many calls this will be a no-op as the arguments will match the parameters.
  3. Increase the stack-pointer to point just past the new locals.

Generators and coroutines

Generators will need to contain space for their own local variables and control stack (for exceptions, not calls).
The compiler will need to be modified to:

  • Emit special opcodes for local loads and stores.
  • Make sure that the evaluation stack is empty across yields, by saving iterators and other temporaries into local variables.

Frame objects

Frame objects are widely used for debugging and introspection. So we must support them in a reasonably efficient fashion.

Upon making a call, we push a ControlBlock. This "call" block will contain a pointer (initially NULL) pointing to the frame object. Should we ever need a frame object, say from sys._getframe(), we lazily create one that points back to the control block for that frame. On exiting the function, the frame can be discarded or, if still live, the locals can be copied into the frame.

Implementing sys._getframe()

To find the nth frame we walk the control stack until we find the nth "call" block, then read the frame-object from that.
If it is NULL, we create a new one and store it into the control block.

Example control blocks:

"Call" block:

  • int frame_pointer_offset
  • int instruction_offset
  • PyObject *frame_object (usually NULL)

"Generator" block:

  • int yield_pointer_offset -- Where to jump to on yield
  • int return_pointer_offset -- Where to jump to on return

"Try" block:

  • int handler_offset
  • int stack_depth
@ericsnowcurrently (Collaborator)

This is definitely an interesting idea. I expect that this would result in significant churn in ceval.c but, assuming meaningful performance improvement and/or maintainability, that would be justifiable.

Here are some questions:

  • what sort of performance improvement would you expect?
  • would this mean that (non-arg) locals would be allocated out of the data stack rather than the current allocators? (presumably they would not participate in GC)
  • would the updated ceval.c be easier or harder to understand (and maintain)?
    • how hard would it be for a new reader of the code to reach a basic working knowledge?
    • what parts become simpler?
    • what parts become more complex?
  • what improvements does this allow us to pursue that the current approach prevents?
  • likewise, what does this prevent that the current approach allows?

Also, some bonus observations:

  • the proposed approach could be a big win if the cost of func call cleanup drops down to the nominal range for the other sections of code related to Python calls
    • (via some naive profiling I did, relative to _PyEval_EvalFrameDefault(), where the eval loop lives)
    • the setup code before entering the eval loop, as well as the code at the top and bottom of the loop are generally faster than the fastest opcode cases (e.g. ~1µs)
    • the cleanup code after the loop exits is generally slower, sometimes massively, with a wide range (e.g. 2µs - 1200µs)
  • this implies that PyInterpreterState.eval_frame (from PEP 523) will be eliminated (or perhaps it would trigger a branch somewhere that preserves the current behavior)

@gvanrossum (Collaborator) commented Mar 26, 2021 via email

@markshannon (Member, Author)

I think this is partly achievable for 3.10. See #43 for the more limited form.

@markshannon (Member, Author) commented May 19, 2021

Here's a draft write up describing the activation record stack, without frame objects:

Activation Records

Each Python function activation has an activation record.
Most activation records form a contiguous per-thread stack, but generators and coroutines have heap allocated activation records.

Layout

Each activation record consists of four sections:

  • Local variables (including arguments, cells and free variables)
  • Specials: the globals dict, builtins dict, "slow" locals dict (if any), and the code object.
  • Linkage: includes pointers to the previous activation record.
  • Evaluation stack

The linkage and specials sections are the same size for all frames, which means that pointers to the specials, the linkage, and the base of the stack are equivalent, as each can be computed from the others by adding a small constant.

There are two options for layout. Option A allows overlapping frames and minimal movement
of values in a call. Option B minimizes the amount of linkage required.

Option A:

  1. Local variables
  2. Linkage
  3. Specials
  4. Evaluation stack

This layout requires three pointers for efficient access:

  • Frame pointer -- Points to local[0]
  • Stack base pointer -- Used to access specials, linkage.
  • Stack pointer -- Points to top of evaluation stack.

Option B:

  1. Linkage
  2. Specials
  3. Local variables
  4. Evaluation stack

This layout only requires two pointers:

  • Frame pointer -- Points to local[0], also allows access to specials and linkage.
  • Stack pointer -- Points to top of evaluation stack.

Computing the base of the evaluation stack becomes a bit more expensive, but not that much.

Linkage section

The linkage section contains the above pointers for the caller's activation,
plus the return address, i.e. the instruction pointer/offset of the caller.

A pointer to the specials/linkage of the current activation will be stored in the
current CFrame. Doing so allows the full stack of activation records to be traversed from
the thread state object, at minimal runtime cost for maintaining the link.

Note:

In a contiguous stack, we would need to save one fewer register, as the top of the caller's
activation record would be the same as the base of the callee's. However, since some activation
records are kept on the heap, we cannot do this.

Specials section

The specials section contains the following pointers:

  • Globals dict
  • Builtins dict
  • Locals dict (not the "fast" locals, but the locals for eval and class creation)
  • Code object
  • Heap-allocated FrameObject for this activation record, if any.

Frame objects

To enable introspection, a heap-allocated FrameObject is lazily created when a stack frame is introspected or a traceback is created.
These FrameObjects will reference the activation record and, when that activation record goes
out of scope, copy it to the heap.

This gives the appearance of persistent, heap-allocated activation records with minimal runtime overhead.

PEP 523 -- Adding a frame evaluation API to CPython

PEP 523 requires that a FrameObject is created for every call to a Python function.
Should the eval_frame of the current interpreter be anything but the default,
a FrameObject will be created and the PEP 523-specified API used.

@gvanrossum (Collaborator)

> Computing the base of the stack becomes a bit more expensive, but not that much.

Why would you even need to know the base of the stack? Just to detect underflow?

@markshannon (Member, Author)

That should have been "evaluation stack".

@markshannon (Member, Author)

python/cpython#27077

@markshannon (Member, Author)

Done 🎉
