Move the data and control (block) stacks on the thread and remove frame objects. #31


Closed
markshannon opened this issue Mar 25, 2021 · 8 comments


@markshannon (Member)

Python, like almost all other languages, operates on a stack of call frames.
Most languages use a contiguous stack for calls because it is much more efficient.
Obviously, C and all languages that are pre-compiled use the OS/hardware-provided stack.
But even interpreted languages use a contiguous stack; Java and Lua are obvious examples.

Python should do the same. Allocating frames on the heap, "zombie" frames, and the excessive copying of arguments are all slow and unnecessary.

Implementation

Each thread needs two stacks: a data stack and a control stack. These can be contiguous or chunked. A chunked stack gives us most of the performance advantages of a contiguous one and is more flexible, but is more complex to implement.

The data stack is a (conceptually) infinite stack of PyObject *s (or PyValues, with tagging).
The control stack is a (conceptually) infinite stack of ControlBlocks.

To efficiently implement overflow checks, stacks should be power-of-2 sized and aligned on their size.

For performance, it is probably a good idea to arrange that the control stack cannot overflow unless the data stack overflows first; that way there is no need to check the control stack for overflow at all. This can be done by choosing some ratio R and ensuring that, for all code objects, (locals + 3) >= (1 + block_stack_size) * R (inflating the number of locals if necessary), and that the data stack has no more than R times as many entries as the control stack. 4 is probably a good value for R.

ControlBlocks

A control block will mostly be one of:

  • Return block, contains IP of caller.
  • Try block, contains handler and stack-depth.

Additional types of ControlBlocks can be used for transfers in and out of generators, exits from the interpreter at the C level, and other flow-control tasks.

Frame layout on data stack

Each frame consists of local variables (including cells) followed by globals, builtins, and locals, followed by the evaluation stack.

Python-to-python calls

Assuming that calls are effectively specialized (by #28), making a call will require the following operations:

  1. Make the frame-pointer point to the first argument.
  2. Shuffle the arguments into the new locals and fill in the defaults. For many calls this will be a no-op as the arguments will match the parameters.
  3. Increase the stack-pointer to point just past the new locals.

Generators and coroutines

Generators will need to contain space for their own local variables and control stack (for exceptions, not calls).
The compiler will need to be modified to:

  • Emit special opcodes for local loads and stores.
  • Make sure that the evaluation stack is empty across yields, by saving iterators and other temporaries into local variables.

Frame objects

Frame objects are widely used for debugging and introspection. So we must support them in a reasonably efficient fashion.

Upon making a call, we push a ControlBlock. This "call" block will contain a pointer (initially NULL) pointing to the frame object. Should we ever need a frame object, say from sys._getframe(), we lazily create one that points back to the control block for that frame. On exiting the function, the frame can be discarded or, if still live, the locals can be copied into the frame.

Implementing sys._getframe()

To find the nth frame we walk the control stack until we find the nth "call" block, then read the frame-object from that.
If it is NULL, we create a new one and store it into the control block.

Example control blocks:

"Call" block:

  • int frame_pointer_offset
  • int instruction_offset
  • PyObject *frame_object (usually NULL)

"Generator" block:

  • int yield_pointer_offset -- Where to jump to on yield
  • int return_pointer_offset -- Where to jump to on return

"Try" block:

  • int handler_offset
  • int stack_depth
@ericsnowcurrently (Collaborator)

This is definitely an interesting idea. I expect that this would result in significant churn in ceval.c but, assuming meaningful performance improvement and/or maintainability, that would be justifiable.

Here are some questions:

  • what sort of performance improvement would you expect?
  • would this mean that (non-arg) locals would be allocated out of the data stack rather than the current allocators? (presumably they would not participate in GC)
  • would the updated ceval.c be easier or harder to understand (and maintain)?
    • how hard would it be for a new reader of the code to reach a basic working knowledge?
    • what parts become simpler?
    • what parts become more complex?
  • what improvements does this allow us to pursue that the current approach prevents?
  • likewise, what does this prevent that the current approach allows?

Also, some bonus observations:

  • the proposed approach could be a big win if the cost of func call cleanup drops down to the nominal range for the other sections of code related to Python calls
    • (via some naive profiling I did, relative to _PyEval_EvalFrameDefault(), where the eval loop lives)
    • the setup code before entering the eval loop, as well as the code at the top and bottom of the loop are generally faster than the fastest opcode cases (e.g. ~1µs)
    • the cleanup code after the loop exits is generally slower, sometimes massively, with a wide range (e.g. 2µs - 1200µs)
  • this implies that PyInterpreterState.eval_frame (from PEP 523) will be eliminated (or perhaps it would trigger a branch somewhere that preserves the current behavior)

@gvanrossum (Collaborator) commented Mar 26, 2021 via email

@markshannon (Member, Author)

I think this is partly achievable for 3.10. See #43 for the more limited form.

@markshannon (Member, Author) commented May 19, 2021

Here's a draft write up describing the activation record stack, without frame objects:

Activation Records

Each Python function activation has an activation record.
Most activation records form a contiguous per-thread stack, but generators and coroutines have heap allocated activation records.

Layout

Each activation record consists of four sections:

  • Local variables (including arguments, cells and free variables)
  • Specials: the globals dict, builtins dict, "slow" locals dict (if any), and the code object.
  • Linkage: includes pointers to the previous activation record.
  • Evaluation stack

The linkage and specials sections are the same size for all frames, which means that pointers to the specials, the linkage, and the base of the stack are equivalent, as each can be computed from the others by adding a small constant.

There are two options for layout. Option A allows overlapping frames and minimal movement
of values in a call. Option B minimizes the amount of linkage required.

Option A:

  1. Local variables
  2. Linkage
  3. Specials
  4. Evaluation stack

This layout requires three pointers for efficient access:

  • Frame pointer -- Points to local[0]
  • Stack base pointer -- Used to access specials, linkage.
  • Stack pointer -- Points to top of evaluation stack.

Option B:

  1. Linkage
  2. Specials
  3. Local variables
  4. Evaluation stack

This layout only requires two pointers:

  • Frame pointer -- Points to local[0], also allows access to specials and linkage.
  • Stack pointer -- Points to top of evaluation stack.

Computing the base of the evaluation stack becomes a bit more expensive, but not that much.

Linkage section

The linkage section contains the above pointers for the caller's activation,
plus the return address, i.e. the instruction pointer/offset of the caller.

A pointer to the specials/linkage of the current activation will be stored in the
current CFrame. Doing so allows the full stack of activation records to be traversed from
the thread state object, at minimal runtime cost for maintaining the link.

Note:

In a contiguous stack, we would need to save one fewer register, as the top of the caller's
activation record would be the same as the base of the callee's. However, since some activation
records are kept on the heap, we cannot do this.

Specials section

The specials section contains the following pointers:

  • Globals dict
  • Builtins dict
  • Locals dict (not the "fast" locals, but the locals for eval and class creation)
  • Code object
  • Heap-allocated FrameObject for this activation record, if any.

Frame objects

To enable introspection, a heap-allocated FrameObject is lazily created when a stack frame is introspected or a traceback is created.
These FrameObjects will reference the activation record and, when that activation record goes
out of scope, copy it to the heap.

This gives the appearance of persistent, heap-allocated activation records with minimal runtime overhead.

PEP 523 -- Adding a frame evaluation API to CPython

PEP 523 requires that a FrameObject is created for every call to a Python function.
Should the eval_frame of the current interpreter be anything but the default,
a FrameObject will be created and the PEP 523-specified API used.

@gvanrossum (Collaborator)

> Computing the base of the stack becomes a bit more expensive, but not that much.

Why would you even need to know the base of the stack? Just to detect underflow?

@markshannon (Member, Author)

That should have been "evaluation stack".

@markshannon (Member, Author)

python/cpython#27077

@markshannon (Member, Author)

Done 🎉
