Look into string sharing for deep-frozen code #218

Closed
@gvanrossum

Description

When code objects are created by the compiler or by marshal, certain strings are interned using PyUnicode_InternInPlace(). Code objects obtained from the deep-freeze process skip this step, so their strings are never interned.

Strings are deduped within a module, but certain strings (e.g. dunders, 'name', 'self') occur frequently across different modules. As a result, the deep-frozen code objects use up slightly more space, and certain operations are slightly slower (e.g. comparing two strings that are both interned is done by pointer comparison, a shortcut that is unavailable when the strings are not interned).
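To make the interning point concrete, here is a minimal Python-level sketch. It uses `sys.intern()`, which is the Python-visible counterpart of the C-level interning mentioned above; the helper name `build` is made up for this illustration, and building the strings at runtime is just a way to avoid the compiler interning the literals for us.

```python
import sys

def build(parts):
    # Hypothetical helper: assemble the string at runtime so we get
    # a fresh str object rather than a compile-time constant.
    return "".join(parts)

a = build(["se", "lf"])
b = build(["se", "lf"])

# Equal by value; whether they are the same object is an
# implementation detail, so we do not rely on it here.
equal_by_value = (a == b)

# After interning, both names refer to the single canonical object,
# so an identity (pointer) check is enough to establish equality.
ia = sys.intern(a)
ib = sys.intern(b)
same_object = ia is ib

print(equal_by_value, same_object)
```

Strings in deep-frozen code objects that never go through interning cannot take this identity shortcut, which is the slowdown the description refers to.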

We could do a number of different things (or combine several):

  • Add PyUnicode_InternInPlace() calls to the "get toplevel" function in each deepfrozen .c file. This would still waste the space though.
  • Merge all deepfrozen files into a single file and dedupe strings when that file is written. This requires changing the deepfreeze build procedures for Windows and Unix. An advantage is that we could dedupe other things (basically all constants, even bytecode) this way. (@markshannon seems to lean towards this one.)
  • Give strings that are likely candidates (i.e., that look like ASCII identifiers -- see all_name_chars() in codeobject.c) an external name with "weak linkage" so that the linker can dedupe them. (Props to @lpereira for this one.)
  • For strings that occur in the array of known _Py_Identifiers, replace the string with a reference into that array, for pure savings. (@ericsnowcurrently has more details; IIRC this array isn't in "main" yet.)
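As a rough illustration of the "merge and dedupe" idea, the sketch below builds one shared string table from per-module string lists and counts how many copies would be saved. It is a toy model only, not the deepfreeze tool: the module names, the sample strings, and both function names are made up for this example. The identifier filter is an approximation of what all_name_chars() in codeobject.c checks (ASCII letters, digits, and underscore), shown here because the weak-linkage option would apply the same kind of test.

```python
def looks_like_ascii_identifier(s):
    # Approximation (an assumption) of the all_name_chars() check:
    # ASCII only, and every character is a letter, digit, or underscore.
    return s.isascii() and all(c.isalnum() or c == "_" for c in s)

def dedupe_across_modules(module_strings):
    """Build one shared table from {module: [strings]}.

    Returns (table, refs), where refs maps each module to a list of
    indexes into table. Strings are assumed to be already deduped
    within each module.
    """
    index_of = {}  # string -> position in the shared table
    refs = {}
    for mod, strings in module_strings.items():
        refs[mod] = [index_of.setdefault(s, len(index_of)) for s in strings]
    return list(index_of), refs

# Hypothetical per-module string constants.
modules = {
    "mod_a": ["self", "name", "__init__", "path"],
    "mod_b": ["self", "name", "__init__", "read"],
}

table, refs = dedupe_across_modules(modules)
copies_before = sum(len(v) for v in modules.values())
saved = copies_before - len(table)
candidates = [s for s in table if looks_like_ascii_identifier(s)]

print(table, refs, saved)
```

In this toy input, the three strings shared by both modules are stored once instead of twice. A real implementation would have to emit C data and handle the build-system changes mentioned in the list.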

[I am not planning to attack this any time soon, so if somebody wants to tackle this, go ahead and assign to yourself.]
