Closed
Description
When code objects are created by the compiler or by marshal, certain strings are interned using `PyUnicode_InternInPlace()`. But code objects obtained from the deep-freeze process do not do this.
While strings are deduped within a module, certain strings (e.g. dunders, 'name', 'self') occur frequently in different modules and thus the deep-frozen code object will use up slightly more space, and certain operations will be slightly slower (e.g. comparison of two strings that are both interned is done by pointer comparison).
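To illustrate the pointer-comparison point, here is a minimal Python-level sketch. It uses `sys.intern()` as a stand-in for the C-level interning calls; the `build()` helper is hypothetical and exists only to create strings at runtime so that the compiler does not intern them for us. The "before" result is a CPython implementation detail, not a language guarantee.

```python
import sys

def build(*parts):
    # Concatenate at runtime so the result is a fresh string object
    # rather than a constant the compiler has already interned.
    return "".join(parts)

a = build("__na", "me__")
b = build("__na", "me__")

equal_by_value = (a == b)        # equal contents
same_object_before = (a is b)    # typically False on CPython: two separate objects

# sys.intern() returns the canonical copy from the interned-string table.
ia = sys.intern(a)
ib = sys.intern(b)
same_object_after = (ia is ib)   # both names now refer to one object

print(equal_by_value, same_object_before, same_object_after)
```

Once both strings are interned, an equality check can succeed on identity alone, without comparing the characters.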
We could do a number of different things (or combine several):
- Add `PyUnicode_InternInPlace()` calls to the "get toplevel" function in each deepfrozen .c file. This would still waste the space though.
- Merge all deepfrozen files into a single file and dedupe strings when that file is written. Requires changing the deepfreeze build procedures for Win and Unix. An advantage is that we could dedupe other things (basically all constants, even bytecode) this way. (@markshannon seems to lean towards this one.)
- Give strings that are likely candidates (i.e., that look like ASCII identifiers -- see `all_name_chars()` in codeobject.c) an external name with "weak linkage" so that the linker can dedupe them. (Props to @lpereira for this one.)
- For strings that occur in the array of known `_Py_Identifier`s, replace the string with a reference into that array, for pure savings. (@ericsnowcurrently has more details; IIRC this array isn't in "main" yet.)
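To get a feel for how much cross-module duplication there is, the rough sketch below walks the code objects of two modules and lists the identifier-like strings that appear in both. It is not part of the deepfreeze tooling: `looks_like_name()` is only an approximation of `all_name_chars()`, and the two module sources and the helper names are made up for illustration.

```python
from collections import Counter

def looks_like_name(s):
    # Approximation of all_name_chars(): ASCII letters, digits and '_' only.
    return s.isascii() and all(c.isalnum() or c == "_" for c in s)

def strings_in(code):
    """Collect identifier-like strings used by a code object, recursively."""
    found = set(code.co_names) | set(code.co_varnames)
    for const in code.co_consts:
        if isinstance(const, str):
            found.add(const)
        elif hasattr(const, "co_consts"):
            # Nested code object (class body, function, ...).
            found |= strings_in(const)
    return {s for s in found if looks_like_name(s)}

# Two hypothetical modules that share only common names such as 'self'.
SRC_A = "class A:\n    def __init__(self, name):\n        self.name = name\n"
SRC_B = "class B:\n    def __init__(self, size):\n        self.size = size\n"

per_module = {
    modname: strings_in(compile(src, modname, "exec"))
    for modname, src in (("mod_a", SRC_A), ("mod_b", SRC_B))
}

# A string counted more than once occurs in more than one module,
# so a merged, deduplicated file would need to store it only once.
counts = Counter(s for strings in per_module.values() for s in strings)
shared = sorted(s for s, n in counts.items() if n > 1)
print(shared)
```

Running the same walk over the real set of frozen modules would give a rough estimate of how many strings the merge-and-dedupe or weak-linkage options could share.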
[I am not planning to attack this any time soon, so if somebody wants to tackle this, go ahead and assign to yourself.]