Rework SerializableByKey handling to improve performance #6469
Conversation
In #6115 we are considering migrating to
Codecov Report

All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #6469      +/-   ##
==========================================
- Coverage   97.82%   97.82%    -0.01%
==========================================
  Files        1115     1115
  Lines       97468    97390       -78
==========================================
- Hits        95347    95268       -79
- Misses       2121     2122        +1
```

☔ View full report in Codecov by Sentry.
95-martin-orion left a comment:
Review is focused on python code, since the JSON files should be fully verified by tests.
LGTM! Happy to see this get further refined.
```diff
  if hasattr(o, '_json_dict_'):
-     return _json_dict_with_cirq_type(o)
+     json_dict = _json_dict_with_cirq_type(o)
+     if isinstance(o, SerializableByKey):
```
This shouldn't be necessary - `SerializableByKey` wraps `SupportsJSON` without modification, so the `hasattr` check above already confirms that `o` is `SerializableByKey`.

This also suggests that `SerializableByKey` can be entirely replaced by `SupportsJSON`, although that may be more effort than it's worth. I suspect that in early iterations `SerializableByKey` had some functionality, but I removed it before merging without removing the class itself.
It's true that `SerializableByKey` doesn't add any methods, but it still serves as a marker for types that want to opt in to this "deduplication" during serialization. One thing that is being added here is that we now require that `SerializableByKey` classes be hashable, which is not necessarily the case for all classes that implement `SupportsJSON`. If we want to change the opt-in mechanism that would be fine with me; I do think the name `SerializableByKey` is a bit misleading (and has been since #3673, when the `_serialization_key_` method was removed), so at the very least we might consider changing the name.
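For illustration only, an opt-in marker that also encodes the hashability requirement could look roughly like this sketch (`Deduplicable` and `GoodValue` are hypothetical names, not proposals for cirq):

```python
import abc


class Deduplicable(abc.ABC):
    """Hypothetical opt-in marker: subclasses promise to be hashable so a
    serializer can memoize instances in a dict."""

    @abc.abstractmethod
    def __hash__(self):
        ...


class GoodValue(Deduplicable):
    def __init__(self, x):
        self.x = x

    def __hash__(self):
        return hash(self.x)


# A subclass that never defines __hash__ remains abstract, so instantiating
# it raises TypeError - the hashability requirement is enforced up front.
print(isinstance(GoodValue(1), Deduplicable))
```

This is just one possible shape for the opt-in; a plain marker class with a documented hashability contract (as in this PR) avoids the extra ABC machinery.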
Good call. Renaming feels like it would be a headache to me, but given that this changes the serialization format anyways I guess now would be the time for it.
Happy to review any changes from this here or as a separate PR.
Since the `SerializableByKey` name itself no longer appears in the serialized format, renaming will be a source-only change and would not affect compatibility of deserializing stored data, so that should be low risk. Nevertheless, since naming always invites bikeshedding, I'm inclined to defer that to a future PR, so I'll leave this as-is for now :-)
```python
context_map.update({key: obj})
```

```python
class _ContextualSerialization(SupportsJSON):
```
[No change needed] From our discussion offline: I think the issue with nested objects I was thinking of was linked to this "context table" implementation I used, which did not allow nested definitions. The new design, which allows a `VAL` inside of a `VAL`, should bypass the issue.
A "contextual serialization" mechanism was added in #3601 and #3673 that allows serializing certain objects once even when they occur multiple times within a larger structure. The only place we use this mechanism in cirq itself is for `FrozenCircuit`, since a given circuit can be included as a subcircuit multiple times within a larger circuit.

On internal benchmarks, the existing contextual serialization mechanism is a major performance bottleneck; in particular, the `has_serializable_by_key` function takes a significant amount of time even when no contextual serialization is needed, because it traverses the entire object being serialized and makes lots of `isinstance` calls. The reason for these pre-checks is that the context serialization works by changing the outermost serialized object to instead be a `_ContextualSerialization` instance, which includes all `SerializableByKey` instances in an `object_dag` list created from a pre-order traversal. The current implementation also has the downside that during serialization it finds already-seen instances by traversing the `object_dag` list itself, potentially incurring quadratic overhead if the list grows large; this is necessary because `SerializableByKey` instances are not guaranteed to be hashable (though in practice the only existing case, `FrozenCircuit`, is hashable).

This mechanism can be simplified by simply keeping track of
`SerializableByKey` instances as we encounter them and associating them with keys. The first time such an instance is encountered during serialization, it is assigned a key and serialized as a "value": `{"cirq_type": "VAL", "key": key, "val": val}`, where `key` is the assigned key and `val` is the normal serialization of the object. If the same instance is encountered later during serialization, it is serialized as a "reference": `{"cirq_type": "REF", "key": key}`. We require that `SerializableByKey` instances be hashable so that the internal memoization can use a dict instead of a list, avoiding quadratic overheads from repeatedly searching for memoized instances. This would potentially allow us to use this mechanism on many more types without worrying about the runtime impact.

Note that this requires the traversal order when deserializing to be consistent with the order used when serializing, so that we never encounter a `REF` before the corresponding `VAL`. AFAICT this will always be the case, because the `json` library respects the order in which object keys appear when deserializing (functions like `json.load` can take an `object_pairs_hook` which receives key-value pairs as a list, and when these are collected in a dict and passed to `object_hook`, the python dict implementation also preserves order).

This change gives a significant improvement in serialization performance; on one internal benchmark the time spent in
`json_serialization.to_json` when profiling dropped from 719s to 39s, an 18x speedup.

There's still a bit of work to do to clean up json test cases: in a few places we already had existing
`.json_inward` tests, and I need to figure out how to preserve those as well as moving the existing `.json` tests to be `.json_inward` tests. But otherwise this is ready for review and would be good to discuss the proposed approach.
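The VAL/REF bookkeeping described above can be sketched roughly as follows. This is illustrative only: `_KeyMemo` is a hypothetical name, and `repr` stands in for the real per-object serialization.

```python
import json


class _KeyMemo:
    """Illustrative memo table: hashable objects get a key on first sight."""

    def __init__(self):
        self._keys = {}  # object -> assigned key (requires hashable objects)
        self._next_key = 1

    def serialize(self, obj):
        if obj in self._keys:
            # Seen before: emit only a reference to the earlier definition.
            return {"cirq_type": "REF", "key": self._keys[obj]}
        key = self._next_key
        self._next_key += 1
        self._keys[obj] = key
        # First sight: emit the full value, tagged with its key.
        return {"cirq_type": "VAL", "key": key, "val": repr(obj)}


memo = _KeyMemo()
shared = frozenset({1, 2})  # stand-in for a hashable object like FrozenCircuit
out = [memo.serialize(shared), memo.serialize(shared)]
print(json.dumps(out))
```

The O(1) dict lookup is what replaces the linear scan of `object_dag`, which is where the hashability requirement comes from.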
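The ordering argument for deserialization can be checked with a small standalone sketch (the `VAL`/`REF` key names follow the scheme described above, but the hook itself is illustrative, not cirq's code): `object_hook` fires for each inner object in document order, so a `VAL` is always processed before any `REF` that points back at it.

```python
import json

registry = {}  # assigned key -> previously deserialized value


def _hook(d):
    if d.get("cirq_type") == "VAL":
        registry[d["key"]] = d["val"]
        return d["val"]
    if d.get("cirq_type") == "REF":
        # Safe: the matching VAL appeared earlier in the document.
        return registry[d["key"]]
    return d


doc = '[{"cirq_type": "VAL", "key": 1, "val": "shared"}, {"cirq_type": "REF", "key": 1}]'
result = json.loads(doc, object_hook=_hook)
print(result)  # both entries resolve to the same "shared" value
```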