[CORE-15249] sr/json: Optimise is_superset to increase recursion depth#29290
Conversation
There was a problem hiding this comment.
Pull request overview
This PR optimizes JSON schema compatibility checking to support deeper schema nesting by reducing recursion depth and improving the efficiency of type checking operations. The primary changes replace recursive function calls with iterative approaches and optimize data structures used in the type comparison logic.
Changes:
- Introduced iterative schema traversal to prevent stack overflow on deeply nested schemas
- Replaced type list with bitset-based operations for more efficient type comparisons
- Refactored helper functions to reduce stack frame overhead during formatting operations
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/v/pandaproxy/schema_registry/json.cc | Converted recursive schema collection to iterative approach, replaced type list operations with bitset operations, optimized formatting helpers |
| src/v/pandaproxy/schema_registry/test/test_json_schema.cc | Added test for deeply nested schemas to validate stack overflow prevention |
| throw as_exception(invalid_schema( | ||
| fmt::format("{} has more than one combinator", pj{v}))); | ||
| throw_invalid_schema( | ||
| "schema_invalid", |
There was a problem hiding this comment.
The first argument "schema_invalid" should not be passed to throw_invalid_schema. Based on the function signature at lines 73-75, throw_invalid_schema takes a format string and arguments, but the format string here is "{} has more than one combinator" (second argument). The extra "schema_invalid" string will cause a format string mismatch. This differs from the pattern used elsewhere where throw_invalid_schema is called with only the format string and arguments.
| "schema_invalid", |
IoannisRP
left a comment
There was a problem hiding this comment.
🔥🔥🔥🔥🔥
BenPope and claude committed...
Excellent way to shift blame if things break 😆
Really nice PR! a couple of minor comments
| // Setting the limit to 66 causes corruption of the heap stack due to stack | ||
| // overflow. | ||
| constexpr int max_test_depth = 31; | ||
| constexpr int max_test_depth = 65; |
There was a problem hiding this comment.
chore:
you say 100 to commit message. Also in the messages of the previous commit it mentions much bigger depths.
For future clarity, i say just move this PR first in the commit history and increase depth to something unreasonable (like 1000), to make clear that this is just for testing purposes and it will be undone.
Instead of changing it in every PR, you can just document new limit in the comment... and then at the end, set it to whatever is needed. The limit is kinda dependent on platform, anyways.
There was a problem hiding this comment.
chore:
you say
100to commit message. Also in the messages of the previous commit it mentions much bigger depths.
I shouldn't have reordered commits! Fix incoming
For future clarity, i say just move this PR first in the commit history and increase depth to something unreasonable (like 1000), to make clear that this is just for testing purposes and it will be undone.
Instead of changing it in every PR, you can just document new limit in the comment... and then at the end, set it to whatever is needed. The limit is kinda dependent on platform, anyways.
The problem is that we never really know what the max is, the heap corruption may appear at shutdown several depths higher.
| // Setting the limit to 84 causes corruption of the heap stack due to stack | ||
| // overflow. | ||
| constexpr int max_test_depth = 83; | ||
| constexpr int max_test_depth = 96; |
There was a problem hiding this comment.
nit:
comment was not updated
| namespace { | ||
|
|
||
| // Helper to create schema errors without inlining fmt::format machinery | ||
| // into the caller's stack frame. Marked cold since errors are exceptional. |
There was a problem hiding this comment.
Marked cold since errors are exceptional
Is there something in the code that actually marks them as cold? 🤔
I was expecting to see a noinline or something like that
There was a problem hiding this comment.
Marked cold since errors are exceptional
Is there something in the code that actually marks them as cold? 🤔 I was expecting to see a
noinlineor something like that
I removed them, it made no difference.
|
I think it would also be useful to add before/after stack frame sizes on the cover (once you are ready to commit, as they might still change will working through it) |
michael-redpanda
left a comment
There was a problem hiding this comment.
dang this looks cool
| jsoncons::jsonpointer::json_pointer this_obj_ptr, | ||
| jsoncons::ojson& this_obj, | ||
| json_schema_dialect dialect) { | ||
| std::vector<collect_work_item>& work_stack, |
There was a problem hiding this comment.
Wouldn't we run into a oversized alloc issue here using a std::vector?
Of note: I've found that Claude really needs to be reminded about which containers to use
| jsoncons::jsonpointer::json_pointer obj_ptr, | ||
| jsoncons::ojson& obj, | ||
| json_schema_dialect dialect) { | ||
| std::vector<collect_work_item> work_stack; |
0b93042 to
8eb84fd
Compare
|
Changes in force-push
|
| static constexpr int max_recursion_depth{200}; | ||
| int _ref_units{max_recursion_depth}; |
There was a problem hiding this comment.
The max_recursion_depth has been increased from 8 to 200, but this constant is now misleading since the main recursion prevention mechanism is the cycle detection via visited set. Consider renaming this constant to better reflect its purpose (e.g., max_ref_resolution_depth) or documenting why this specific limit remains necessary alongside cycle detection.
| static constexpr int max_recursion_depth{200}; | |
| int _ref_units{max_recursion_depth}; | |
| // Safety cap on the depth/amount of $ref resolution. Cycle detection via | |
| // the visited set is the primary recursion guard; this remains as a | |
| // secondary limit to prevent pathological inputs from consuming | |
| // unbounded resources. | |
| static constexpr int max_ref_resolution_depth{200}; | |
| int _ref_units{max_ref_resolution_depth}; |
| }; | ||
| // enough inlined space to hold all the values of json_type | ||
| using json_type_list = absl::InlinedVector<json_type, 7>; | ||
| using json_type_list = enum_bitset<json_type, 7>; |
There was a problem hiding this comment.
The size parameter 7 is a magic number that should be replaced with a constant or derived from the enum. Consider using static_cast<std::size_t>(json_type::null) + 1 or defining a json_type_count constant to make the relationship between the enum and bitset size explicit.
| jsoncons::uri base_uri, | ||
| jsoncons::jsonpointer::json_pointer obj_ptr, | ||
| jsoncons::ojson& obj, | ||
| json_schema_dialect dialect) { |
There was a problem hiding this comment.
The variable is named work_stack but uses a chunked_vector. Consider renaming to work_queue or work_list since vectors don't necessarily convey LIFO semantics, or add a comment explaining why a vector is used for stack operations.
| json_schema_dialect dialect) { | |
| json_schema_dialect dialect) { | |
| // Use chunked_vector as an explicit LIFO stack (push_back/back/pop_back) | |
| // to avoid recursion on deeply nested schemas. |
pgellert
left a comment
There was a problem hiding this comment.
Great stuff! Can you please add the commands that you used for deriving the stack sizes / warning to the PR cover letter for future reference?
| // Not implemented - only needed for input streams | ||
| char Peek() const { return 0; } | ||
| char Take() { return 0; } | ||
| size_t Tell() const { return 0; } | ||
| char* PutBegin() { return nullptr; } | ||
| size_t PutEnd(char*) { return 0; } |
There was a problem hiding this comment.
Should these throw unimplemented / assert unreachable?
There was a problem hiding this comment.
Bump on this. Mainly to:
- Confirm these are not accidentally used now
- Ensure they are not accidentally used in the future
- Show that they are unused if we need to trace down bugs in the future in this area
Add a parameterized test that validates stack frame size optimizations by testing compatibility checking on deeply nested JSON schemas. The test creates schemas with nested object properties and measures the maximum depth achievable before stack overflow. Current baseline with optimizations: depth 70 (with jsoncons validation disabled and max_recursion_depth=100). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Temporarily disable jsoncons metaschema validation and increase max_recursion_depth to enable stack depth testing. This allows the test_object_recursion_depths test to measure the actual stack limits of the is_superset code path without being limited by jsoncons. TODO: Re-enable validation once stack optimizations are complete. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Stack frame size changes: - ::(anonymous namespace)::is_superset: 1464 -> 1368 (-96 bytes) Signed-off-by: Ben Pope <ben@redpanda.com>
Two improvements from passing std::filesystem::path by const reference: is_superset: 1368 → 1080 (-288 bytes) is_object_properties_superset: 712 → 648 (-64 bytes) Signed-off-by: Ben Pope <ben@redpanda.com>
This allows processing deeply nested schemas without stack overflow. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…_value_superset Pass json_incompatibility_type and path separately instead of pre-constructed json_incompatibility objects. The path is only appended with prop_name when an error actually occurs, avoiding temporary filesystem::path construction in the caller's stack frame. Stack frame size changes: - is_numeric_superset: 1032 -> 680 (-352 bytes) - is_array_superset: 856 -> 616 (-240 bytes) - is_object_superset: 1032 -> 792 (-240 bytes) - is_string_superset: 664 -> <512 (no longer triggers warning) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
8eb84fd to
82ed17a
Compare
|
Changes in force-push
|
IoannisRP
left a comment
There was a problem hiding this comment.
My only concern is that I don't see any tests for
sr/json: Add cycle detection
Do we have any existing tests that cover this?
| template<typename... Args> | ||
| error_info | ||
| make_invalid_schema(fmt::format_string<Args...> fmt, Args&&... args) { | ||
| return error_info{ | ||
| error_code::schema_invalid, | ||
| fmt::format(fmt, std::forward<Args>(args)...)}; | ||
| } | ||
|
|
||
| template<typename... Args> | ||
| [[noreturn]] void | ||
| throw_invalid_schema(fmt::format_string<Args...> fmt, Args&&... args) { | ||
| throw as_exception(make_invalid_schema(fmt, std::forward<Args>(args)...)); | ||
| } |
There was a problem hiding this comment.
nit:
maybe add a noinline attribute on these for future-proofing the not-inlined behavior
| template<typename... Args> | |
| error_info | |
| make_invalid_schema(fmt::format_string<Args...> fmt, Args&&... args) { | |
| return error_info{ | |
| error_code::schema_invalid, | |
| fmt::format(fmt, std::forward<Args>(args)...)}; | |
| } | |
| template<typename... Args> | |
| [[noreturn]] void | |
| throw_invalid_schema(fmt::format_string<Args...> fmt, Args&&... args) { | |
| throw as_exception(make_invalid_schema(fmt, std::forward<Args>(args)...)); | |
| } | |
| template<typename... Args> | |
| [[clang::noinline]] error_info | |
| make_invalid_schema(fmt::format_string<Args...> fmt, Args&&... args) { | |
| return error_info{ | |
| error_code::schema_invalid, | |
| fmt::format(fmt, std::forward<Args>(args)...)}; | |
| } | |
| template<typename... Args> | |
| [[noreturn]][[clang::noinline]] void | |
| throw_invalid_schema(fmt::format_string<Args...> fmt, Args&&... args) { | |
| throw as_exception(make_invalid_schema(fmt, std::forward<Args>(args)...)); | |
| } |
There was a problem hiding this comment.
It's not obvious that the noinline helps - I'm sure I tested with and without, and there was no difference.
There was a problem hiding this comment.
the only reason for these functions to help is for them to not be inlined, right?
If so, the tag indeed would not "help". But what is inlined might change as we change compiler versions. Adding the tag makes it explicit and documents the desired behavior.
| // overflow, which manifests as a crash during Seastar shutdown. | ||
| constexpr int max_test_depth = 129; | ||
| // Note: jsoncons validation overflows the stack at about 31. | ||
| // With validation disabled, setting the limit above ~139 causes corruption |
There was a problem hiding this comment.
130 became 139 in comment. Was this on purpose?
There was a problem hiding this comment.
130became139in comment. Was this on purpose?
No
The commit changes a test for cycles to now succeed rather than fail, and adds a comment along those lines. |
| // Helper to create schema errors without inlining fmt::format machinery | ||
| // into the caller's stack frame. | ||
| template<typename... Args> | ||
| error_info | ||
| make_invalid_schema(fmt::format_string<Args...> fmt, Args&&... args) { |
There was a problem hiding this comment.
Does this need a [[noinline]]? Otherwise, I'd guess whether this + fmt::format gets inlined only depends on how aggressively the compiler is trying to optimize.
There was a problem hiding this comment.
I tried with and without and there was no change to the frame size
| // Not implemented - only needed for input streams | ||
| char Peek() const { return 0; } | ||
| char Take() { return 0; } | ||
| size_t Tell() const { return 0; } | ||
| char* PutBegin() { return nullptr; } | ||
| size_t PutEnd(char*) { return 0; } |
There was a problem hiding this comment.
Bump on this. Mainly to:
- Confirm these are not accidentally used now
- Ensure they are not accidentally used in the future
- Show that they are unused if we need to trace down bugs in the future in this area
Add throw_as_exception helper to move fmt::format machinery out of callers' stack frames. Replace pj/pjp ostream formatters with format_to() methods that write directly to fmt::iterator. Stack frame size changes: - get_schema: 712 -> 600 (-112 bytes) - is_numeric_superset: 712 -> 680 (-32 bytes) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This is the limit before a test fails. Signed-off-by: Ben Pope <ben@redpanda.com>
Signed-off-by: Ben Pope <ben@redpanda.com>
Signed-off-by: Ben Pope <ben@redpanda.com>
Track visited (older, newer) schema pairs for cycle detection. If we encounter the same pair twice during traversal, we've found a cycle and can return early (compatible by assumption - cycles that differ would have been caught on first visit). Signed-off-by: Ben Pope <ben@redpanda.com>
82ed17a to
86c01d6
Compare
Retry command for Build#79249please wait until all jobs are finished before running the slash command |
|
/ci-repeat 1 |
|
/backport v25.3.x |
|
Failed to create a backport PR to v25.3.x branch. I tried: |
Optimise compatibility checking for json to allow more nested schemas.
The depth was tuned by adding these options to the compiler to determine the stack frame sizes:
Backports Required
Release Notes
Improvements