
schema_registry/test: lower json recursion depth locally #29546

Merged
pgellert merged 2 commits into redpanda-data:dev from pgellert:fix/local-json-recursion-limit
Feb 6, 2026

@pgellert pgellert commented Feb 5, 2026

Contributor

Locally the test is failing with a stack overflow at a lower recursion limit, likely because of machine/OS/build-type differences.

So, lower the limit to ensure that tests pass locally, while still keeping the limit as is for CI to pick up any regressions.

Related to: #29290

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

  • none

Locally the test is failing at a lower recursion limit, likely because
of machine/OS/build-type differences.

So, lower the limit to ensure that tests pass locally, while still
keeping the limit as is for CI to pick up any regressions.
@pgellert pgellert requested a review from a team February 5, 2026 18:31
@pgellert pgellert self-assigned this Feb 5, 2026
@pgellert pgellert requested review from IoannisRP, Copilot and nguyen-andrew and removed request for a team February 5, 2026 18:31

Copilot AI left a comment

Contributor

Pull request overview

This PR refactors the CI environment detection logic by moving it from a local helper function in security/tests/license_utils.h to a shared utility function in test_utils/test_env. The new shared function is then used to adjust the JSON schema recursion depth test limits based on whether the tests are running in CI or locally, addressing test failures on local development machines.

Changes:

  • Added a shared is_on_ci() function in test_utils/test_env to detect CI environment
  • Refactored existing CI detection code to use the new shared utility
  • Adjusted JSON schema recursion depth test to use lower limits locally (17) while maintaining higher limits (30) in CI

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/v/test_utils/test_env.h | Declares new is_on_ci() function |
| src/v/test_utils/test_env.cc | Implements CI detection using environment variable check |
| src/v/test_utils/BUILD | Adds abseil strings dependency for case-insensitive comparison |
| src/v/security/tests/license_utils.h | Removes local is_on_ci() implementation and uses shared version |
| src/v/security/tests/BUILD | Updates dependency from abseil strings to test_env |
| src/v/pandaproxy/schema_registry/test/test_json_schema.cc | Changes max_test_depth from constant to runtime value based on CI detection |
| src/v/pandaproxy/schema_registry/test/BUILD | Adds test_env dependency |
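
For readers without the diff at hand, the shared helper described in the summary could look roughly like the sketch below. This is not the code from test_utils/test_env.cc: the environment variable name `CI`, the rule that only the value "true" counts, and the helper names `equals_ignore_case` and `is_truthy_ci_value` are all assumptions made for illustration. The standard-library comparison stands in for the abseil case-insensitive comparison mentioned in the table.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <string_view>

// Case-insensitive ASCII comparison. Stand-in for the abseil helper the PR
// summary mentions; written with the standard library to stay self-contained.
inline bool equals_ignore_case(std::string_view a, std::string_view b) {
    return a.size() == b.size()
           && std::equal(a.begin(), a.end(), b.begin(), [](char x, char y) {
                  return std::tolower(static_cast<unsigned char>(x))
                         == std::tolower(static_cast<unsigned char>(y));
              });
}

// Hypothetical pure helper: decides whether an environment value means
// "running on CI". Assumption: only "true" (any letter case) counts.
inline bool is_truthy_ci_value(const char* value) {
    return value != nullptr && equals_ignore_case(value, "true");
}

// Sketch of the shared check. The variable name "CI" is an assumption,
// not taken from the PR.
inline bool is_on_ci() { return is_truthy_ci_value(std::getenv("CI")); }
```

Splitting the value check from the `std::getenv` call keeps the parsing rule testable without mutating the process environment.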

@@ -2376,7 +2377,7 @@ SEASTAR_THREAD_TEST_CASE(test_object_recursion_depths) {
// With validation disabled, setting the limit above ~130 causes corruption
// of the heap due to stack overflow, which typically manifests as a crash
// during Seastar shutdown, or during is_superset.

Copilot AI Feb 5, 2026

The change from constexpr int to const int and the different values (30 vs 17) warrant a comment explaining why the local environment requires a lower recursion depth. This would help future maintainers understand the reasoning behind these specific values and the platform-dependent behavior.

Suggested change
// during Seastar shutdown, or during is_superset.
// during Seastar shutdown, or during is_superset.
// CI builds typically have a larger effective stack (and different build
// settings) than local developer runs, so they can safely test up to depth
// 30. Local environments have been observed to overflow the stack at lower
// depths (e.g. with sanitizers or smaller thread stacks), so we cap them at
// 17 to avoid heap/stack corruption. These values are empirical and
// platform-dependent; adjust only with care.
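
The constexpr-to-const change discussed in this comment follows from the limit now depending on a runtime check. A minimal sketch of that selection, assuming the depths 30 (CI) and 17 (local) quoted in the overview; the function name `select_max_test_depth` is hypothetical, and the real test may compute the value inline:

```cpp
// Hypothetical helper: picks the recursion depth exercised by the test.
// The values 30 and 17 come from the PR overview; the function itself is
// illustrative only.
inline int select_max_test_depth(bool on_ci) {
    constexpr int ci_depth = 30;    // kept high so CI still catches regressions
    constexpr int local_depth = 17; // lowered to avoid local stack overflows
    return on_ci ? ci_depth : local_depth;
}
```

Because the argument is only known when the test runs, a variable initialized from this call can be `const` but not `constexpr`.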

@vbotbuildovich

Collaborator

Retry command for Build#80231

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/random_node_operations_smoke_test.py::RedpandaNodeOperationsSmokeTest.test_node_ops_smoke_test@{"cloud_storage_type":1,"mixed_versions":false}

@vbotbuildovich

Collaborator

CI test results

test results on build#80231
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RedpandaNodeOperationsSmokeTest | test_node_ops_smoke_test | {"cloud_storage_type": 1, "mixed_versions": false} | integration | https://buildkite.com/redpanda/redpanda/builds/80231#019c2f23-6dd4-4e69-b5fb-ec8d1ee94d6a | FLAKY | 8/11 | Test FAILS after retries. Significant increase in flaky rate(baseline=0.0110, p0=0.0052, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test |
| WriteCachingFailureInjectionE2ETest | test_crash_all | {"use_transactions": false} | integration | https://buildkite.com/redpanda/redpanda/builds/80231#019c2f21-a617-4018-9c66-a84a22045d3a | FLAKY | 23/31 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.1098, p0=0.0405, reject_threshold=0.0100. adj_baseline=0.2947, p1=0.3032, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all |
| VerifyConsumerOffsetsThruUpgrades | test_consumer_group_offsets | {"versions_to_upgrade": 2} | integration | https://buildkite.com/redpanda/redpanda/builds/80231#019c2f21-a61f-4ec3-af8c-faddbfd88b7e | FLAKY | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0009, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=VerifyConsumerOffsetsThruUpgrades&test_method=test_consumer_group_offsets |

@dotnwat dotnwat left a comment

Member

> likely because of machine/OS/build-type differences.

In what way is it failing?

pgellert commented Feb 6, 2026

Contributor Author

> in what way is it failing?

With a stack overflow. JSON schemas need to be validated against a JSON metaschema, and we use the jsoncons library to do that for us. Its validation logic uses a recursion-based DFS to walk the schema, which can trigger a stack overflow at deep recursion depths.
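
To illustrate why nesting depth translates into stack usage, here is a small sketch of a recursive walk in which every nested object costs one call frame. It does not use jsoncons and is not the validator's code: `make_nested_schema`, `walk`, and `nesting_depth` are hypothetical names, the schema shape is an assumption about what a deeply nested test input might look like, and the walker counts braces naively (it ignores braces inside string values).

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>

// Builds a schema nested `depth` levels deep through "properties".
// Hypothetical shape; the real test inputs may differ.
inline std::string make_nested_schema(int depth) {
    std::string s;
    for (int i = 0; i < depth; ++i) {
        s += R"({"type":"object","properties":{"p":)";
    }
    s += R"({"type":"string"})";
    for (int i = 0; i < depth; ++i) {
        s += "}}";
    }
    return s;
}

// Recursive descent over brace nesting: each '{' triggers one more call,
// so stack usage grows linearly with nesting, as in a recursive DFS.
// Throws once `limit` is exceeded instead of recursing further.
inline std::size_t walk(
  std::string_view s, std::size_t& pos, std::size_t depth, std::size_t limit) {
    if (depth > limit) {
        throw std::runtime_error("nesting too deep");
    }
    std::size_t deepest = depth;
    while (pos < s.size()) {
        const char c = s[pos++];
        if (c == '{') {
            deepest = std::max(deepest, walk(s, pos, depth + 1, limit));
        } else if (c == '}') {
            return deepest;
        }
    }
    return deepest;
}

// Returns the deepest brace nesting found, or throws if it exceeds `limit`.
inline std::size_t nesting_depth(std::string_view s, std::size_t limit) {
    std::size_t pos = 0;
    return walk(s, pos, 0, limit);
}
```

In this sketch each schema level opens two braces, so `make_nested_schema(n)` has a brace nesting of `2 * n + 1`. How many bytes each frame consumes depends on compiler, build type, and sanitizers, which is consistent with the overflow appearing at different depths on different machines.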

@pgellert pgellert requested a review from dotnwat February 6, 2026 09:21
@pgellert pgellert merged commit 5e266c2 into redpanda-data:dev Feb 6, 2026
21 checks passed