-
Notifications
You must be signed in to change notification settings - Fork 1.5k
fix: extend recursive protection to prevent stack overflows in additional functions #16506
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
🤖 |
🤖: Benchmark completed Details
|
Thank you for this PR @ahmed-mez 🤔 the benchmarks seem to imply this has some non trivial overhead (slows down planning by 1-2%). I wonder if there is any way to restructure the affected codepaths to avoid the recursion (change them into an iterative version, for example, or Also, while I appreciate you can't provide the exact reproducer, I think it is important that we have some tests for this change, otherwise it is likely that over time as the code is refactored we'll either lose the annotations or the paths will be different and not annotated |
Thank you @alamb for reviewing and for running the benchmark. I will work on adding a test as a first step, that would also help us iterate and validate any changes later on. In general, performance vs reliability trade-offs depend heavily on the specific context and use cases. In my opinion, giving users the ability to make that decision themselves is a strong and flexible approach. In this case, the reliability concern isn’t minor. Avoiding stack overflows is critical, as they can bring down the entire process. Personally, I would gladly accept even more than a 1–2% slowdown in planning time to gain protection against such failures. I’d really appreciate it if DataFusion provided a "knob" that allowed users to express this kind of preference. Concretely, that'd be a code base where every single recursive function is protected and users have the choice to enable that protection. Currently there is the |
Yeah this makes sense to me to I was thinking sometimes we don't have to choose between "panic" and slower planning -- by restructuring the code sometimes we can do much better |
Which issue does this PR close?
Fixes stack overflows caused by deeply nested query plans during recursive function calls in various optimizer and expression evaluation paths.
Rationale for this change
DataFusion was experiencing stack overflow panics when processing deeply nested query plans. The issue occurred because many functions throughout the codebase perform recursive operations on query plans and expressions without stack overflow protection. When these operations encounter deeply nested structures, they can exhaust the call stack, causing the application to crash.
This change extends the existing recursive protection mechanism to additional functions across the optimizer, physical expression planner, and other core components that were previously unprotected.
Note that #16404 also helped mitigate the stack overflows I encountered with my reproducer, very likely because cloning
Expr
is not stack friendly especially whenExpr
is deeply nested. This is another data point in favour of #9577.What changes are included in this PR?
#[cfg_attr(feature = "recursive_protection", recursive::recursive)]
annotations to various recursive functions across multiple cratesrecursive_protection
feature todatafusion-physical-expr
anddatafusion-substrait
cratesAre these changes tested?
The changes are primarily defensive additions that activate existing recursion protection when the
recursive_protection
feature is enabled. The recursion protection mechanism itself is already used in the codebase. Ideally we should work on automating the detection of lacking recursion protection.About the reproducer: it's a 5MB+ Substrait JSON plan that's deeply nested (but valid). I don't see value in sharing it publicly as it contains various custom extensions and can't be used without an adequate Datadog-specific setup. However It helped me debug and validate the proposed changes.
Are there any user-facing changes?
No