[train] fix maximum recursion issue when serializing exceptions #43952
matthewdeng merged 6 commits into ray-project:master
Signed-off-by: Matthew Deng <matt@anyscale.com>
I see. So it's because:
assert i == levels - start_traceback + 1
def test_recursion():
Any ideas on why Maximum recursion happens iff pickling_support.install() is called? Should we also test it here?
Seems like a bug in tblib.
Good point on testing. Originally I was going to remove tblib as a dependency in a separate PR, but even if I do I can add it as a test dependency.
woshiyyya left a comment
Thanks for the great effort delving deep into this common but intricate bug!
Minor correction here, in this case
justinvyu left a comment
Great debugging job on this hard problem!
# Else, make sure nested exceptions are properly skipped
# Perform a shallow copy to prevent recursive __cause__/__context__.
new_exc = copy.copy(exc).with_traceback(exc.__traceback__)
Note: with_traceback is needed so that the traceback shows the original line that errored, rather than this line where the copy is happening.
Does making a shallow copy remove the __context__ so that the new_exc has no context?
Yes, but more importantly the __context__ gets set (to the StartTraceback) after the exception gets raised, so we want to make sure there is no nested __cause__ or __context__ that points back to this exception.
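To make the two points in this thread concrete, here is a minimal standalone sketch (not code from this PR) of what copy.copy does to an exception on CPython: the copy keeps the type and args but starts with no __traceback__, __cause__, or __context__, and with_traceback puts the original traceback back. The helper names fail, capture, and innermost_function are hypothetical and exist only for illustration.

```python
import copy


def fail():
    raise ValueError("boom")  # the line the traceback should keep pointing at


def innermost_function(exc):
    """Return the name of the function in the deepest traceback frame."""
    tb = exc.__traceback__
    while tb.tb_next is not None:
        tb = tb.tb_next
    return tb.tb_frame.f_code.co_name


def capture():
    """Return a ValueError that was raised while a KeyError was being handled."""
    try:
        try:
            raise KeyError("earlier")
        except KeyError:
            fail()  # implicit chaining sets ValueError.__context__ to the KeyError
    except ValueError as exc:
        return exc


exc = capture()

# A plain shallow copy is rebuilt from the type and args, so the chaining
# attributes and the traceback are not carried over.
bare_copy = copy.copy(exc)

# Same expression as in the diff: copy, then reattach the original traceback.
new_exc = copy.copy(exc).with_traceback(exc.__traceback__)

print(type(exc.__context__).__name__, bare_copy.__context__, bare_copy.__traceback__)
print(innermost_function(new_exc))
```

The copy only drops what was attached before the copy was made. Once new_exc is raised inside an except block, Python sets its __context__ again, which is the situation the reply above describes.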
Why are these changes needed?
This fixes an issue where the output exception is non-serializable due to maximum recursion. This issue surfaces only when tblib is enabled, which happens by default on certain imports (e.g. tensorflow or any other libraries that import tensorflow transitively).
The problem occurs because the exception returned by skip_traceback will point to the original exception that was raised. When this is re-raised, it now has a reference to the StartTraceback as the __context__ or __cause__. This original exception is also the __cause__ of the StartTraceback, which leads to infinite recursion in trying to traverse these exceptions.
The solution is to make a shallow copy of the final exception, with the original __traceback__ retained so that the output retains the original traceback and not the one where it is re-raised.
Repro Script:
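The following is not the PR's repro script. It is a minimal sketch of the reference cycle described above, using only the standard library and no tblib or Ray code. The StartTraceback class here is a hypothetical stand-in for the marker exception named in the description, and user_code, wrapped, reraise, and points_back_to_itself are illustrative helpers, not functions from the codebase.

```python
import copy


class StartTraceback(Exception):
    """Hypothetical stand-in for the marker exception named in the PR."""


def user_code():
    raise ValueError("original failure")


def wrapped():
    try:
        user_code()
    except Exception as e:
        raise StartTraceback from e  # StartTraceback.__cause__ is the ValueError


def reraise(shallow_copy):
    """Re-raise the underlying exception while StartTraceback is being handled."""
    try:
        wrapped()
    except StartTraceback as marker:
        exc = marker.__cause__
        if shallow_copy:
            # The fix from the PR: copy, but keep the original traceback.
            exc = copy.copy(exc).with_traceback(exc.__traceback__)
        try:
            raise exc  # implicit chaining sets exc.__context__ to marker
        except ValueError as out:
            return out, marker


def points_back_to_itself(exc):
    """Return True if following __cause__/__context__ from exc leads back to exc."""
    seen = set()
    stack = [exc.__cause__, exc.__context__]
    while stack:
        cur = stack.pop()
        if cur is None or id(cur) in seen:
            continue
        if cur is exc:
            return True
        seen.add(id(cur))
        stack.extend([cur.__cause__, cur.__context__])
    return False


cyclic, _ = reraise(shallow_copy=False)
fixed, marker = reraise(shallow_copy=True)

print(points_back_to_itself(cyclic))
print(points_back_to_itself(fixed))
```

Without the copy, the re-raised exception and the StartTraceback reference each other, so a serializer that follows __cause__ and __context__ recursively never terminates. With the copy, the chain ends at the original exception, which no longer leads back to the object being raised.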
Before:
Related issue number
Checks