Skip to content

[train] fix maximum recursion issue when serializing exceptions #43952

Merged
matthewdeng merged 6 commits intoray-project:masterfrom
matthewdeng:recursive-exceptions
Mar 18, 2024
Merged

[train] fix maximum recursion issue when serializing exceptions #43952
matthewdeng merged 6 commits intoray-project:masterfrom
matthewdeng:recursive-exceptions

Conversation

@matthewdeng
Copy link
Contributor

@matthewdeng matthewdeng commented Mar 13, 2024

Why are these changes needed?

This fixes an issue where the output exception is non-serializable due to maximum recursion. This issue surfaces only when tblib is enabled, which happens by default on certain imports (e.g. tensorflow or any other libraries that import tensorflow transitively).

The problem occurs because the exception returned by skip_traceback will point to the original exception that was raised. When this is re-raised, it now has a reference to the StartTraceback as the __context__ or __cause__. This original exception is also the __cause__ of the StartTraceback, which leads to infinite recursion in trying to traverse these exceptions.

The solution is to make a shallow copy of the final exception, with the original __traceback__ retained so that the output retains the original traceback and not the one where it is re-raised.

Repro Script:

from ray import cloudpickle
from ray.air._internal.util import StartTraceback, skip_exceptions
from tblib import pickling_support

pickling_support.install() # Maximum recursion happens iff this is called.

try:
    try:
        try:
            raise Exception("Root Exception")
        except Exception as root_exception:
            raise StartTraceback from root_exception
    except Exception as start_traceback:
        raise skip_exceptions(start_traceback)
except Exception as skipped_exception:
    cloudpickle.dumps(skipped_exception)

Before:

Traceback (most recent call last):
  File "/home/ray/default/repro.py", line 12, in <module>
    raise StartTraceback from root_exception
ray.air._internal.util.StartTraceback

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ray/default/repro.py", line 14, in <module>
    raise skip_exceptions(start_traceback)
  File "/home/ray/default/repro.py", line 10, in <module>
    raise Exception("Root Exception")
Exception: Root Exception

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/cloudpickle/cloudpickle.py", line 1245, in dump
    return super().dump(obj)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/cloudpickle/cloudpickle.py", line 1321, in reducer_override
    t = type(obj)
RecursionError: maximum recursion depth exceeded while calling a Python object

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ray/default/repro.py", line 16, in <module>
    cloudpickle.dumps(skipped_exception)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/cloudpickle/cloudpickle.py", line 1479, in dumps
    cp.dump(obj)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/cloudpickle/cloudpickle.py", line 1249, in dump
    raise pickle.PicklingError(msg) from e
_pickle.PicklingError: Could not pickle object as excessively deep recursion required.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: Matthew Deng <matt@anyscale.com>
@woshiyyya
Copy link
Member

woshiyyya commented Mar 15, 2024

I see. So it's because:

  1. This makes StartTraceback.__cause__ == root_exception
    except Exception as root_exception:
            raise StartTraceback from root_exception
  1. skip_exceptions(start_traceback) returns the original root_exception.

  2. This makes root_exception.__cause__ == StartTraceback, which makes it a circular reference.

except Exception as start_traceback:
        raise skip_exceptions(start_traceback)

assert i == levels - start_traceback + 1


def test_recursion():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any ideas on why Maximum recursion happens iff pickling_support.install() is called? Should we also test it here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a bug in tblib.

Good point on testing. Originally I was going to remove tblib as a dependency in a separate PR, but even if I do I can add it as a test dependency.

Copy link
Member

@woshiyyya woshiyyya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great effort delving deep into this common but Intricate bug!

@matthewdeng
Copy link
Contributor Author

This makes root_exception.__cause__ == StartTraceback, which makes it a circular reference.

except Exception as start_traceback:
        raise skip_exceptions(start_traceback)

Minor correction here, in this case root_exection.__context__ == StartTraceback, but similarly this would create the circular reference!

Signed-off-by: Matthew Deng <matt@anyscale.com>
Copy link
Contributor

@justinvyu justinvyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great debugging job on this hard problem!


# Else, make sure nested exceptions are properly skipped
# Perform a shallow copy to prevent recursive __cause__/__context__.
new_exc = copy.copy(exc).with_traceback(exc.__traceback__)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: with_traceback is needed so that the traceback shows the original line that errored, rather than this line where the copy is happening.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does making a shallow copy remove the __context__ so that the new_exc has no context?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but more importantly the __context__ gets set (to the StartTraceback) after the exception gets raised, so we want to make sure there is no nested __cause__ or __context__ that points back to this exception.

@matthewdeng matthewdeng marked this pull request as ready for review March 18, 2024 21:33
@matthewdeng matthewdeng merged commit 4e693f6 into ray-project:master Mar 18, 2024
@matthewdeng matthewdeng deleted the recursive-exceptions branch March 18, 2024 21:34
stephanie-wang pushed a commit to stephanie-wang/ray that referenced this pull request Mar 27, 2024
votrou pushed a commit to pinterest/ray that referenced this pull request Oct 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants