[data][train] Refactor call_with_retry into shared library and use it to retry checkpoint upload #56608

Merged: justinvyu merged 10 commits into ray-project:master from TimothySeah:tseah/retry-checkpoint-upload on Sep 23, 2025
@TimothySeah (Contributor) commented on Sep 17, 2025

Summary

This PR moves `call_with_retry` from `ray/data/_internal` to `ray/_private` so that other libraries, such as Ray Train, can use it.

It also adds a new `retry` decorator that wraps `call_with_retry`. Note that I had to remove the bare `*` from `call_with_retry`'s parameter list to get the decorator to work on instance methods, because Python passes `self` as one of the `*args`.

Finally, it uses this decorator to retry checkpoint uploads.
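
For readers following along, here is a minimal standalone sketch of the pattern described in the summary: a retry helper that forwards `*args`, plus a decorator built on top of it. It is not the code from this PR. The signatures, the jittered backoff, and the `FlakyUploader` class are assumptions made for illustration; the point is only that `self` reaches the helper through `*args`, so the decorator works on methods.

```python
import functools
import logging
import random
import time
from typing import Any, Callable, List, Optional

logger = logging.getLogger(__name__)


def call_with_retry(
    f: Callable[..., Any],
    description: str,
    match: Optional[List[str]] = None,
    max_attempts: int = 10,
    max_backoff_s: int = 32,
    *args: Any,
    **kwargs: Any,
) -> Any:
    """Call f(*args, **kwargs), retrying failures with jittered backoff.

    Hypothetical signature: no bare `*`, so positional arguments
    (including `self` for methods) can be forwarded via *args.
    """
    assert max_attempts >= 1
    for attempt in range(max_attempts):
        try:
            return f(*args, **kwargs)
        except Exception as e:
            # With no match list, every exception is treated as retryable.
            retryable = match is None or any(p in str(e) for p in match)
            if not retryable or attempt == max_attempts - 1:
                raise
            backoff = min(2**attempt, max_backoff_s) * random.random()
            logger.debug("Retrying %s after %.2fs: %s", description, backoff, e)
            time.sleep(backoff)


def retry(description: str, match=None, max_attempts: int = 10, max_backoff_s: int = 32):
    """Decorator form of call_with_retry (illustrative only)."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # For a method, args[0] is `self`; it is forwarded unchanged.
            return call_with_retry(
                func, description, match, max_attempts, max_backoff_s, *args, **kwargs
            )

        return wrapper

    return decorator


class FlakyUploader:
    """Made-up class that fails twice before succeeding."""

    def __init__(self):
        self.calls = 0

    @retry(description="upload checkpoint", max_attempts=3, max_backoff_s=0)
    def upload(self, path: str) -> str:
        self.calls += 1
        if self.calls < 3:
            raise RuntimeError("transient failure")
        return f"uploaded {path}"
```

In this sketch, `FlakyUploader().upload("ckpt")` succeeds on the third attempt; `max_backoff_s=0` is used only to keep the example fast.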

Testing

Unit tests

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah requested review from a team as code owners September 17, 2025 02:40
@gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request effectively refactors the retry logic into a shared utility call_with_retry and introduces a convenient @retry decorator. The changes improve code modularity and reusability. The new decorator is correctly applied to add resilience to the checkpoint uploading process in Ray Train. I have a couple of suggestions to improve the logging clarity and documentation of the new retry utility.

@ray-gardener (bot) added the train, core, and data labels on Sep 17, 2025
@edoakes (Collaborator) commented on Sep 17, 2025

utilities used by multiple libraries should live in ray._common

please move it there and add standalone tests for the utility

@justinvyu (Contributor) left a comment

Nice!

)
)

@retry(description="upload checkpoint", max_attempts=3)

should we make this 3 configurable? also, I think we should match specific "upload errors" similar to data's whitelist:

https://github.com/anyscale/rayturbo/blob/788a223b6c933a303de0212b34883fdf0a1f4977/python/ray/data/context.py#L162

Otherwise we'll retry on the other "unretryable" errors that we raise explicitly already in _upload_checkpoint.

@TimothySeah (Contributor Author) replied:

Added TODO for retry configurability: could be a good extension to #55861.

Added COMMON_RETRYABLE_TOKENS lmk if that's fine.
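
To make the concern in this thread concrete, here is a small sketch of token-based matching: only errors whose message contains a known "transient" token are retried, and anything else propagates on the first attempt. The token list, function names, and error messages below are invented for illustration and are not the contents of `COMMON_RETRYABLE_TOKENS`; backoff is omitted to keep the example short.

```python
from typing import Any, Callable, Sequence, Tuple

# Illustrative tokens only; not the actual allowlist used by Ray.
ILLUSTRATIVE_RETRYABLE_TOKENS = ("Connection reset", "timed out")


def is_retryable(exc: Exception, tokens: Sequence[str]) -> bool:
    """Return True if the exception message contains any allowlisted token."""
    return any(token in str(exc) for token in tokens)


def run_with_token_retry(
    fn: Callable[[], Any],
    tokens: Sequence[str],
    max_attempts: int = 3,
) -> Tuple[Any, int]:
    """Call fn, retrying only on allowlisted errors.

    Returns (result, number_of_attempts). Errors that do not match the
    allowlist are re-raised immediately instead of being retried.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return fn(), attempts
        except Exception as exc:
            if attempts >= max_attempts or not is_retryable(exc, tokens):
                raise
```

Under this scheme, an `OSError("Connection reset by peer")` would be retried, while an explicitly raised, non-transient error such as a missing checkpoint directory would surface after a single attempt.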

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
@edoakes added the go label (add ONLY when ready to merge, run all tests) on Sep 19, 2025
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
@justinvyu justinvyu merged commit c4be355 into ray-project:master Sep 23, 2025
6 checks passed
ZacAttack pushed a commit to ZacAttack/ray that referenced this pull request Sep 24, 2025
… to retry checkpoint upload (ray-project#56608)

This PR moves `call_with_retry` from `ray/data/_internal` to
`ray/_private` so that it can be used in other libraries like Ray Train.

It also adds a new `retry` decorator that wraps around
`call_with_retry`. Note that I had to remove `*` from
`call_with_retry`'s arguments to get the decorator to work on Python
object methods because Python passes `self` as one of the `*args`.

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: zac <zac@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Sep 24, 2025
marcostephan pushed a commit to marcostephan/ray that referenced this pull request Sep 24, 2025
elliot-barn pushed a commit that referenced this pull request Sep 27, 2025
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025