[data][train] Refactor call_with_retry into shared library and use it to retry checkpoint upload#56608
Conversation
Signed-off-by: Timothy Seah <tseah@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request effectively refactors the retry logic into a shared utility call_with_retry and introduces a convenient @retry decorator. The changes improve code modularity and reusability. The new decorator is correctly applied to add resilience to the checkpoint uploading process in Ray Train. I have a couple of suggestions to improve the logging clarity and documentation of the new retry utility.
|
utilities used by multiple libraries should live in please move it there and add standalone tests for the utility |
| ) | ||
| ) | ||
|
|
||
| @retry(description="upload checkpoint", max_attempts=3) |
There was a problem hiding this comment.
should we make this 3 configurable? also, I think we should match specific "upload errors" similar to data's whitelist:
Otherwise we'll retry on the other "unretryable" errors that we raise explicitly already in _upload_checkpoint.
There was a problem hiding this comment.
Added TODO for retry configurability: could be a good extension to #55861.
Added COMMON_RETRYABLE_TOKENS lmk if that's fine.
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
… to retry checkpoint upload (ray-project#56608) This PR moves `call_with_retry` from `ray/data/_internal` to `ray/_private` so that it can be used in other libraries like Ray Train. It also adds a new `retry` decorator that wraps around `call_with_retry`. Note that I had to remove `*` from `call_with_retry`'s arguments to get the decorator to work on Python object methods because Python passes `self` as one of the `*args`. --------- Signed-off-by: Timothy Seah <tseah@anyscale.com> Signed-off-by: zac <zac@anyscale.com>
… to retry checkpoint upload (#56608) This PR moves `call_with_retry` from `ray/data/_internal` to `ray/_private` so that it can be used in other libraries like Ray Train. It also adds a new `retry` decorator that wraps around `call_with_retry`. Note that I had to remove `*` from `call_with_retry`'s arguments to get the decorator to work on Python object methods because Python passes `self` as one of the `*args`. --------- Signed-off-by: Timothy Seah <tseah@anyscale.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
… to retry checkpoint upload (ray-project#56608) This PR moves `call_with_retry` from `ray/data/_internal` to `ray/_private` so that it can be used in other libraries like Ray Train. It also adds a new `retry` decorator that wraps around `call_with_retry`. Note that I had to remove `*` from `call_with_retry`'s arguments to get the decorator to work on Python object methods because Python passes `self` as one of the `*args`. --------- Signed-off-by: Timothy Seah <tseah@anyscale.com> Signed-off-by: Marco Stephan <marco@magic.dev>
… to retry checkpoint upload (#56608) This PR moves `call_with_retry` from `ray/data/_internal` to `ray/_private` so that it can be used in other libraries like Ray Train. It also adds a new `retry` decorator that wraps around `call_with_retry`. Note that I had to remove `*` from `call_with_retry`'s arguments to get the decorator to work on Python object methods because Python passes `self` as one of the `*args`. --------- Signed-off-by: Timothy Seah <tseah@anyscale.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
… to retry checkpoint upload (#56608) This PR moves `call_with_retry` from `ray/data/_internal` to `ray/_private` so that it can be used in other libraries like Ray Train. It also adds a new `retry` decorator that wraps around `call_with_retry`. Note that I had to remove `*` from `call_with_retry`'s arguments to get the decorator to work on Python object methods because Python passes `self` as one of the `*args`. --------- Signed-off-by: Timothy Seah <tseah@anyscale.com> Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
… to retry checkpoint upload (ray-project#56608) This PR moves `call_with_retry` from `ray/data/_internal` to `ray/_private` so that it can be used in other libraries like Ray Train. It also adds a new `retry` decorator that wraps around `call_with_retry`. Note that I had to remove `*` from `call_with_retry`'s arguments to get the decorator to work on Python object methods because Python passes `self` as one of the `*args`. --------- Signed-off-by: Timothy Seah <tseah@anyscale.com>
… to retry checkpoint upload (ray-project#56608) This PR moves `call_with_retry` from `ray/data/_internal` to `ray/_private` so that it can be used in other libraries like Ray Train. It also adds a new `retry` decorator that wraps around `call_with_retry`. Note that I had to remove `*` from `call_with_retry`'s arguments to get the decorator to work on Python object methods because Python passes `self` as one of the `*args`. --------- Signed-off-by: Timothy Seah <tseah@anyscale.com>
… to retry checkpoint upload (ray-project#56608) This PR moves `call_with_retry` from `ray/data/_internal` to `ray/_private` so that it can be used in other libraries like Ray Train. It also adds a new `retry` decorator that wraps around `call_with_retry`. Note that I had to remove `*` from `call_with_retry`'s arguments to get the decorator to work on Python object methods because Python passes `self` as one of the `*args`. --------- Signed-off-by: Timothy Seah <tseah@anyscale.com> Signed-off-by: Future-Outlier <eric901201@gmail.com>
Summary
This PR moves
call_with_retryfromray/data/_internaltoray/_privateso that it can be used in other libraries like Ray Train.It also adds a new
retrydecorator that wraps aroundcall_with_retry. Note that I had to remove*fromcall_with_retry's arguments to get the decorator to work on Python object methods because Python passesselfas one of the*args.Finally, it uses this decorator to retry checkpoint uploads.
Testing
Unit tests