Skip to content

[Train][Tune] Support reading train result from cloud storage#40622

Merged
justinvyu merged 12 commits intoray-project:masterfrom
CodecLondon:result-from-cloud-storage
Nov 17, 2023
Merged

[Train][Tune] Support reading train result from cloud storage#40622
justinvyu merged 12 commits intoray-project:masterfrom
CodecLondon:result-from-cloud-storage

Conversation

@ahmed-mahran
Copy link
Contributor

@ahmed-mahran ahmed-mahran commented Oct 24, 2023

This is to support reading Result object from remote/cloud storage systems like s3 and gcs.

Why are these changes needed?

Result can already be stored in remote/cloud storage. There should be a functionality to read it back.

result = Result.from_path("s3://storage_path/experiment_dir/trial_dir/", storage_filesystem=...)
# result.best_checkpoints

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

This is to support reading Result object from remote storage systems like s3 and gcs.
@ahmed-mahran
Copy link
Contributor Author

@woshiyyya would you take a look please

@ahmed-mahran
Copy link
Contributor Author

@justinvyu @krfricke

Signed-off-by: Ahmed Mahran <ahmahran@gmail.com>
@justinvyu
Copy link
Contributor

Thanks, this is looking really good! I'll do a review sometime later today.

@justinvyu justinvyu self-assigned this Nov 7, 2023
Copy link
Contributor

@justinvyu justinvyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great PR 🤩

I have a few comments, and I can help you out with it.

@ahmed-mahran
Copy link
Contributor Author

Thanks @justinvyu for the thorough review! I'll try to address your comments early next week.

@ahmed-mahran ahmed-mahran force-pushed the result-from-cloud-storage branch from 7c15e5c to 36ce2cd Compare November 13, 2023 20:26
@ahmed-mahran
Copy link
Contributor Author

@justinvyu, I think I've addressed your feedback.

Copy link
Contributor

@justinvyu justinvyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Apart from needing to fix the lint errors, I just have some small comments.

Comment on lines 310 to 316
def _read_file_as_str(
storage_filesystem: pyarrow.fs.FileSystem,
storage_path: str,
compression: Optional[str] = "detect",
buffer_size: Optional[int] = None,
encoding: Optional[str] = "utf-8",
) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm okay with adding this helper, but also it might be clearer to just do the filesystem operations directly? We can add this in the future if it turns out we need to read a file as text directly from cloud very often.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it would be cleaner and clearer to do it this way especially the logic is used twice (for json and csv). I was going to make it inner function in from_path function. I can move as a static method in Result like the old _validate_trial_dir.

Copy link
Contributor

@justinvyu justinvyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work! Approved and pending tests passing. 🙌

@justinvyu
Copy link
Contributor

Ok, the mock_s3_bucket_uri fixture is actually not in the conftest for ray.air.tests. Let's move this unit test to ray.train.tests instead and add the conftest as a dependency.

  1. Move
    py_test(
    to https://github.com/ray-project/ray/blob/master/python/ray/train/BUILD (sorted by alphabetical order)
  2. Add conftest as a dependency:
py_test(
    name = "test_result",
    size = "medium",
    srcs = ["tests/test_result.py"],
    tags = ["team:ml", "exclusive"],
    deps = [":ml_lib", ":conftest"]
)

ray-project#40622 (comment)

The test requires mock_s3_bucket_uri fixture which is in conftest for ray.train.tests
@ahmed-mahran ahmed-mahran force-pushed the result-from-cloud-storage branch from 9b052a4 to 7198c11 Compare November 16, 2023 07:53
@justinvyu
Copy link
Contributor

Sorry, it is :train_lib instead of :ml_lib 😅

@justinvyu justinvyu merged commit 7c7eaf2 into ray-project:master Nov 17, 2023
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Nov 29, 2023
…oject#40622)

This is to support reading Result object from remote/cloud storage systems like s3 and gcs.

---------

Signed-off-by: Ahmed Mahran <ahmahran@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants