Added Ray Train & PyTorch Lightning demo #559
Bobbins228 wants to merge 1 commit into project-codeflare:main from
Conversation
Are we sure the data is shared across workers?
I looked into this and would say probably not, now that I have found out that DistributedSampler exists.
I will update this script and the llama2 one to make use of the DistributedSampler 👍
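For context on why the sampler matters: without one, every worker iterates over the full dataset, so the data is duplicated rather than split. The sketch below is a pure-Python illustration of the index partitioning that PyTorch's `DistributedSampler` performs when shuffling is off and `drop_last` is false. The function name `shard_indices` is hypothetical and is not taken from the PR; in the real script one would pass `sampler=DistributedSampler(dataset)` to the `DataLoader` instead.

```python
import math


def shard_indices(dataset_len: int, num_replicas: int, rank: int) -> list[int]:
    """Return the sample indices one worker would read (no shuffling).

    Hypothetical helper for illustration only. Assumes
    dataset_len >= num_replicas so that padding can be taken
    from the start of the index list in a single slice.
    """
    if not 0 <= rank < num_replicas:
        raise ValueError("rank must be in [0, num_replicas)")
    if dataset_len < num_replicas:
        raise ValueError("this sketch assumes dataset_len >= num_replicas")

    indices = list(range(dataset_len))

    # Pad by repeating the leading indices so every worker gets the same count.
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas
    indices += indices[: total_size - dataset_len]

    # Each worker takes every num_replicas-th index, starting at its own rank.
    return indices[rank:total_size:num_replicas]


if __name__ == "__main__":
    for rank in range(3):
        print(rank, shard_indices(10, 3, rank))
```

With 10 samples and 3 workers, each worker gets 4 indices, and two samples are repeated to keep the shard sizes equal.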
b29c031 to 705e0cf
705e0cf to 29baf39
# Based on https://docs.ray.io/en/latest/train/getting-started-pytorch-lightning.html

"""
Note: This example requires an S3 compatible storage bucket for distributed training. Please visit our documentation for more information -> https://github.com/project-codeflare/codeflare-sdk/blob/main/docs/s3-compatible-storage.md
How do I configure what path to actually use within the bucket for the distributed training?
I created my own bucket with its own path via AWS and gathered the URI using the UI.
s3://mark-bucket/data/
I was not aware we had a shared bucket, but you could create a new folder within it and then copy the URI from there.
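To make the bucket-versus-path distinction concrete, here is a minimal sketch of how an S3 URI such as the one above splits into a bucket name and a key prefix, and how a per-user folder inside a shared bucket would change only the prefix. The helper names `split_s3_uri` and `with_subfolder` and the folder name `lightning-demo` are hypothetical, chosen only for illustration; the resulting URI is what you would supply wherever the script asks for the storage location.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/some/prefix/' into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The bucket is the network location; the rest is the key prefix.
    return parsed.netloc, parsed.path.lstrip("/")


def with_subfolder(uri: str, folder: str) -> str:
    """Return a URI pointing at a folder nested under the given URI."""
    bucket, prefix = split_s3_uri(uri)
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    return f"s3://{bucket}/{prefix}{folder.strip('/')}/"


if __name__ == "__main__":
    print(split_s3_uri("s3://mark-bucket/data/"))
    # "lightning-demo" is a made-up folder name for this example.
    print(with_subfolder("s3://mark-bucket/data/", "lightning-demo"))
```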
varshaprasad96 left a comment
I made a couple of changes locally to be able to run the notebooks, but I'm sure it's me messing something up in the setup 😅
Going to /lgtm and /approve this from my end since it works overall and my workarounds are unrelated! :))
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: varshaprasad96
The full list of commands accepted by this bot can be found here.
The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@varshaprasad96 The only changes needed would have been the addition of S3 or MinIO storage. Is that what you had to change?
That, and I'm not exactly sure of the right steps to run these notebooks. I had to create a separate venv, install all the deps, change the import references, and then run this. Is there something I was missing in the configuration needed to reproduce the demos?
On RHOAI, you should be able to clone the repo and check out this PR branch from a terminal in your workbench.
I see! I had been using a ROSA cluster, manually installing the components (not through the OpenShift AI operator) and trying to run the examples. This seems similar to what you mentioned. Will check it out again!
Issue link
RHOAIENG-7805
What changes have been made
Added a demo notebook and Python script based on the Ray Train & PyTorch Lightning example provided by Ray.
Verification steps
Setup
Notebook server ODH/RHOAI/Local
git clone https://github.com/project-codeflare/codeflare-sdk.git
pip install codeflare-sdk
Testing
Run through the entire demo notebook.
Test the MinIO and S3 persistent storage examples separately by following the comments in pytorch_lightning.py
A few things to note:
Checks