[Train] Add a barrier in RayTrainReportCallback to ensure synchronous reporting.#40875

Merged
matthewdeng merged 1 commit into ray-project:master from woshiyyya:train/fix_lightning_report_callback
Nov 3, 2023

@woshiyyya
Member

Why are these changes needed?

ray.train.report ensures that all workers enter it at the same time, but there is no barrier ensuring that they also exit at the same time. This can lead to a FileNotFoundError in RayTrainReportCallback: the local rank 0 worker exits early and deletes the checkpoint folder while other workers are still uploading it.

This PR fixes the bug by adding a barrier call right after ray.train.report.
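
The race and the effect of the extra barrier can be sketched with a small standard-library simulation. This is not the Ray or Lightning code from the PR: threads stand in for workers, `threading.Barrier` stands in for the distributed barrier, and `run_workers`, `entry_barrier`, `exit_barrier`, and the `model.ckpt` file name are hypothetical names chosen for illustration.

```python
import os
import shutil
import tempfile
import threading
import time

NUM_WORKERS = 4  # hypothetical world size for the simulation


def run_workers(use_exit_barrier: bool) -> list:
    """Return the ranks that hit FileNotFoundError while reading the checkpoint."""
    ckpt_dir = tempfile.mkdtemp()
    ckpt_file = os.path.join(ckpt_dir, "model.ckpt")
    with open(ckpt_file, "w") as f:
        f.write("weights")

    # Stands in for the synchronization on entering the report call.
    entry_barrier = threading.Barrier(NUM_WORKERS)
    # Stands in for the barrier added right after the report call.
    exit_barrier = threading.Barrier(NUM_WORKERS)
    failed_ranks = []

    def worker(rank: int) -> None:
        entry_barrier.wait()
        try:
            # Simulated upload: higher ranks take longer to get to the file.
            time.sleep(0.1 * rank)
            with open(ckpt_file) as f:
                f.read()
        except FileNotFoundError:
            failed_ranks.append(rank)
        if use_exit_barrier:
            # Nobody proceeds to cleanup until every worker has finished reading.
            exit_barrier.wait()
        if rank == 0:
            # Rank 0 owns the temporary checkpoint folder and removes it.
            shutil.rmtree(ckpt_dir, ignore_errors=True)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(failed_ranks)


if __name__ == "__main__":
    print("without exit barrier, failed ranks:", run_workers(use_exit_barrier=False))
    print("with exit barrier, failed ranks:", run_workers(use_exit_barrier=True))
```

Without the exit barrier, rank 0 typically deletes the folder before the slower ranks open the file, so they fail. With the barrier, the deletion cannot start until every worker has finished reading, so no rank should fail.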

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@woshiyyya woshiyyya marked this pull request as ready for review November 2, 2023 01:26
@woshiyyya woshiyyya requested a review from matthewdeng November 2, 2023 01:26
@woshiyyya woshiyyya requested a review from justinvyu November 2, 2023 01:27
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
@woshiyyya woshiyyya force-pushed the train/fix_lightning_report_callback branch from 69756f4 to ab71a07 Compare November 2, 2023 01:29
@matthewdeng matthewdeng merged commit c1e387f into ray-project:master Nov 3, 2023
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Nov 29, 2023
… reporting. (ray-project#40875)

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>