[train][release] Attach a quick checkpoint when reporting metrics #56718

Merged
justinvyu merged 8 commits into ray-project:master from liulehui:v2_migration
Sep 24, 2025

@liulehui liulehui commented Sep 18, 2025

  1. In Train V2, free-floating metrics that have no corresponding checkpoint are no longer saved automatically. For context, see https://docs.ray.io/en/master/train/user-guides/monitoring-logging.html#deprecated-reporting-free-floating-metrics.
  2. Attaching a quick checkpoint to the reported metrics lets this test run on Train V2, which is enabled by setting RAY_TRAIN_V2_ENABLED=1.
  3. Train V2 uses a 2s polling interval for train.report(), and this test only cares about the final loss, so the metrics reporting moves out of the epoch loop and happens once at the end of the training loop.
  4. Example run: https://buildkite.com/ray-project/release/builds/59517
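
The pattern described in items 2 and 3 can be sketched roughly as follows. This is an illustration, not the code from this PR: `train_loop`, `run_epoch`, `report`, and `make_checkpoint` are hypothetical names, and the reporting and checkpoint functions are passed in as arguments so the sketch runs without Ray installed. In a real Train V2 job they would presumably be `ray.train.report` and `ray.train.Checkpoint.from_directory`.

```python
import os
import tempfile
from typing import Any, Callable


def train_loop(
    num_epochs: int,
    run_epoch: Callable[[int], float],
    report: Callable[..., None],
    make_checkpoint: Callable[[str], Any],
) -> None:
    """Run every epoch, then report the final loss once with a tiny checkpoint.

    `report` and `make_checkpoint` are injected stand-ins (an assumption of
    this sketch) for `ray.train.report` and
    `ray.train.Checkpoint.from_directory`.
    """
    loss = float("nan")
    for epoch in range(num_epochs):
        # No per-epoch report: only the final loss matters for this test.
        loss = run_epoch(epoch)

    with tempfile.TemporaryDirectory() as tmpdir:
        # A "quick" checkpoint: a single marker file is enough to make the
        # metrics non-free-floating.
        with open(os.path.join(tmpdir, "marker.txt"), "w") as f:
            f.write("final")
        # Report while the directory still exists.
        report({"loss": loss}, checkpoint=make_checkpoint(tmpdir))
```

Reporting inside the `with` block matters in this sketch: the temporary directory is deleted on exit, so the checkpoint has to be handed over before then.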

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Lehui Liu <lehui@anyscale.com>
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request updates the torch benchmark to be compatible with Ray Train V2 by attaching a checkpoint when reporting metrics. However, the current implementation has a critical issue where train.report() is only called by the rank 0 worker. In a distributed setting, all workers must call train.report(), otherwise the job will likely hang. I've provided a suggestion to fix this by moving the train.report() call out of the rank-check conditional block, ensuring all workers participate in reporting.
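
To illustrate the concern raised in this review, here is a toy model in which reporting behaves like a barrier that every worker must reach. It uses `threading.Barrier` as a stand-in and says nothing about how Ray Train implements `train.report()` internally; `run_workers` and its parameters are hypothetical, and the timeout exists only so the demo ends instead of hanging.

```python
import threading
from typing import List, Optional


def run_workers(
    world_size: int,
    only_rank_zero_reports: bool,
    timeout_s: float = 0.5,
) -> List[Optional[str]]:
    """Simulate workers where `report` acts as a barrier across all ranks."""
    barrier = threading.Barrier(world_size)
    outcomes: List[Optional[str]] = [None] * world_size

    def report() -> str:
        # Stand-in for a synchronizing report call (assumption of this sketch).
        try:
            barrier.wait(timeout=timeout_s)
            return "reported"
        except threading.BrokenBarrierError:
            # A real job would wait indefinitely here; the demo gives up.
            return "timed out"

    def worker(rank: int) -> None:
        if only_rank_zero_reports and rank != 0:
            # The problematic pattern: non-zero ranks never call report.
            outcomes[rank] = "skipped"
            return
        outcomes[rank] = report()

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcomes
```

With `only_rank_zero_reports=True`, rank 0 waits for peers that never arrive and gives up after the timeout. With `False`, every rank passes the barrier.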

Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
@ray-gardener ray-gardener bot added the train (Ray Train Related Issue) and release-test (release test) labels Sep 19, 2025
Signed-off-by: Lehui Liu <lehui@anyscale.com>
cursor[bot]

This comment was marked as outdated.

Signed-off-by: Lehui Liu <lehui@anyscale.com>
cursor[bot]

This comment was marked as outdated.

Signed-off-by: Lehui Liu <lehui@anyscale.com>
@liulehui liulehui requested a review from a team September 22, 2025 20:29
@justinvyu justinvyu left a comment

Thanks!

Signed-off-by: Lehui Liu <lehui@anyscale.com>
@liulehui liulehui added the go (add ONLY when ready to merge, run all tests) label Sep 23, 2025
@justinvyu justinvyu merged commit 84e70db into ray-project:master Sep 24, 2025
7 checks passed
elliot-barn pushed a commit that referenced this pull request Sep 27, 2025
…6718)

In Train V2, free-floating metrics without a corresponding checkpoint
are no longer saved automatically. Attach a checkpoint to the reported
metrics so that the release test can still emit metrics such as the
local time taken by each train worker.

---------

Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
Signed-off-by: Lehui Liu <lehui@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Labels

  • go: add ONLY when ready to merge, run all tests
  • release-test: release test
  • train: Ray Train Related Issue
