Skip to content

[data][train] Minor rework of get_dataset_shard#55825

Merged
justinvyu merged 6 commits intomasterfrom
rework_get_dataset_shard
Aug 23, 2025
Merged

[data][train] Minor rework of get_dataset_shard#55825
justinvyu merged 6 commits intomasterfrom
rework_get_dataset_shard

Conversation

@justinvyu
Copy link
Contributor

@justinvyu justinvyu commented Aug 22, 2025

Summary

Adds a slim wrapper around dataset shard iterators passed to each worker.

  • Replaced direct dataset_shards dictionary access with a provider pattern that encapsulates dataset shard retrieval logic
  • Updated DatasetsSetupCallback to return DatasetShardProvider instances instead of raw dataset shards
  • Modified TrainContext and related classes to use the new provider pattern
  • Added DatasetShardMetadata which can be extended for more flexible dataset sharding configuration in the future.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@justinvyu justinvyu requested a review from a team as a code owner August 22, 2025 01:53
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors how dataset shards are provided to workers by introducing a DatasetShardProvider. This is a good abstraction that improves encapsulation. However, I've identified two critical regressions. First, the new implementation no longer evaluates callable datasets, which would break existing user code. Second, the TrainContext no longer correctly handles cases where no datasets are provided, which will lead to a runtime error. I've provided suggestions to fix both of these issues.

@ray-gardener ray-gardener bot added train Ray Train Related Issue data Ray Data-related issues labels Aug 22, 2025
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@justinvyu
Copy link
Contributor Author

release test to verify no regressions https://buildkite.com/ray-project/release/builds/54747

@justinvyu justinvyu enabled auto-merge (squash) August 22, 2025 23:00
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Aug 22, 2025
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@github-actions github-actions bot disabled auto-merge August 23, 2025 00:18
@justinvyu justinvyu merged commit 729a3c0 into master Aug 23, 2025
5 checks passed
@justinvyu justinvyu deleted the rework_get_dataset_shard branch August 23, 2025 21:41
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
Adds a slim wrapper around dataset shard iterators passed to each
worker.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
Adds a slim wrapper around dataset shard iterators passed to each
worker.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Adds a slim wrapper around dataset shard iterators passed to each
worker.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests train Ray Train Related Issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants