Skip to content

[data][train] Create a deepcopy of the data context on the split coordinator process#56211

Merged
justinvyu merged 7 commits intoray-project:masterfrom
justinvyu:ds_callback_run_context
Sep 4, 2025
Merged

[data][train] Create a deepcopy of the data context on the split coordinator process#56211
justinvyu merged 7 commits intoray-project:masterfrom
justinvyu:ds_callback_run_context

Conversation

@justinvyu
Copy link
Contributor

Summary

The main change of this PR is to create a deepcopy of the base dataset's context before setting the process-global context.

Otherwise, mutations to the base dataset's context during the planning phase are also propagated to the global context, which can affect future dataset executions launched from the same process.

Misc. drive-by changes

  • Utility to create a StorageContext from the RunConfig directly
  • Pipe the DatasetShardMetadata from the outermost level among other changes, for easier patching

… from runconfig

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@justinvyu justinvyu requested review from a team as code owners September 3, 2025 22:49
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a key fix to prevent state leakage in DataContext by using a deep copy. It also includes several nice refactorings, such as simplifying DatasetsSetupCallback's initialization by using TrainRunContext, and centralizing StorageContext creation within RunConfig.

I have a few suggestions:

  • A critical issue where a test is broken due to the refactoring of DatasetsSetupCallback.
  • A high-severity suggestion to improve the performance of the new storage_context property by caching its result.
  • A medium-severity suggestion to restore a helpful note in a docstring for better code maintainability.

Overall, the changes are well-structured and improve the codebase. Addressing the identified issues will make this PR even better.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
self._scaling_config = scaling_config
def __init__(self, train_run_context: TrainRunContext):
self._datasets = train_run_context.datasets
self._data_config = copy.deepcopy(train_run_context.dataset_config)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be deepcopied?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to deepcopy to avoid modifying the user's configs in place

@ray-gardener ray-gardener bot added train Ray Train Related Issue data Ray Data-related issues labels Sep 4, 2025
Copy link
Contributor

@srinathk10 srinathk10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

…allback_run_context

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@justinvyu justinvyu enabled auto-merge (squash) September 4, 2025 07:43
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Sep 4, 2025
@justinvyu justinvyu merged commit 6f3689a into ray-project:master Sep 4, 2025
6 of 7 checks passed
@justinvyu justinvyu deleted the ds_callback_run_context branch September 4, 2025 15:43
sampan-s-nayak pushed a commit to sampan-s-nayak/ray that referenced this pull request Sep 8, 2025
…dinator process (ray-project#56211)

The main change of this PR is to create a deepcopy of the base dataset's
context before setting the process-global context.

Otherwise, mutations to the base dataset's context during the planning
phase are also propagated to the global context, which can affect future
dataset executions launched from the same process.

Misc. drive-by changes:
* Utility to create a `StorageContext` from the `RunConfig` directly
* Pipe the `DatasetShardMetadata` from the outermost level among other
changes, for easier patching

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: sampan <sampan@anyscale.com>
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
…dinator process (ray-project#56211)

The main change of this PR is to create a deepcopy of the base dataset's
context before setting the process-global context.

Otherwise, mutations to the base dataset's context during the planning
phase are also propagated to the global context, which can affect future
dataset executions launched from the same process.

Misc. drive-by changes:
* Utility to create a `StorageContext` from the `RunConfig` directly
* Pipe the `DatasetShardMetadata` from the outermost level among other
changes, for easier patching

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
wyhong3103 pushed a commit to wyhong3103/ray that referenced this pull request Sep 12, 2025
…dinator process (ray-project#56211)

The main change of this PR is to create a deepcopy of the base dataset's
context before setting the process-global context.

Otherwise, mutations to the base dataset's context during the planning
phase are also propagated to the global context, which can affect future
dataset executions launched from the same process.

Misc. drive-by changes:
* Utility to create a `StorageContext` from the `RunConfig` directly
* Pipe the `DatasetShardMetadata` from the outermost level among other
changes, for easier patching

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
…dinator process (#56211)

The main change of this PR is to create a deepcopy of the base dataset's
context before setting the process-global context.

Otherwise, mutations to the base dataset's context during the planning
phase are also propagated to the global context, which can affect future
dataset executions launched from the same process.

Misc. drive-by changes:
* Utility to create a `StorageContext` from the `RunConfig` directly
* Pipe the `DatasetShardMetadata` from the outermost level among other
changes, for easier patching

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…dinator process (ray-project#56211)

The main change of this PR is to create a deepcopy of the base dataset's
context before setting the process-global context.

Otherwise, mutations to the base dataset's context during the planning
phase are also propagated to the global context, which can affect future
dataset executions launched from the same process.

Misc. drive-by changes:
* Utility to create a `StorageContext` from the `RunConfig` directly
* Pipe the `DatasetShardMetadata` from the outermost level among other
changes, for easier patching

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests train Ray Train Related Issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants