do not submit - just provide comparison baseline #63388
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Code Review
This pull request modifies the ray.train.DataConfig in ray_dataloader_factory.py by hardcoding datasets_to_split to an empty list. Feedback indicates that this change disables default sharding, which may lead to incorrect performance metrics or resource issues in distributed training; it is recommended to make this setting configurable or provide a clear explanation for disabling sharding.
    def get_ray_data_config(self) -> ray.train.DataConfig:
        return ray.train.DataConfig(
            datasets_to_split=[],
Hardcoding datasets_to_split=[] disables the default sharding behavior in Ray Train. This causes every training worker to process the entire dataset rather than a shard, which is generally not the intended behavior for distributed training benchmarks and can lead to excessive resource consumption or incorrect performance metrics. If this is for a specific baseline comparison, it would be better to make this configurable in RayDataConfig or add a comment explaining why sharding is being disabled.
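One way the "make this configurable" suggestion could look is sketched below. This is only an illustration under stated assumptions: `RayDataConfigSketch` and `data_config_kwargs` are hypothetical names that do not appear in the PR, and the real `RayDataConfig` class may be structured differently. The sketch relies only on the fact that `ray.train.DataConfig` accepts `datasets_to_split` as either `"all"` (the default, which shards every dataset across workers) or a list of dataset names.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Literal, Union


@dataclass
class RayDataConfigSketch:
    """Hypothetical stand-in for the benchmark's RayDataConfig."""

    # "all" keeps Ray Train's default sharding behavior.
    # [] disables sharding, so every worker reads the full dataset.
    datasets_to_split: Union[Literal["all"], List[str]] = "all"


def data_config_kwargs(cfg: RayDataConfigSketch) -> Dict[str, Any]:
    """Build keyword arguments meant for ray.train.DataConfig(**kwargs)."""
    return {"datasets_to_split": cfg.datasets_to_split}


# Default run: sharding stays enabled.
default_kwargs = data_config_kwargs(RayDataConfigSketch())

# Baseline comparison run: sharding explicitly disabled via config.
baseline_kwargs = data_config_kwargs(RayDataConfigSketch(datasets_to_split=[]))
```

With something along these lines, `get_ray_data_config` could return `ray.train.DataConfig(**data_config_kwargs(cfg))`, so the unsharded baseline becomes an explicit opt-in instead of a hardcoded value.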
See #63309 for more details