[Data] Deprecate Dataset.num_blocks() for non-materialized Datasets #43178
c21 merged 19 commits into ray-project:master
Signed-off-by: Scott Lee <sjl@anyscale.com>
| # ML training frameworks | ||
| xgboost==1.7.6 | ||
| git+https://github.com/ray-project/xgboost_ray.git | ||
| git+https://github.com/ray-project/xgboost_ray@5a840af05d487171883dadbfdd37b138b607bed8#egg=xgboost_ray |
| return schema.names | ||
| return None | ||
|
|
||
| def num_blocks(self) -> int: |
This API is implicitly a stable public API, right? Should we soft deprecate it before removal?
I prefer to throw an error for non-materialized datasets
python/ray/data/dataset.py
Outdated
| >>> import ray | ||
| >>> ds = ray.data.range(100) | ||
| >>> ds.repartition(10).num_blocks() | ||
| >>> ds.repartition(10)._plan.initial_num_blocks() |
If we don't want to expose num_blocks to users, I figure we don't want to expose Dataset._plan.initial_num_blocks either?
| @@ -988,7 +988,7 @@ def repartition( | |||
| Examples: | |||
Are we also planning on removing num_blocks from Dataset.__repr__?
I think only MaterializedDataset should report num_blocks.
for regular Dataset, should we throw an exception or other alternative behavior?
Like throw an exception when you repr? I think we'd just exclude the information from the output
oh yeah, i would exclude it from the repr. just meant the general case where we are calling num_blocks()
yeah, exclude num_blocks in repr and throw an error when calling num_blocks(). this sounds most reasonable.
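To illustrate the repr behavior agreed on in this thread, here is a minimal sketch of a repr builder that reports num_blocks only when the value is known. The helper `format_dataset_repr` and its parameters are hypothetical names for illustration; this is not Ray Data's actual `__repr__` implementation, and the output format is an assumption.

```python
from typing import Optional


def format_dataset_repr(
    class_name: str, num_rows: int, num_blocks: Optional[int] = None
) -> str:
    """Build a repr string, including num_blocks only when it is known.

    Hypothetical helper: a lazy dataset passes num_blocks=None, so the
    field is left out of the output instead of raising an error.
    """
    fields = []
    if num_blocks is not None:
        fields.append(f"num_blocks={num_blocks}")
    fields.append(f"num_rows={num_rows}")
    return f"{class_name}({', '.join(fields)})"


# A non-materialized dataset omits the block count entirely.
lazy_repr = format_dataset_repr("Dataset", num_rows=100)
# A materialized dataset has a fixed block count, so it is reported.
materialized_repr = format_dataset_repr(
    "MaterializedDataset", num_rows=100, num_blocks=10
)
print(lazy_repr)
print(materialized_repr)
```

Omitting the field keeps `repr()` safe to call on any dataset, while the explicit `num_blocks()` call remains the place where an error is raised.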
angelinalg left a comment
Just some nits and added some cross-referencing.
python/ray/data/dataset.py
Outdated
| """Return the number of blocks of this Dataset. | ||
|
|
||
| Note that during read and transform operations, the number of blocks | ||
| This is only implemented for :class:`~ray.data.MaterializedDataset`, |
| This is only implemented for :class:`~ray.data.MaterializedDataset`, | |
| This method is only implemented for :class:`~ray.data.MaterializedDataset`, |
python/ray/data/dataset.py
Outdated
| Note that during read and transform operations, the number of blocks | ||
| This is only implemented for :class:`~ray.data.MaterializedDataset`, | ||
| since the number of blocks may dynamically change during execution. | ||
| For instance, during read and transform operations, the number of blocks |
| For instance, during read and transform operations, the number of blocks | |
| For instance, during read and transform operations, Ray Data may dynamically adjust |
python/ray/data/dataset.py
Outdated
| This is only implemented for :class:`~ray.data.MaterializedDataset`, | ||
| since the number of blocks may dynamically change during execution. | ||
| For instance, during read and transform operations, the number of blocks | ||
| may be dynamically adjusted to respect memory limits, increasing the |
| may be dynamically adjusted to respect memory limits, increasing the | |
| the number of blocks to respect memory limits, increasing the |
python/ray/data/dataset.py
Outdated
| Time complexity: O(1) | ||
|
|
||
| Returns: | ||
| The number of blocks of this dataset. |
| The number of blocks of this dataset. | |
| The number of blocks of this :class:`Dataset`. |
python/ray/data/dataset.py
Outdated
| return self._plan.initial_num_blocks() | ||
| raise NotImplementedError( | ||
| "Number of blocks is only available for `MaterializedDataset`," | ||
| "since the number of blocks may dynamically change during execution." |
| "since the number of blocks may dynamically change during execution." | |
| "because the number of blocks may dynamically change during execution." |
python/ray/data/dataset.py
Outdated
| Time complexity: O(1) | ||
|
|
||
| Returns: | ||
| The number of blocks of this dataset. |
| The number of blocks of this dataset. | |
| The number of blocks of this :class:`Dataset`. |
python/ray/data/dataset.py
Outdated
|
|
||
| def num_blocks(self) -> int: | ||
| """Return the number of blocks of this dataset. | ||
| """Return the number of blocks of this Dataset. |
| """Return the number of blocks of this Dataset. | |
| """Return the number of blocks of this :class:`Dataset`. |
python/ray/data/dataset.py
Outdated
|
|
||
| pass | ||
| def num_blocks(self) -> int: | ||
| """Return the number of blocks of this MaterializedDataset. |
| """Return the number of blocks of this MaterializedDataset. | |
| """Return the number of blocks of this :class:`MaterializedDataset`. |
python/ray/train/gbdt_trainer.py
Outdated
| f"Dataset '{dataset_key}' has {dataset_num_blocks} blocks, " | ||
| f"which is less than the `num_workers` " | ||
| f"{self._ray_params.num_actors}. " | ||
| f"This dataset will be automatically repartitioned to " |
| f"This dataset will be automatically repartitioned to " | |
| f"This dataset is automatically repartitioned to " |
Why are these changes needed?
As part of the API simplification work in preparation for Ray Data GA, we are deprecating the Dataset.num_blocks() method. This method will only be available to MaterializedDatasets, and calling Dataset.num_blocks() on a non-materialized Dataset will result in a NotImplementedError.
Additional context behind the motivation for the change: We want to make Blocks a Ray Data internal concept, so users should typically not need to be concerned with them. Instead, the primary method of choice should be Dataset.count(), which returns the number of rows in the Dataset.
The number of blocks is still available from a method of the Dataset's internal ExecutionPlan object: ds._plan.initial_num_blocks().
Related issue number
Closes #42184
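The behavior described in this PR summary can be sketched with plain Python classes. This is a toy model under stated assumptions, not Ray Data's implementation: `ToyDataset`, `ToyMaterializedDataset`, and the `rows_per_block` parameter are hypothetical names introduced only for illustration. The sketch shows the pattern: the lazy class raises NotImplementedError from num_blocks(), the materialized subclass overrides it, and count() works on both.

```python
from typing import Iterable, List


class ToyDataset:
    """Hypothetical stand-in for a lazy, non-materialized Dataset."""

    def __init__(self, rows: Iterable[int]) -> None:
        self._rows: List[int] = list(rows)

    def count(self) -> int:
        # The row count is the user-facing measure of size.
        return len(self._rows)

    def num_blocks(self) -> int:
        # Blocks are not fixed until the data is materialized.
        raise NotImplementedError(
            "Number of blocks is only available for `MaterializedDataset`, "
            "because the number of blocks may dynamically change during "
            "execution."
        )

    def materialize(self, rows_per_block: int = 10) -> "ToyMaterializedDataset":
        # rows_per_block is an illustrative knob, not a real Ray parameter.
        blocks = [
            self._rows[i : i + rows_per_block]
            for i in range(0, len(self._rows), rows_per_block)
        ]
        return ToyMaterializedDataset(blocks)


class ToyMaterializedDataset(ToyDataset):
    """Hypothetical stand-in for MaterializedDataset: blocks are fixed."""

    def __init__(self, blocks: List[List[int]]) -> None:
        super().__init__(row for block in blocks for row in block)
        self._blocks = blocks

    def num_blocks(self) -> int:
        return len(self._blocks)


ds = ToyDataset(range(100))
print(ds.count())
try:
    ds.num_blocks()
except NotImplementedError as err:
    print(f"NotImplementedError: {err}")
print(ds.materialize(rows_per_block=10).num_blocks())
```

Callers that only need the dataset size can rely on count(); callers that genuinely need a block count materialize first and query the result.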