Skip to content

[Data] Deprecate Dataset.num_blocks() for non-materialized Datasets#43178

Merged
c21 merged 19 commits intoray-project:masterfrom
scottjlee:0214-numblocks
Feb 27, 2024
Merged

[Data] Deprecate Dataset.num_blocks() for non-materialized Datasets#43178
c21 merged 19 commits intoray-project:masterfrom
scottjlee:0214-numblocks

Conversation

@scottjlee
Copy link
Contributor

@scottjlee scottjlee commented Feb 14, 2024

Why are these changes needed?

As part of the API simplification work in preparation for Ray Data GA, we are deprecating the Dataset.num_blocks() method. This method will only be available to MaterializedDatasets, and calling Dataset.num_blocks() on a non-materialized Dataset will result in a NotImplementedError.

Additional context behind the motivation for the change: We want to make Blocks a Ray Data internal concept, so users should typically not need to be concerned with them. Instead, the primary method of choice should be Dataset.count(), which returns the number of rows in the Dataset.

The number of blocks is still available from method of the Dataset's internal ExecutionPlan object: ds._plan.initial_num_blocks().

Related issue number

Closes #42184

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Scott Lee added 3 commits February 14, 2024 12:35
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
# ML training frameworks
xgboost==1.7.6
git+https://github.com/ray-project/xgboost_ray.git
git+https://github.com/ray-project/xgboost_ray@5a840af05d487171883dadbfdd37b138b607bed8#egg=xgboost_ray
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated change?

return schema.names
return None

def num_blocks(self) -> int:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This API is implicitly a stable public API, right? Should we soft deprecate it before removal?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

up to @raulchen @c21 regarding our deprecation policy vs if we want to fast track this removal before GA.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to throw an error for non-materialized datasets

>>> import ray
>>> ds = ray.data.range(100)
>>> ds.repartition(10).num_blocks()
>>> ds.repartition(10)._plan.initial_num_blocks()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't want to expose num_blocks to user, I figure we don't want to expose Dataset._plan.initial_num_blocks either?

@@ -988,7 +988,7 @@ def repartition(
Examples:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we also planning on removing num_blocks from Dataset.__repr__?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think only MaterializedDataset should report num_blocks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for regular Dataset, should we throw an exception or other alternative behavior?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like throw an exception when you repr? I think we'd just exclude the information from the output

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh yeah, i would exclude it from the repr. just meant the general case where we are calling num_blocks()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, exclude num_blocks in repr and throw an error when calling num_blocks(). this sounds most reasonable.

Scott Lee added 2 commits February 23, 2024 21:00
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
@scottjlee scottjlee requested a review from a team as a code owner February 24, 2024 05:45
Scott Lee added 4 commits February 24, 2024 12:12
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
@scottjlee scottjlee changed the title [Data] Remove Dataset.num_blocks() [Data] Deprecate Dataset.num_blocks() for non-materialized Datasets Feb 24, 2024
Scott Lee added 2 commits February 24, 2024 14:22
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Copy link
Contributor

@angelinalg angelinalg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some nits and added some cross-referencing.

"""Return the number of blocks of this Dataset.

Note that during read and transform operations, the number of blocks
This is only implemented for :class:`~ray.data.MaterializedDataset`,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This is only implemented for :class:`~ray.data.MaterializedDataset`,
This method is only implemented for :class:`~ray.data.MaterializedDataset`,

Note that during read and transform operations, the number of blocks
This is only implemented for :class:`~ray.data.MaterializedDataset`,
since the number of blocks may dynamically change during execution.
For instance, during read and transform operations, the number of blocks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
For instance, during read and transform operations, the number of blocks
For instance, during read and transform operations, Ray Data may dynamically adjust

This is only implemented for :class:`~ray.data.MaterializedDataset`,
since the number of blocks may dynamically change during execution.
For instance, during read and transform operations, the number of blocks
may be dynamically adjusted to respect memory limits, increasing the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
may be dynamically adjusted to respect memory limits, increasing the
the number of blocks to respect memory limits, increasing the

Time complexity: O(1)

Returns:
The number of blocks of this dataset.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The number of blocks of this dataset.
The number of blocks of this :class:`Dataset`.

return self._plan.initial_num_blocks()
raise NotImplementedError(
"Number of blocks is only available for `MaterializedDataset`,"
"since the number of blocks may dynamically change during execution."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"since the number of blocks may dynamically change during execution."
"because the number of blocks may dynamically change during execution."

Time complexity: O(1)

Returns:
The number of blocks of this dataset.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The number of blocks of this dataset.
The number of blocks of this :class:`Dataset`.


def num_blocks(self) -> int:
"""Return the number of blocks of this dataset.
"""Return the number of blocks of this Dataset.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Return the number of blocks of this Dataset.
"""Return the number of blocks of this :class:`Dataset`.


pass
def num_blocks(self) -> int:
"""Return the number of blocks of this MaterializedDataset.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Return the number of blocks of this MaterializedDataset.
"""Return the number of blocks of this :class:`MaterializedDataset`.

f"Dataset '{dataset_key}' has {dataset_num_blocks} blocks, "
f"which is less than the `num_workers` "
f"{self._ray_params.num_actors}. "
f"This dataset will be automatically repartitioned to "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
f"This dataset will be automatically repartitioned to "
f"This dataset is automatically repartitioned to "

Scott Lee added 2 commits February 26, 2024 15:58
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
@c21 c21 merged commit de08484 into ray-project:master Feb 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[data] deprecate Dataset.num_blocks

5 participants