[Data] Deprecate `Dataset.num_blocks()` for non-materialized `Dataset`s by scottjlee · Pull Request #43178 · ray-project/ray

scottjlee · 2024-02-14T20:41:32Z

Why are these changes needed?

As part of the API simplification work in preparation for Ray Data GA, we are deprecating the Dataset.num_blocks() method. This method will only be available to MaterializedDatasets, and calling Dataset.num_blocks() on a non-materialized Dataset will result in a NotImplementedError.

Additional context behind the motivation for the change: We want to make Blocks a Ray Data internal concept, so users should typically not need to be concerned with them. Instead, the primary method of choice should be Dataset.count(), which returns the number of rows in the Dataset.

The number of blocks is still available through a method on the Dataset's internal ExecutionPlan object: ds._plan.initial_num_blocks().
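To illustrate the intended behavior, here is a minimal sketch using toy classes. `ToyDataset`, `ToyMaterializedDataset`, and the `block_size` parameter are hypothetical names made up for this example; this is not Ray Data's implementation, only the pattern described above: the row count is always available, while the block count is only defined once the data is materialized.

```python
from typing import List


class ToyDataset:
    """Hypothetical stand-in for a lazy dataset (not Ray's implementation)."""

    def __init__(self, rows: List[int]) -> None:
        self._rows = rows

    def count(self) -> int:
        # The row count is well defined whether or not the data is materialized.
        return len(self._rows)

    def num_blocks(self) -> int:
        # The block count may change during execution, so the lazy class refuses.
        raise NotImplementedError(
            "Number of blocks is only available for a materialized dataset."
        )

    def materialize(self, block_size: int = 25) -> "ToyMaterializedDataset":
        # Fixed-size blocks are an assumption made purely for this toy example.
        blocks = [
            self._rows[i : i + block_size]
            for i in range(0, len(self._rows), block_size)
        ]
        return ToyMaterializedDataset(self._rows, blocks)


class ToyMaterializedDataset(ToyDataset):
    """Hypothetical materialized dataset: its blocks are fixed and countable."""

    def __init__(self, rows: List[int], blocks: List[List[int]]) -> None:
        super().__init__(rows)
        self._blocks = blocks

    def num_blocks(self) -> int:
        # Safe to report: the blocks no longer change after materialization.
        return len(self._blocks)
```

With these toy classes, `ToyDataset(list(range(100))).count()` works directly, calling `num_blocks()` on the same object raises `NotImplementedError`, and calling `materialize()` first makes `num_blocks()` available.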

Related issue number

Closes #42184

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Scott Lee <sjl@anyscale.com>


bveeramani · 2024-02-16T21:33:52Z

python/requirements/ml/core-requirements.txt

 # ML training frameworks
 xgboost==1.7.6
-git+https://github.com/ray-project/xgboost_ray.git
+git+https://github.com/ray-project/xgboost_ray@5a840af05d487171883dadbfdd37b138b607bed8#egg=xgboost_ray


Unrelated change?

bveeramani · 2024-02-20T19:52:08Z

python/ray/data/dataset.py

            return schema.names
        return None

-    def num_blocks(self) -> int:


This API is implicitly a stable public API, right? Should we soft deprecate it before removal?

up to @raulchen @c21 regarding our deprecation policy vs if we want to fast track this removal before GA.

I prefer to throw an error for non-materialized datasets

bveeramani · 2024-02-20T19:54:34Z

python/ray/data/dataset.py

            >>> import ray
            >>> ds = ray.data.range(100)
-            >>> ds.repartition(10).num_blocks()
+            >>> ds.repartition(10)._plan.initial_num_blocks()


If we don't want to expose num_blocks to users, I figure we don't want to expose Dataset._plan.initial_num_blocks either?

bveeramani · 2024-02-20T19:58:53Z

python/ray/data/dataset.py

@@ -988,7 +988,7 @@ def repartition(
        Examples:


Are we also planning on removing num_blocks from Dataset.__repr__?

I think only MaterializedDataset should report num_blocks.

for regular Dataset, should we throw an exception or other alternative behavior?

Like throw an exception when you repr? I think we'd just exclude the information from the output

oh yeah, i would exclude it from the repr. just meant the general case where we are calling num_blocks()

yeah, exclude num_blocks in repr and throw an error when calling num_blocks(). this sounds most reasonable.
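The outcome of this thread can be sketched with two toy classes. `LazyDatasetSketch`, `MaterializedDatasetSketch`, and `_repr_fields` are hypothetical names invented for illustration, and the output format is an assumption; this is not how Ray Data builds its repr. The sketch only shows the agreed behavior: the lazy class leaves the block count out of its repr, and the materialized class includes it.

```python
from typing import Dict


class LazyDatasetSketch:
    """Hypothetical lazy dataset: its repr omits the block count."""

    def __init__(self, num_rows: int) -> None:
        self._num_rows = num_rows

    def _repr_fields(self) -> Dict[str, int]:
        # Only report values that are stable before execution.
        return {"num_rows": self._num_rows}

    def __repr__(self) -> str:
        fields = ", ".join(f"{k}={v}" for k, v in self._repr_fields().items())
        return f"{type(self).__name__}({fields})"


class MaterializedDatasetSketch(LazyDatasetSketch):
    """Hypothetical materialized dataset: its repr reports the block count."""

    def __init__(self, num_rows: int, num_blocks: int) -> None:
        super().__init__(num_rows)
        self._num_blocks = num_blocks

    def _repr_fields(self) -> Dict[str, int]:
        # Blocks are fixed after materialization, so they are safe to show.
        fields = {"num_blocks": self._num_blocks}
        fields.update(super()._repr_fields())
        return fields


print(repr(LazyDatasetSketch(100)))
print(repr(MaterializedDatasetSketch(100, 10)))
```

Keeping the field selection in one overridable method means the repr of the lazy class never has to touch block metadata at all.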

Signed-off-by: Scott Lee <sjl@anyscale.com>

angelinalg

Just some nits and added some cross-referencing.

angelinalg · 2024-02-26T23:28:40Z

python/ray/data/dataset.py

+        """Return the number of blocks of this Dataset.

-        Note that during read and transform operations, the number of blocks
+        This is only implemented for :class:`~ray.data.MaterializedDataset`,


Suggested change

This is only implemented for :class:`~ray.data.MaterializedDataset`,

This method is only implemented for :class:`~ray.data.MaterializedDataset`,

angelinalg · 2024-02-26T23:29:12Z

python/ray/data/dataset.py

-        Note that during read and transform operations, the number of blocks
+        This is only implemented for :class:`~ray.data.MaterializedDataset`,
+        since the number of blocks may dynamically change during execution.
+        For instance, during read and transform operations, the number of blocks


Suggested change

For instance, during read and transform operations, the number of blocks

For instance, during read and transform operations, Ray Data may dynamically adjust

angelinalg · 2024-02-26T23:29:23Z

python/ray/data/dataset.py

+        This is only implemented for :class:`~ray.data.MaterializedDataset`,
+        since the number of blocks may dynamically change during execution.
+        For instance, during read and transform operations, the number of blocks
        may be dynamically adjusted to respect memory limits, increasing the


Suggested change

may be dynamically adjusted to respect memory limits, increasing the

the number of blocks to respect memory limits, increasing the

angelinalg · 2024-02-26T23:29:37Z

python/ray/data/dataset.py

-        Time complexity: O(1)
-
        Returns:
            The number of blocks of this dataset.


Suggested change

The number of blocks of this dataset.

The number of blocks of this :class:`Dataset`.

angelinalg · 2024-02-26T23:29:51Z

python/ray/data/dataset.py

-        return self._plan.initial_num_blocks()
+        raise NotImplementedError(
+            "Number of blocks is only available for `MaterializedDataset`,"
+            "since the number of blocks may dynamically change during execution."


Suggested change

"since the number of blocks may dynamically change during execution."

"because the number of blocks may dynamically change during execution."

angelinalg · 2024-02-26T23:30:25Z

python/ray/data/dataset.py

+        Time complexity: O(1)
+
+        Returns:
+            The number of blocks of this dataset.


Suggested change

The number of blocks of this dataset.

The number of blocks of this :class:`Dataset`.

angelinalg · 2024-02-26T23:34:50Z

python/ray/data/dataset.py


    def num_blocks(self) -> int:
-        """Return the number of blocks of this dataset.
+        """Return the number of blocks of this Dataset.


Suggested change

"""Return the number of blocks of this Dataset.

"""Return the number of blocks of this :class:`Dataset`.

angelinalg · 2024-02-26T23:35:38Z

python/ray/data/dataset.py


-    pass
+    def num_blocks(self) -> int:
+        """Return the number of blocks of this MaterializedDataset.


Suggested change

"""Return the number of blocks of this MaterializedDataset.

"""Return the number of blocks of this :class:`MaterializedDataset`.

angelinalg · 2024-02-26T23:36:48Z

python/ray/train/gbdt_trainer.py

+                        f"Dataset '{dataset_key}' has {dataset_num_blocks} blocks, "
                        f"which is less than the `num_workers` "
                        f"{self._ray_params.num_actors}. "
                        f"This dataset will be automatically repartitioned to "


Suggested change

f"This dataset will be automatically repartitioned to "

f"This dataset is automatically repartitioned to "

Signed-off-by: Scott Lee <sjl@anyscale.com>

Scott Lee added 3 commits February 14, 2024 12:35

remove Dataset.num_blocks

8ec0511

Signed-off-by: Scott Lee <sjl@anyscale.com>

replace in tests

3c3eae9

Signed-off-by: Scott Lee <sjl@anyscale.com>

update docs

ea4e4a6

Signed-off-by: Scott Lee <sjl@anyscale.com>

scottjlee mentioned this pull request Feb 14, 2024

[Data] Remove Dataset.num_blocks() usages ray-project/xgboost_ray#307

Merged

scottjlee and others added 6 commits February 14, 2024 17:03

Merge branch 'master' into 0214-numblocks

ef559e0

Merge branch 'master' into 0214-numblocks

670e6cf

pin xgboost_ray to fix

122555c

Signed-off-by: Scott Lee <sjl@anyscale.com>

Merge branch 'master' into 0214-numblocks

ba11ff6

Signed-off-by: Scott Lee <sjl@anyscale.com>

Merge branch '0214-numblocks' of https://github.com/scottjlee/ray int…

8810f64

…o 0214-numblocks

pin xgboostray/lightgbmray

5c4d48a

Signed-off-by: Scott Lee <sjl@anyscale.com>

scottjlee marked this pull request as ready for review February 16, 2024 20:14

scottjlee requested review from amogkam, bveeramani, c21, ericl, omatthew98, raulchen, scv119 and stephanie-wang as code owners February 16, 2024 20:14

scottjlee assigned raulchen and bveeramani Feb 16, 2024

bveeramani reviewed Feb 20, 2024


Scott Lee added 2 commits February 23, 2024 21:00

Merge branch 'master' into 0214-numblocks

1809a9e

Signed-off-by: Scott Lee <sjl@anyscale.com>

keep num_blocks() for MaterializedDataset

fc142b9

Signed-off-by: Scott Lee <sjl@anyscale.com>

scottjlee requested a review from a team as a code owner February 24, 2024 05:45

Scott Lee added 4 commits February 24, 2024 12:12

tests

72c30db

Signed-off-by: Scott Lee <sjl@anyscale.com>

lint

1a23a73

Signed-off-by: Scott Lee <sjl@anyscale.com>

update doctests

514d7b0

Signed-off-by: Scott Lee <sjl@anyscale.com>

undo ml requirements changes

ab7384c

Signed-off-by: Scott Lee <sjl@anyscale.com>

scottjlee changed the title ~~[Data] Remove Dataset.num_blocks()~~ [Data] Deprecate Dataset.num_blocks() for non-materialized Datasets Feb 24, 2024

Scott Lee added 2 commits February 24, 2024 14:22

format

b71b612

Signed-off-by: Scott Lee <sjl@anyscale.com>

avoid passing dataset object

4f2403b

Signed-off-by: Scott Lee <sjl@anyscale.com>

raulchen approved these changes Feb 26, 2024


c21 assigned angelinalg Feb 26, 2024

angelinalg approved these changes Feb 26, 2024


Scott Lee added 2 commits February 26, 2024 15:58

address docs comments

193c6a7

Signed-off-by: Scott Lee <sjl@anyscale.com>

Merge branch 'master' into 0214-numblocks

053dc94

Signed-off-by: Scott Lee <sjl@anyscale.com>

c21 merged commit de08484 into ray-project:master Feb 27, 2024

