[Data] [2/n] - Add predicate expression support for dataset.filter #56716
Signed-off-by: Goutam V. <goutam@anyscale.com>
Code Review
This pull request adds support for predicate expressions to dataset.filter(), which is a great step towards a more unified and expressive API for data manipulation in Ray Data. The implementation is clean, follows existing patterns, and is accompanied by a comprehensive set of tests covering various scenarios. The changes in ArrowBlockAccessor and PandasBlockAccessor are correct, and the refactoring in Dataset.filter improves readability. I have one minor suggestion for code cleanup.
iamjustinhsu
left a comment
Do you have a PR link to [1/n]? I want to take a look at the predicate expression declaration.
    def filter(self, predicate_expr: "Expr") -> "pandas.DataFrame":
        """Filter rows based on a predicate expression."""
        from ray.data._expression_evaluator import eval_expr
Let's move this to _internal
I'll make this a TODO just to keep the change cleaner
Yeah, let's do it in a follow-up, but let's do that follow-up right away.
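For readers following this thread: the accessor method under review evaluates a predicate expression against a block to obtain a boolean mask, then keeps only the rows where the mask is true. Below is a minimal, stdlib-only sketch of that idea. `Col`, `Lit`, `BinOp`, `eval_mask`, and `filter_block` are hypothetical stand-ins, not Ray Data's `Expr` classes or its `eval_expr`, and the block is modeled as a plain dict of equal-length column lists rather than a pandas DataFrame.

```python
import operator
from dataclasses import dataclass
from typing import Any, Dict, List, Union

# Hypothetical column-oriented block: column name -> list of values.
Block = Dict[str, List[Any]]


@dataclass(frozen=True)
class Col:
    """Hypothetical column reference node."""
    name: str


@dataclass(frozen=True)
class Lit:
    """Hypothetical literal node."""
    value: Any


@dataclass(frozen=True)
class BinOp:
    """Hypothetical binary operation node."""
    op: str
    left: "Node"
    right: "Node"


Node = Union[Col, Lit, BinOp]

_OPS = {
    ">": operator.gt,
    "<": operator.lt,
    "==": operator.eq,
    "and": lambda a, b: bool(a) and bool(b),
    "or": lambda a, b: bool(a) or bool(b),
}


def eval_mask(expr: Node, block: Block) -> List[Any]:
    """Evaluate an expression tree row by row, returning one value per row."""
    num_rows = len(next(iter(block.values()))) if block else 0
    if isinstance(expr, Col):
        return list(block[expr.name])
    if isinstance(expr, Lit):
        # Broadcast the literal to every row.
        return [expr.value] * num_rows
    if isinstance(expr, BinOp):
        fn = _OPS[expr.op]
        left = eval_mask(expr.left, block)
        right = eval_mask(expr.right, block)
        return [fn(a, b) for a, b in zip(left, right)]
    raise TypeError(f"Unsupported expression node: {expr!r}")


def filter_block(block: Block, predicate: Node) -> Block:
    """Keep only the rows for which the predicate evaluates to a truthy value."""
    mask = eval_mask(predicate, block)
    return {
        name: [value for value, keep in zip(values, mask) if keep]
        for name, values in block.items()
    }


block = {"id": [1, 2, 3, 4], "score": [0.2, 0.9, 0.5, 0.7]}
predicate = BinOp(
    "and",
    BinOp(">", Col("id"), Lit(1)),
    BinOp(">", Col("score"), Lit(0.6)),
)
filtered = filter_block(block, predicate)
```

Separating mask evaluation from row selection is what lets the same predicate be applied to different block formats; only the selection step needs to know the container type.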
/gemini review
Code Review
This pull request adds support for predicate expressions to dataset.filter(), which is a great step towards a more unified expression system in Ray Data. The changes are well-structured, with updates to the logical and physical layers, block accessors, and the public Dataset API. The addition of comprehensive tests for the new expression functionality is particularly commendable. I've found one critical issue related to class inheritance in the logical operator definition that needs to be addressed. Otherwise, the implementation looks solid.
Force-pushed from 3d63200 to 7ff4f9e.
alexeykudinkin
left a comment
LGTM, minor comments
python/ray/data/_internal/planner/plan_expression/expression_evaluator.py
    if isinstance(elem, LiteralExpr):
        elements.append(elem.value)
    else:
        # For compatibility with Arrow visitor, we need to support non-literals
        # but Ray Data expressions may have limitations here
        raise ValueError(
            "List contains non-constant expressions. Ray Data expressions "
            "currently only support lists of constant values for 'in' operations."
        )
Wait, you don't know whether this list is going to be used in an `in` operation, right?
Good point
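One way to act on this point is to have the list-conversion step reject non-constant elements without assuming how the list will be used. The sketch below uses only Python's standard `ast` module; `convert_list_literal` and `parse_list` are hypothetical helper names, not the visitor methods in this PR. Note that under this sketch a negative number such as `-1` parses as a unary operation rather than a constant, so it would also be rejected.

```python
import ast
from typing import Any, List


def convert_list_literal(node: ast.List) -> List[Any]:
    """Convert a list literal of constants into a Python list.

    The error message describes the offending element and makes no claim
    about the operation (such as 'in') that the list will be used with.
    """
    elements = []
    for elem in node.elts:
        if not isinstance(elem, ast.Constant):
            raise ValueError(
                "List contains non-constant expressions; only lists of "
                f"constant values are supported. Got: {ast.dump(elem)}"
            )
        elements.append(elem.value)
    return elements


def parse_list(source: str) -> List[Any]:
    """Parse a source string that is expected to be a single list literal."""
    tree = ast.parse(source, mode="eval")
    if not isinstance(tree.body, ast.List):
        raise ValueError(f"Expected a list literal, got: {ast.dump(tree.body)}")
    return convert_list_literal(tree.body)


values = parse_list("[1, 'a', 2.5]")

try:
    parse_list("[1, x + 1]")
    error = None
except ValueError as exc:
    error = str(exc)
```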
python/ray/data/_internal/planner/plan_expression/expression_evaluator.py
    if isinstance(left_expr, ColumnExpr):
        return col(f"{left_expr._name}.{node.attr}")

    raise ValueError(f"Unsupported attribute access: {node.attr}")
This isn't going to be enough for us to debug it, right?
Log the whole node, plus the expr we parsed.
I'll use ast.dump on the node
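As a rough illustration of what that could look like, the sketch below builds an error message that includes the attribute name, the `ast.dump` of the offending node, and the original expression string. `describe_unsupported_attribute` is a hypothetical helper written for this example, and the exact message format is an assumption rather than what the PR ended up using.

```python
import ast


def describe_unsupported_attribute(expr_source: str) -> str:
    """Build a debuggable message for the first attribute access in an expression."""
    tree = ast.parse(expr_source, mode="eval")
    for node in ast.walk(tree):
        if isinstance(node, ast.Attribute):
            return (
                f"Unsupported attribute access: {node.attr}. "
                f"Node: {ast.dump(node)}. "
                f"Expression: {expr_source!r}"
            )
    return "No attribute access found"


# The attribute is accessed on a call result rather than on a plain column name.
message = describe_unsupported_attribute("foo().bar > 1")
```

Because `ast.dump` prints the full subtree, the message shows what the attribute was accessed on (here a `Call` node), which the attribute name alone does not reveal.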
Original PR #56716 by goutamvenkat-anyscale Original: ray-project/ray#56716
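## Why are these changes needed?

For dataset.filter() add support for predicate expressions, which is part of Ray Data's expression system and will soon replace the fn and string based expr that gets evaluated as a Pyarrow expression.

Follow up: Deprecate the usage of pyarrow's expression system.

## Related issue number

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [x] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

> [!NOTE]
> Adds native predicate-expression filtering to `Dataset.filter` (with string exprs deprecated), executed via Arrow/Pandas block accessors and updated planner/operator pipeline.
>
> - **API**:
>   - `Dataset.filter`: add `expr: Union[str, Expr]`; deprecate string expressions with warning; validate mutual exclusivity with `fn`; resource arg handling unchanged.
> - **Planner/Operators**:
>   - Logical `Filter`: accept `predicate_expr` or `fn` (exactly one); remove direct pyarrow expression dependency.
>   - Physical planning: switch to block-level filter via `BlockMapTransformFn`; UDF path unchanged.
> - **Blocks**:
>   - `ArrowBlockAccessor.filter` and `PandasBlockAccessor.filter`: evaluate `Expr` via `eval_expr` to produce boolean mask, then filter table/dataframe.
> - **Expressions**:
>   - Add native expression parsing (`ExpressionEvaluator.parse_native_expression`) and `_ConvertToNativeExpressionVisitor` to convert string filters to Ray `Expr`.
> - **Docs**:
>   - Repartition note clarification (rows wording).
> - **Tests/Build**:
>   - New `tests/test_filter.py` covering predicate expressions, parity with UDFs, block-format compatibility, and invalid cases; adjust existing tests to new `Filter` signature; add Bazel target `test_filter`.
>
> Written by Cursor Bugbot for commit 633d5e2. This will update automatically on new commits.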
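The argument handling summarized above (exactly one of `fn` or `expr`, with a warning when `expr` is a string) can be sketched in plain Python. This is an illustration under stated assumptions, not the code in `Dataset.filter`: `resolve_filter_args` is a hypothetical helper, the `DeprecationWarning` category and the message texts are guesses, and the conversion of string filters into a Ray `Expr` is left out.

```python
import warnings
from typing import Any, Callable, Optional, Tuple


def resolve_filter_args(
    fn: Optional[Callable[[Any], bool]] = None,
    expr: Optional[Any] = None,
) -> Tuple[str, Any]:
    """Validate that exactly one of `fn` or `expr` is given and classify it."""
    # Both missing or both present are equally invalid.
    if (fn is None) == (expr is None):
        raise ValueError("Exactly one of 'fn' or 'expr' must be provided.")
    if fn is not None:
        return ("udf", fn)
    if isinstance(expr, str):
        # Assumed warning category and wording, for illustration only.
        warnings.warn(
            "String expressions are deprecated; pass an expression object instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return ("string_expr", expr)
    return ("predicate_expr", expr)


kind, _ = resolve_filter_args(fn=lambda row: row["id"] > 1)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    string_kind, _ = resolve_filter_args(expr="id > 1")

try:
    resolve_filter_args()
    missing_error = None
except ValueError as exc:
    missing_error = str(exc)
```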