[Data] [2/n] - Add predicate expression support for dataset.filter by goutamvenkat-anyscale · Pull Request #56716 · ray-project/ray

goutamvenkat-anyscale · 2025-09-18T20:33:35Z

Why are these changes needed?

This PR adds support for predicate expressions to dataset.filter(). Predicate expressions are part of Ray Data's expression system and will soon replace both fn and the string-based expr, which is currently evaluated as a PyArrow expression.

Follow-up: deprecate the use of PyArrow's expression system.

Related issue number

Checks

- [x] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- [x] I've run scripts/format.sh to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
    - [x] Unit tests
    - [ ] Release tests
    - [ ] This PR is not tested :(

Note

Adds native predicate-expression filtering to Dataset.filter (with string exprs deprecated), executed via Arrow/Pandas block accessors and updated planner/operator pipeline.

API:
- Dataset.filter: add expr: Union[str, Expr]; deprecate string expressions with warning; validate mutual exclusivity with fn; resource arg handling unchanged.
Planner/Operators:
- Logical Filter: accept predicate_expr or fn (exactly one); remove direct pyarrow expression dependency.
- Physical planning: switch to block-level filter via BlockMapTransformFn; UDF path unchanged.
Blocks:
- ArrowBlockAccessor.filter and PandasBlockAccessor.filter: evaluate Expr via eval_expr to produce boolean mask, then filter table/dataframe.
Expressions:
- Add native expression parsing (ExpressionEvaluator.parse_native_expression) and _ConvertToNativeExpressionVisitor to convert string filters to Ray Expr.
Docs:
- Repartition note clarification (rows wording).
Tests/Build:
- New tests/test_filter.py covering predicate expressions, parity with UDFs, block-format compatibility, and invalid cases; adjust existing tests to new Filter signature; add Bazel target test_filter.
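
The flow summarized above (exactly one of fn or expr, expression evaluated to a boolean mask, mask applied to the block) can be sketched in plain Python. This is an illustrative sketch only, not the code in this PR: `Col`, `Lit`, `BinOp`, `eval_mask`, and `filter_block` are hypothetical names, and a block is modeled as a dict of column lists rather than an Arrow table or pandas DataFrame.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

# Hypothetical stand-in for a block: column name -> list of values.
Block = Dict[str, List[Any]]


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Lit:
    value: Any


@dataclass(frozen=True)
class BinOp:
    op: Callable[[Any, Any], Any]
    left: Any
    right: Any


def eval_mask(expr: Any, block: Block) -> List[Any]:
    """Evaluate an expression tree to one value per row of the block."""
    num_rows = len(next(iter(block.values()), []))
    if isinstance(expr, Col):
        return list(block[expr.name])
    if isinstance(expr, Lit):
        # Broadcast the literal so it lines up with the column values.
        return [expr.value] * num_rows
    if isinstance(expr, BinOp):
        left = eval_mask(expr.left, block)
        right = eval_mask(expr.right, block)
        return [expr.op(a, b) for a, b in zip(left, right)]
    raise TypeError(f"Unsupported expression node: {expr!r}")


def filter_block(
    block: Block,
    fn: Optional[Callable[[Dict[str, Any]], bool]] = None,
    expr: Any = None,
) -> Block:
    """Filter a block with either a row UDF or a predicate expression."""
    # Mirrors the "exactly one of fn or expr" validation described above.
    if (fn is None) == (expr is None):
        raise ValueError("Exactly one of 'fn' or 'expr' must be provided.")
    if expr is not None:
        mask = eval_mask(expr, block)
    else:
        names = list(block)
        rows = zip(*(block[name] for name in names))
        mask = [bool(fn(dict(zip(names, row)))) for row in rows]
    return {
        name: [value for value, keep in zip(values, mask) if keep]
        for name, values in block.items()
    }


block = {"id": [1, 2, 3, 4], "score": [0.1, 0.9, 0.5, 0.7]}
predicate = BinOp(
    operator.and_,
    BinOp(operator.gt, Col("id"), Lit(1)),
    BinOp(operator.ge, Col("score"), Lit(0.7)),
)
by_expr = filter_block(block, expr=predicate)
by_fn = filter_block(block, fn=lambda row: row["id"] > 1 and row["score"] >= 0.7)
```

Both calls keep the rows with id 2 and 4, which is the kind of UDF parity the new tests are described as checking.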

Written by Cursor Bugbot for commit 633d5e2. This will update automatically on new commits.

Signed-off-by: Goutam V. <goutam@anyscale.com>

gemini-code-assist

Code Review

This pull request adds support for predicate expressions to dataset.filter(), which is a great step towards a more unified and expressive API for data manipulation in Ray Data. The implementation is clean, follows existing patterns, and is accompanied by a comprehensive set of tests covering various scenarios. The changes in ArrowBlockAccessor and PandasBlockAccessor are correct, and the refactoring in Dataset.filter improves readability. I have one minor suggestion for code cleanup.

python/ray/data/dataset.py

iamjustinhsu

Do you have a PR link to [1/n]? I want to take a look at the predicate expression declaration.

python/ray/data/_internal/planner/plan_udf_map_op.py

python/ray/data/tests/test_map.py

python/ray/data/dataset.py

python/ray/data/_internal/logical/operators/map_operator.py

goutamvenkat-anyscale · 2025-09-19T20:21:11Z

> Do you have a PR link to [1/n]? I want to take a look at the predicate expression declaration.

#56313

python/ray/data/tests/test_map.py

python/ray/data/_internal/logical/operators/map_operator.py

alexeykudinkin · 2025-09-24T00:49:11Z

python/ray/data/_internal/pandas_block.py

+
+    def filter(self, predicate_expr: "Expr") -> "pandas.DataFrame":
+        """Filter rows based on a predicate expression."""
+        from ray.data._expression_evaluator import eval_expr


Let's move this to _internal

I'll make this a TODO just to keep the change cleaner

Yeah, let's do in a follow-up. But let's do it right away

python/ray/data/_internal/planner/plan_udf_map_op.py

python/ray/data/dataset.py

goutamvenkat-anyscale · 2025-09-24T22:32:24Z

/gemini review

gemini-code-assist

Code Review

This pull request adds support for predicate expressions to dataset.filter(), which is a great step towards a more unified expression system in Ray Data. The changes are well-structured, with updates to the logical and physical layers, block accessors, and the public Dataset API. The addition of comprehensive tests for the new expression functionality is particularly commendable. I've found one critical issue related to class inheritance in the logical operator definition that needs to be addressed. Otherwise, the implementation looks solid.

python/ray/data/_internal/logical/operators/map_operator.py

alexeykudinkin · 2025-09-25T18:45:28Z

python/ray/data/_internal/pandas_block.py

Yeah, let's do in a follow-up. But let's do it right away

python/ray/data/_internal/planner/plan_udf_map_op.py

python/ray/data/dataset.py

python/ray/data/tests/test_map.py

alexeykudinkin

LGTM, minor comments

python/ray/data/_internal/logical/operators/map_operator.py

python/ray/data/_internal/planner/plan_expression/expression_evaluator.py

alexeykudinkin · 2025-09-30T01:09:53Z

python/ray/data/_internal/planner/plan_expression/expression_evaluator.py

+            if isinstance(elem, LiteralExpr):
+                elements.append(elem.value)
+            else:
+                # For compatibility with Arrow visitor, we need to support non-literals
+                # but Ray Data expressions may have limitations here
+                raise ValueError(
+                    "List contains non-constant expressions. Ray Data expressions "
+                    "currently only support lists of constant values for 'in' operations."
+                )


Wait, you don't know if this list is going to be used in an `in` operation, right?
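
One way to read this comment: a list visitor only sees the list node, not its parent operation, so its error message should not assume `in`. A minimal sketch of that idea using Python's standard `ast` module follows. It is not the PR's visitor: `ListLiteralVisitor` and `parse_list_literal` are hypothetical names, and `ast.Constant` stands in for Ray's `LiteralExpr`.

```python
import ast
from typing import Any, List


class ListLiteralVisitor(ast.NodeVisitor):
    """Hypothetical visitor that converts a list literal of constants."""

    def visit_Constant(self, node: ast.Constant) -> Any:
        return node.value

    def visit_List(self, node: ast.List) -> List[Any]:
        elements = []
        for elem in node.elts:
            if not isinstance(elem, ast.Constant):
                # The message describes the list itself and makes no claim
                # about the operation that will consume it.
                raise ValueError(
                    "List contains non-constant expressions; only lists of "
                    f"constant values are supported: {ast.dump(node)}"
                )
            elements.append(self.visit(elem))
        return elements

    def generic_visit(self, node: ast.AST) -> Any:
        # Anything other than a list of constants is rejected in this sketch.
        raise ValueError(f"Unsupported node: {ast.dump(node)}")


def parse_list_literal(source: str) -> List[Any]:
    """Parse a string such as "[1, 2, 'a']" into a Python list."""
    tree = ast.parse(source, mode="eval")
    return ListLiteralVisitor().visit(tree.body)


values = parse_list_literal("[1, 2, 'a']")
```

A check that is specific to `in` would then belong in the code that handles the comparison, where the operator is actually known.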

python/ray/data/_internal/planner/plan_expression/expression_evaluator.py

python/ray/data/dataset.py

python/ray/data/_internal/planner/plan_expression/expression_evaluator.py

alexeykudinkin · 2025-09-30T22:04:18Z

python/ray/data/_internal/planner/plan_expression/expression_evaluator.py

+            if isinstance(left_expr, ColumnExpr):
+                return col(f"{left_expr._name}.{node.attr}")
+
+        raise ValueError(f"Unsupported attribute access: {node.attr}")


This is not gonna be enough for us to debug it, right?

Add the log of the whole node, plus expr we parsed

I'll use ast.dump on the node
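
The suggestion in this thread, putting the whole node and the parsed expression into the error, might look like the sketch below. It uses only the standard `ast` module; `column_name` and `column_name_from_source` are hypothetical helpers, and the string they return stands in for the `col(...)` call in the diff.

```python
import ast


def column_name(node: ast.AST, source: str) -> str:
    """Resolve `a.b.c` style attribute access into a dotted column name."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute) and isinstance(
        node.value, (ast.Name, ast.Attribute)
    ):
        return f"{column_name(node.value, source)}.{node.attr}"
    # Include the original expression and the full node dump so the failure
    # can be debugged from the message alone.
    raise ValueError(
        f"Unsupported attribute access in {source!r}: {ast.dump(node)}"
    )


def column_name_from_source(source: str) -> str:
    tree = ast.parse(source, mode="eval")
    return column_name(tree.body, source)


name = column_name_from_source("user.address.city")

try:
    column_name_from_source("foo().bar")
except ValueError as exc:
    message = str(exc)
```

Compared with printing only `node.attr`, the dump shows what sat on the left-hand side of the attribute access, which is usually the part that was unsupported.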

python/ray/data/dataset.py

…ay-project#56716)   ## Why are these changes needed? For dataset.filter() add support for predicate expressions, which is part of Ray Data's expression system and will soon replace the fn and string based expr that gets evaluated as a Pyarrow expression. Follow up: Deprecate the usage of pyarrow's expression system. ## Related issue number  ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :(  --- > [!NOTE] > Adds native predicate-expression filtering to `Dataset.filter` (with string exprs deprecated), executed via Arrow/Pandas block accessors and updated planner/operator pipeline. > > - **API**: > - `Dataset.filter`: add `expr: Union[str, Expr]`; deprecate string expressions with warning; validate mutual exclusivity with `fn`; resource arg handling unchanged. > - **Planner/Operators**: > - Logical `Filter`: accept `predicate_expr` or `fn` (exactly one); remove direct pyarrow expression dependency. > - Physical planning: switch to block-level filter via `BlockMapTransformFn`; UDF path unchanged. > - **Blocks**: > - `ArrowBlockAccessor.filter` and `PandasBlockAccessor.filter`: evaluate `Expr` via `eval_expr` to produce boolean mask, then filter table/dataframe. > - **Expressions**: > - Add native expression parsing (`ExpressionEvaluator.parse_native_expression`) and `_ConvertToNativeExpressionVisitor` to convert string filters to Ray `Expr`. 
> - **Docs**: > - Repartition note clarification (rows wording). > - **Tests/Build**: > - New `tests/test_filter.py` covering predicate expressions, parity with UDFs, block-format compatibility, and invalid cases; adjust existing tests to new `Filter` signature; add Bazel target `test_filter`. > > Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 633d5e2. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).  --------- Signed-off-by: Goutam V. <goutam@anyscale.com> Signed-off-by: Seiji Eicher <seiji@anyscale.com>

Original PR #56716 by goutamvenkat-anyscale Original: ray-project/ray#56716

…t for dataset.filter Merged from original PR #56716 Original: ray-project/ray#56716

[Data] [2/2] - Add predicate expression support for dataset.filter

8db6b11

Signed-off-by: Goutam V. <goutam@anyscale.com>

goutamvenkat-anyscale requested a review from a team as a code owner September 18, 2025 20:33

goutamvenkat-anyscale added the "data" label (Ray Data-related issues) Sep 18, 2025

goutamvenkat-anyscale changed the title ~~[Data] [2/2] - Add predicate expression support for dataset.filter~~ [Data] [2/n] - Add predicate expression support for dataset.filter Sep 18, 2025

gemini-code-assist bot reviewed Sep 18, 2025

python/ray/data/dataset.py

goutamvenkat-anyscale added the "go" label (add ONLY when ready to merge, run all tests) Sep 18, 2025

merge master + conflicts

c664348

Signed-off-by: Goutam V. <goutam@anyscale.com>

iamjustinhsu reviewed Sep 19, 2025

Fix doclint + address comments

27be0de

Signed-off-by: Goutam V. <goutam@anyscale.com>

iamjustinhsu approved these changes Sep 19, 2025

python/ray/data/tests/test_map.py (outdated)

alexeykudinkin reviewed Sep 24, 2025

Address comments

096051c

Signed-off-by: Goutam V. <goutam@anyscale.com>

fix test

2d84e77

Signed-off-by: Goutam V. <goutam@anyscale.com>

gemini-code-assist bot reviewed Sep 24, 2025

alexeykudinkin reviewed Sep 25, 2025

goutamvenkat-anyscale added 2 commits September 26, 2025 11:08

Merge branch 'master' into goutam/predicate_expr_filter_api

d250215

Respond to comments

7ff4f9e

Signed-off-by: Goutam V. <goutam@anyscale.com>

goutamvenkat-anyscale force-pushed the goutam/predicate_expr_filter_api branch from 3d63200 to 7ff4f9e Compare September 26, 2025 18:34

goutamvenkat-anyscale added 4 commits September 26, 2025 16:12

Merge branch 'master' into goutam/predicate_expr_filter_api

7758d67

Cleanup + test fix

735faa6

Signed-off-by: Goutam V. <goutam@anyscale.com>

Merge branch 'master' into goutam/predicate_expr_filter_api

b85fb1a

Fix test

9884ff8

Signed-off-by: Goutam V. <goutam@anyscale.com>

Fix bazel file

1f6872c

Signed-off-by: Goutam V. <goutam@anyscale.com>

alexeykudinkin reviewed Sep 30, 2025

Address comments

86fb533

Signed-off-by: Goutam V. <goutam@anyscale.com>

alexeykudinkin approved these changes Sep 30, 2025

Comments

633d5e2

Signed-off-by: Goutam V. <goutam@anyscale.com>

alexeykudinkin merged commit f406785 into ray-project:master Oct 1, 2025
6 checks passed

snorkelopstesting2-coder mentioned this pull request Oct 22, 2025

[Data] [2/n] - Add predicate expression support for dataset.filter snorkel-marlin-repos/ray-project_ray_pr_56716_d29cb404-654d-4b7e-ace2-e77d8a223682#1

Merged

3 participants
