[Data] [2/n] - Add predicate expression support for dataset.filter #56716

Merged
alexeykudinkin merged 14 commits into ray-project:master from goutamvenkat-anyscale:goutam/predicate_expr_filter_api on Oct 1, 2025

Conversation

@goutamvenkat-anyscale
Contributor

@goutamvenkat-anyscale goutamvenkat-anyscale commented Sep 18, 2025

Why are these changes needed?

For dataset.filter(), add support for predicate expressions, which are part of Ray Data's expression system and will soon replace the fn and the string-based expr that gets evaluated as a PyArrow expression.

Follow up: Deprecate the usage of pyarrow's expression system.

Related issue number

Checks

  • [x] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • [x] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    • [ ] I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [x] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

Note

Adds native predicate-expression filtering to Dataset.filter (with string exprs deprecated), executed via Arrow/Pandas block accessors and updated planner/operator pipeline.

  • API:
    • Dataset.filter: add expr: Union[str, Expr]; deprecate string expressions with warning; validate mutual exclusivity with fn; resource arg handling unchanged.
  • Planner/Operators:
    • Logical Filter: accept predicate_expr or fn (exactly one); remove direct pyarrow expression dependency.
    • Physical planning: switch to block-level filter via BlockMapTransformFn; UDF path unchanged.
  • Blocks:
    • ArrowBlockAccessor.filter and PandasBlockAccessor.filter: evaluate Expr via eval_expr to produce boolean mask, then filter table/dataframe.
  • Expressions:
    • Add native expression parsing (ExpressionEvaluator.parse_native_expression) and _ConvertToNativeExpressionVisitor to convert string filters to Ray Expr.
  • Docs:
    • Repartition note clarification (rows wording).
  • Tests/Build:
    • New tests/test_filter.py covering predicate expressions, parity with UDFs, block-format compatibility, and invalid cases; adjust existing tests to new Filter signature; add Bazel target test_filter.
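
The flow summarized in the note above can be sketched without any library: a predicate expression is built through operator overloading, evaluated against a block to produce a boolean mask, and the mask then selects rows. Everything in this sketch (Col, Lit, BinaryExpr, filter_block, and the dict-of-lists block) is a hypothetical stand-in chosen for illustration, not Ray Data's actual classes or block accessors. Only the name eval_expr and the "exactly one of fn or expr" rule are taken from the summary.

```python
import operator
from typing import Any, Callable, Dict, List, Optional

# Toy columnar block (assumption): column name -> list of values.
Block = Dict[str, List[Any]]


class Expr:
    """Operators build an expression tree instead of computing a value."""

    def __gt__(self, other: Any) -> "Expr":
        return BinaryExpr(operator.gt, self, _wrap(other))

    def __lt__(self, other: Any) -> "Expr":
        return BinaryExpr(operator.lt, self, _wrap(other))

    def __and__(self, other: Any) -> "Expr":
        return BinaryExpr(operator.and_, self, _wrap(other))


class Col(Expr):
    def __init__(self, name: str) -> None:
        self.name = name


class Lit(Expr):
    def __init__(self, value: Any) -> None:
        self.value = value


class BinaryExpr(Expr):
    def __init__(self, op: Callable[[Any, Any], Any], left: Expr, right: Expr) -> None:
        self.op, self.left, self.right = op, left, right


def _wrap(value: Any) -> Expr:
    """Promote plain Python values to literal expressions."""
    return value if isinstance(value, Expr) else Lit(value)


def eval_expr(expr: Expr, block: Block) -> List[Any]:
    """Evaluate an expression to one value per row of the block."""
    num_rows = len(next(iter(block.values()), []))
    if isinstance(expr, Col):
        return block[expr.name]
    if isinstance(expr, Lit):
        return [expr.value] * num_rows
    if isinstance(expr, BinaryExpr):
        left, right = eval_expr(expr.left, block), eval_expr(expr.right, block)
        return [expr.op(a, b) for a, b in zip(left, right)]
    raise TypeError(f"Unsupported expression: {expr!r}")


def filter_block(
    block: Block,
    fn: Optional[Callable[[Dict[str, Any]], bool]] = None,
    expr: Optional[Expr] = None,
) -> Block:
    """Keep the rows selected by exactly one of a row UDF or a predicate expression."""
    if (fn is None) == (expr is None):
        raise ValueError("Exactly one of 'fn' or 'expr' must be provided.")
    if expr is not None:
        mask = eval_expr(expr, block)  # boolean mask, one entry per row
    else:
        rows = [dict(zip(block, vals)) for vals in zip(*block.values())]
        mask = [bool(fn(row)) for row in rows]
    return {name: [v for v, keep in zip(vals, mask) if keep] for name, vals in block.items()}


block = {"id": [0, 1, 2, 3, 4], "name": ["a", "b", "c", "d", "e"]}
print(filter_block(block, expr=(Col("id") > 1) & (Col("id") < 4)))
# prints {'id': [2, 3], 'name': ['c', 'd']}
```

The parentheses around each comparison matter because & binds more tightly than > and < in Python. The UDF path produces the same rows for an equivalent predicate, which is the parity property the new tests are described as covering.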

Written by Cursor Bugbot for commit 633d5e2. This will update automatically on new commits. Configure here.

Signed-off-by: Goutam V. <goutam@anyscale.com>
@goutamvenkat-anyscale goutamvenkat-anyscale requested a review from a team as a code owner September 18, 2025 20:33
@goutamvenkat-anyscale goutamvenkat-anyscale added the data Ray Data-related issues label Sep 18, 2025
@goutamvenkat-anyscale goutamvenkat-anyscale changed the title [Data] [2/2] - Add predicate expression support for dataset.filter [Data] [2/n] - Add predicate expression support for dataset.filter Sep 18, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds support for predicate expressions to dataset.filter(), which is a great step towards a more unified and expressive API for data manipulation in Ray Data. The implementation is clean, follows existing patterns, and is accompanied by a comprehensive set of tests covering various scenarios. The changes in ArrowBlockAccessor and PandasBlockAccessor are correct, and the refactoring in Dataset.filter improves readability. I have one minor suggestion for code cleanup.

@goutamvenkat-anyscale goutamvenkat-anyscale added the go add ONLY when ready to merge, run all tests label Sep 18, 2025
Signed-off-by: Goutam V. <goutam@anyscale.com>
Contributor

@iamjustinhsu iamjustinhsu left a comment

Do u have a PR link to [1/n]? Want to take a look at the Predicate Expression Declaration

@goutamvenkat-anyscale
Contributor Author

Do u have a PR link to [1/n]? Want to take a look at the Predicate Expression Declaration

#56313

Signed-off-by: Goutam V. <goutam@anyscale.com>

def filter(self, predicate_expr: "Expr") -> "pandas.DataFrame":
    """Filter rows based on a predicate expression."""
    from ray.data._expression_evaluator import eval_expr
Contributor

Let's move this to _internal

Contributor Author

I'll make this a TODO just to keep the change cleaner

Contributor

Yeah, let's do in a follow-up. But let's do it right away

Signed-off-by: Goutam V. <goutam@anyscale.com>
Signed-off-by: Goutam V. <goutam@anyscale.com>
@goutamvenkat-anyscale
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds support for predicate expressions to dataset.filter(), which is a great step towards a more unified expression system in Ray Data. The changes are well-structured, with updates to the logical and physical layers, block accessors, and the public Dataset API. The addition of comprehensive tests for the new expression functionality is particularly commendable. I've found one critical issue related to class inheritance in the logical operator definition that needs to be addressed. Otherwise, the implementation looks solid.


@goutamvenkat-anyscale goutamvenkat-anyscale force-pushed the goutam/predicate_expr_filter_api branch from 3d63200 to 7ff4f9e Compare September 26, 2025 18:34
Signed-off-by: Goutam V. <goutam@anyscale.com>
Contributor

@alexeykudinkin alexeykudinkin left a comment

LGTM, minor comments

Comment on lines 350 to 358
if isinstance(elem, LiteralExpr):
    elements.append(elem.value)
else:
    # For compatibility with Arrow visitor, we need to support non-literals
    # but Ray Data expressions may have limitations here
    raise ValueError(
        "List contains non-constant expressions. Ray Data expressions "
        "currently only support lists of constant values for 'in' operations."
    )
Contributor

Wait, you don't know if this list is gonna be used in in, right?

Contributor Author

Good point
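
One way to act on the point raised in this thread is to let the list visitor only convert its elements and to defer the constants-only check to the place where the `in` operator is actually seen. The sketch below is an assumption about how that could look, written against Python's standard ast module; the classes Col, Lit, Compare, IsIn, the visitor ToExprVisitor, and parse_filter are hypothetical names and do not reproduce the PR's _ConvertToNativeExpressionVisitor.

```python
import ast
from typing import Any, List


class Col:
    def __init__(self, name: str) -> None:
        self.name = name


class Lit:
    def __init__(self, value: Any) -> None:
        self.value = value


class Compare:
    def __init__(self, op: str, left: Any, right: Any) -> None:
        self.op, self.left, self.right = op, left, right


class IsIn:
    def __init__(self, operand: Any, values: List[Any]) -> None:
        self.operand, self.values = operand, values


_COMPARE_OPS = {ast.Gt: ">", ast.Lt: "<", ast.Eq: "=="}


class ToExprVisitor(ast.NodeVisitor):
    """Convert a small subset of Python expression syntax into a toy expression tree."""

    def visit_Expression(self, node: ast.Expression) -> Any:
        return self.visit(node.body)

    def visit_Name(self, node: ast.Name) -> Col:
        return Col(node.id)

    def visit_Constant(self, node: ast.Constant) -> Lit:
        return Lit(node.value)

    def visit_List(self, node: ast.List) -> List[Any]:
        # Only convert the elements here. A list is not necessarily an `in`
        # operand, so the constants-only rule is enforced in visit_Compare.
        return [self.visit(elt) for elt in node.elts]

    def visit_Compare(self, node: ast.Compare) -> Any:
        if len(node.ops) != 1:
            raise ValueError(f"Chained comparisons are not supported: {ast.dump(node)}")
        left = self.visit(node.left)
        right = self.visit(node.comparators[0])
        op = node.ops[0]
        if isinstance(op, ast.In):
            if not isinstance(right, list) or not all(isinstance(e, Lit) for e in right):
                raise ValueError("'in' currently supports only lists of constant values.")
            return IsIn(left, [e.value for e in right])
        if type(op) in _COMPARE_OPS:
            return Compare(_COMPARE_OPS[type(op)], left, right)
        raise ValueError(f"Unsupported comparison operator: {ast.dump(op)}")

    def generic_visit(self, node: ast.AST) -> Any:
        # Anything without an explicit visit_* method is rejected.
        raise ValueError(f"Unsupported syntax: {ast.dump(node)}")


def parse_filter(source: str) -> Any:
    return ToExprVisitor().visit(ast.parse(source, mode="eval"))


expr = parse_filter("label in ['a', 'b']")
print(type(expr).__name__, expr.operand.name, expr.values)
# prints IsIn label ['a', 'b']
```

With this arrangement, the error message mentioning 'in' is raised only when the list really is the right-hand side of an `in` comparison.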

Signed-off-by: Goutam V. <goutam@anyscale.com>
if isinstance(left_expr, ColumnExpr):
    return col(f"{left_expr._name}.{node.attr}")

raise ValueError(f"Unsupported attribute access: {node.attr}")
Contributor

This is not gonna be enough for us to debug it, right?

Add the log of the whole node, plus expr we parsed

Contributor Author

I'll use ast.dump on the node
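
A small sketch of what an error message built with ast.dump might look like for the attribute-access case discussed here. The helper convert_attribute and its source parameter are hypothetical and work directly on ast nodes rather than on the PR's ColumnExpr objects; only ast.parse and ast.dump are real standard-library calls.

```python
import ast


def convert_attribute(node: ast.Attribute, source: str) -> str:
    """Return a dotted column name for `name.attr`, or raise with full context."""
    if isinstance(node.value, ast.Name):
        return f"{node.value.id}.{node.attr}"
    # Include the whole node and the original expression so the failure
    # can be debugged from the message alone.
    raise ValueError(
        f"Unsupported attribute access: {node.attr}. "
        f"Node: {ast.dump(node)}. Expression: {source!r}"
    )


source = "f(x).y"
try:
    convert_attribute(ast.parse(source, mode="eval").body, source)
except ValueError as exc:
    print(exc)
```

The dumped node shows that the attribute's base is a Call rather than a Name, which is the detail the bare attribute name alone would not reveal.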

Signed-off-by: Goutam V. <goutam@anyscale.com>
@alexeykudinkin alexeykudinkin merged commit f406785 into ray-project:master Oct 1, 2025
6 checks passed
eicherseiji pushed a commit to eicherseiji/ray that referenced this pull request Oct 6, 2025
dstrodtman pushed a commit to dstrodtman/ray that referenced this pull request Oct 6, 2025
liulehui pushed a commit to liulehui/ray that referenced this pull request Oct 9, 2025
joshkodi pushed a commit to joshkodi/ray that referenced this pull request Oct 13, 2025
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
snorkelopstesting2-coder pushed a commit to snorkel-marlin-repos/ray-project_ray_pr_56716_d29cb404-654d-4b7e-ace2-e77d8a223682 that referenced this pull request Oct 22, 2025
snorkelopstesting2-coder added a commit to snorkel-marlin-repos/ray-project_ray_pr_56716_d29cb404-654d-4b7e-ace2-e77d8a223682 that referenced this pull request Oct 22, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests

3 participants