[Data] Lance predicate pushdown #61400
Conversation
Signed-off-by: Peter Nguyen <petern0408@gmail.com>
Code Review
This pull request adds predicate pushdown support for Lance datasets, allowing filters from the .filter() API to be pushed down to the read layer for better performance. It also deprecates the filter argument in read_lance, encouraging users to use the more idiomatic dataframe API. The implementation correctly combines predicates from both sources if provided. The changes are well-tested. I've found a potential bug related to in-place modification of scanner_options and a minor typo in a user-facing warning message. Overall, this is a good improvement.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Peter Nguyen <petern0408@gmail.com>
str(self._predicate_expr.to_pyarrow())
if self._predicate_expr is not None
else None
)
Fragile string conversion of pyarrow Expression for Lance filter
Medium Severity
The predicate expression is converted via str(self._predicate_expr.to_pyarrow()), but both the Parquet and CSV datasources pass the pyarrow.compute.Expression object directly without calling str(). The str() of a pyarrow Expression may include a wrapper like <pyarrow.compute.Expression ...>, which would not be valid SQL for Lance's scanner. Since Lance's scanner natively accepts pa.compute.Expression objects in addition to SQL strings, the expression object could be passed directly when filter_from_arg is None, avoiding any string format fragility.
tl;dr: Getting pa.compute.Expression to work is very complex, and I think using strings instead is a fine and safe approach.
I tried passing in PyArrow expressions at first, but ran into some trouble. I found that Lance converts the pyarrow.compute.Expression into Substrait, which results in the following error.
/ray/.venv/lib/python3.10/site-packages/lance/dataset.py", line 4775, in filter
substrait_filter = serialize_expressions(
File "pyarrow/_substrait.pyx", line 353, in pyarrow._substrait.serialize_expressions
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Substrait is only capable of representing unsafe casts
Apparently, Ray Data's Expr.to_pyarrow() results in an expression that is not compatible with Substrait due to casting reasons.
Calling str() on this seems to be a clean way to get this to work, and I believe it's a safe approach:
- Lance seems to fall back to str()-ing the pyarrow.compute.Expression for this exact case anyway when Substrait isn't installed (here). It also converts pc.Expressions to str()s in various other places, like here.
- lance-spark also uses SQL strings instead of Substrait here.
Even if it did work, performing the AND between the pc.Expression and the other str predicate in this PR's code is easier through strings (there doesn't seem to be an easy way to convert the filter str into a pc.Expression boolean predicate).


Description
- Pushes down predicates specified via .filter() to the Lance read layer for data skipping when reading
- Deprecates the filter argument in the existing read_lance() API in favor of encouraging users to specify filters using the dataframe API (.filter()). The same was done for read_parquet() when adding support for parquet predicate pushdown in [Data] - Add Predicate Pushdown Rule #58150
- If predicates are specified both ways, they are AND'd together so they're both pushed down

Related issues
Fixes #61399
Additional information
Lance's scanner() API has a filter argument (here) where we can specify predicates to be pushed down into the read.

Currently, we already push down predicates specified in read_lance()'s filter argument (it saves the filter argument here and then passes it into scanner here). However, it did not push down predicates specified from the dataframe API .filter(), which is generally a better practice. This PR adds that support.