[Data] Lance predicate pushdown#61400

Open
petern48 wants to merge 3 commits into ray-project:master from petern48:lance_predicate_pushdown

Conversation

@petern48
Contributor

Description

  • Push down predicates specified in .filter() to the Lance scanner for data skipping when reading.
  • Additionally, this PR deprecates the filter argument in the existing read_lance() API, encouraging users to specify filters through the dataframe API (.filter()) instead. The same was done for read_parquet() when parquet predicate pushdown was added in [Data] - Add Predicate Pushdown Rule #58150
    • If predicates are specified both ways, they are AND'd together so that both are pushed down
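The AND-combination described above can be sketched in plain Python. This is an illustrative sketch only: the helper name combine_filters and the exact parenthesization are assumptions, not the PR's actual code.

```python
# Hypothetical helper illustrating how the deprecated `filter` argument
# and a `.filter()` predicate could be AND'd into one pushdown filter.
def combine_filters(arg_filter, expr_filter):
    """AND together two optional SQL-style predicate strings."""
    filters = [f for f in (arg_filter, expr_filter) if f is not None]
    if not filters:
        return None  # nothing to push down
    if len(filters) == 1:
        return filters[0]
    # Parenthesize each side so operator precedence inside either
    # predicate cannot change the meaning of the conjunction.
    return " AND ".join(f"({f})" for f in filters)

print(combine_filters("year >= 2020", "country = 'US'"))
# (year >= 2020) AND (country = 'US')
```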

Related issues

Fixes #61399

Additional information

Lance's scanner() API has a filter argument (here) that accepts predicates to be pushed down into the read.

Currently, we already push down predicates specified in read_lance()'s filter argument. (It saves the filter argument here and then passes it into scanner here.)

However, predicates specified through the dataframe API .filter(), which is generally the better practice, were not pushed down. This PR adds that support.

Signed-off-by: Peter Nguyen <petern0408@gmail.com>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds predicate pushdown support for Lance datasets, allowing filters from the .filter() API to be pushed down to the read layer for better performance. It also deprecates the filter argument in read_lance, encouraging users to use the more idiomatic dataframe API. The implementation correctly combines predicates from both sources if provided. The changes are well-tested. I've found a potential bug related to in-place modification of scanner_options and a minor typo in a user-facing warning message. Overall, this is a good improvement.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Peter Nguyen <petern0408@gmail.com>
@petern48 petern48 marked this pull request as ready for review February 28, 2026 17:59
@petern48 petern48 requested a review from a team as a code owner February 28, 2026 17:59

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

(
    str(self._predicate_expr.to_pyarrow())
    if self._predicate_expr is not None
    else None
)


Fragile string conversion of pyarrow Expression for Lance filter

Medium Severity

The predicate expression is converted via str(self._predicate_expr.to_pyarrow()), but both the Parquet and CSV datasources pass the pyarrow.compute.Expression object directly without calling str(). The str() of a pyarrow Expression may include a wrapper like <pyarrow.compute.Expression ...>, which would not be valid SQL for Lance's scanner. Since Lance's scanner natively accepts pa.compute.Expression objects in addition to SQL strings, the expression object could be passed directly when filter_from_arg is None, avoiding any string format fragility.


Contributor Author


tl;dr: Getting pa.compute.Expression to work is very complex; using strs instead is fine and a safe approach.

I tried passing in PyArrow expressions at first, but ran into some trouble: Lance converts the pyarrow.compute.Expression into Substrait, which results in the following error.

/ray/.venv/lib/python3.10/site-packages/lance/dataset.py", line 4775, in filter
    substrait_filter = serialize_expressions(
  File "pyarrow/_substrait.pyx", line 353, in pyarrow._substrait.serialize_expressions
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Substrait is only capable of representing unsafe casts

Apparently, Ray Data's Expr.to_pyarrow() produces an expression that is not compatible with Substrait because of how its casts are represented.

Calling str() on this seems to be a clean way to get this to work, and I believe it's a safe approach:

  • Lance seems to fall back to str()-ing the pyarrow.compute.Expressions for this exact case anyway when Substrait isn't installed (here). It also converts pc.Expressions to str()s in various other places, like here
  • lance-spark also uses SQL strings instead of substrait here

Even if it did work, performing the AND between the pc.Expression and the other str predicate for the code in this PR is easier through strings (there doesn't seem to be an easy way to convert the str predicate to a pc.Expression bool predicate).

@ray-gardener ray-gardener bot added data Ray Data-related issues community-contribution Contributed by the community labels Feb 28, 2026

Development

Successfully merging this pull request may close these issues.

[Data] Support predicate pushdown for Lance
