[Data] Lance predicate pushdown #61400
Conversation
Signed-off-by: Peter Nguyen <petern0408@gmail.com>
Code Review
This pull request adds predicate pushdown support for Lance datasets, allowing filters from the .filter() API to be pushed down to the read layer for better performance. It also deprecates the filter argument in read_lance, encouraging users to use the more idiomatic dataframe API. The implementation correctly combines predicates from both sources if provided. The changes are well-tested. I've found a potential bug related to in-place modification of scanner_options and a minor typo in a user-facing warning message. Overall, this is a good improvement.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Peter Nguyen <petern0408@gmail.com>
str(self._predicate_expr.to_pyarrow())
if self._predicate_expr is not None
else None
)
Fragile string conversion of pyarrow Expression for Lance filter
Medium Severity
The predicate expression is converted via str(self._predicate_expr.to_pyarrow()), but both the Parquet and CSV datasources pass the pyarrow.compute.Expression object directly without calling str(). The str() of a pyarrow Expression may include a wrapper like <pyarrow.compute.Expression ...>, which would not be valid SQL for Lance's scanner. Since Lance's scanner natively accepts pa.compute.Expression objects in addition to SQL strings, the expression object could be passed directly when filter_from_arg is None, avoiding any string format fragility.
tl;dr: Getting pa.compute.Expression to work is very complex, and I think using strings instead is a fine and safe approach.
I tried passing in PyArrow expressions at first, but ran into some trouble. I found that Lance converts the pyarrow.compute.Expression into Substrait, which results in the following error.
/ray/.venv/lib/python3.10/site-packages/lance/dataset.py", line 4775, in filter
substrait_filter = serialize_expressions(
File "pyarrow/_substrait.pyx", line 353, in pyarrow._substrait.serialize_expressions
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Substrait is only capable of representing unsafe casts
Apparently, Ray Data's Expr.to_pyarrow() results in an expression that is not compatible with Substrait due to casting reasons.
Calling str() on this seems to be a clean way to get this to work, and I believe it's a safe approach:
- Lance seems to fall back to str()-ing the pyarrow.compute.Expression for this exact case anyway when Substrait isn't installed (here). It also converts pc.Expressions to str()s in various other places, like here.
- lance-spark also uses SQL strings instead of Substrait here.
Even if it did work, performing the AND between the pc.Expression and the other str predicate in this PR's code is easier through strings (there doesn't seem to be an easy way to convert the filter str into a pc.Expression boolean predicate).


Description
- Pushes down predicates specified via .filter() to the Lance read layer for data skipping when reading
- Deprecates the filter argument in the existing read_lance() API in favor of encouraging users to specify filters using the dataframe API (.filter()). The same was done for read_parquet() when adding support for parquet predicate pushdown in [Data] - Add Predicate Pushdown Rule #58150
- If predicates are specified both ways, they are AND'd together so they're both pushed down

Related issues
Fixes #61399
Additional information
Lance's scanner() API has a filter argument (here) where we can specify predicates to be pushed down into the read.

Currently, we already push down predicates specified in read_lance()'s filter argument (it saves the filter argument here and then passes it into scanner here). However, it did not push down predicates specified from the dataframe API .filter(), which is generally a better practice. This PR adds that support.