[FEA] Add option for JIT filter for the row-wise post-filter in read_parquet #19384

@GregoryKimball

Description

Is your feature request related to a problem? Please describe.
I wish we had a parquet reader option to perform the row-wise post-filter using the JIT filter API.

Describe the solution you'd like
The parquet reader would continue to accept a filter expressed as a cuDF AST. We could then add another reader option to enable JIT filtering for the final post-filter, while continuing to use AST-based filtering for row group stats.

For the post-filter, today we use compute_column and apply_boolean_mask on the materialized table before returning the row-wise filtered table. We need this step because even when a row group's min/max stats pass the filter predicate, many of the individual rows in that row group may not.
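The current AST-based post-filter amounts to evaluating the predicate into a boolean column, then keeping only the passing rows. A minimal pure-Python sketch of that two-step flow, with plain lists standing in for device columns and a callable standing in for the cuDF AST:

```python
# Toy model of the parquet post-filter: evaluate a predicate row-wise into
# a boolean mask (the role of libcudf's compute_column), then keep only the
# passing rows (the role of apply_boolean_mask). Pure Python, illustrative only.

def compute_mask(table, predicate):
    """Evaluate `predicate` (a callable standing in for a cuDF AST) on each
    row, producing a boolean mask column."""
    n = len(next(iter(table.values())))
    return [predicate({name: col[i] for name, col in table.items()})
            for i in range(n)]

def apply_boolean_mask(table, mask):
    """Keep only the rows where the mask is True."""
    return {name: [v for v, keep in zip(col, mask) if keep]
            for name, col in table.items()}

# A row group whose min/max stats (price in [5, 90]) pass the predicate
# price > 10, even though half of its rows do not -- hence the post-filter.
table = {"price": [5, 20, 7, 90], "qty": [1, 2, 3, 4]}
mask = compute_mask(table, lambda row: row["price"] > 10)
filtered = apply_boolean_mask(table, mask)
# filtered == {"price": [20, 90], "qty": [2, 4]}
```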

To do the post-filter using a JIT-compiled filter, we would need to first create a utility that converts a cuDF AST into a raw string UDF. This probably means a tree traversal and code gen to build up the raw string, plus some extra plumbing to figure out the best way to manage scalars and columns. Presumably this code gen utility would go into a public or detail JIT utilities location. If the AST-to-UDF utility turns out to be too complicated we may need to revisit our approach.
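The tree traversal and code gen described above can be sketched with a toy expression tree. Everything here is hypothetical, including the node types, operator spellings, and the UDF signature; the real utility would walk cuDF AST expression nodes and handle the full set of operators and types:

```python
# Hedged sketch of the AST-to-UDF conversion: a recursive traversal of a
# tiny expression tree that emits a CUDA-C-style source string. The node
# classes and the `pred` signature are illustrative assumptions only.

from dataclasses import dataclass

@dataclass
class ColumnRef:
    index: int

@dataclass
class Literal:
    value: object

@dataclass
class Operation:
    op: str          # e.g. "&&", ">", "<"
    operands: tuple

def emit(node) -> str:
    """Recursively generate a C expression string for one AST node."""
    if isinstance(node, ColumnRef):
        return f"c{node.index}"      # column input parameter
    if isinstance(node, Literal):
        return repr(node.value)      # scalar inlined as a literal
    if isinstance(node, Operation):
        lhs, rhs = (emit(o) for o in node.operands)
        return f"({lhs} {node.op} {rhs})"
    raise TypeError(node)

def ast_to_udf(root) -> str:
    """Wrap the generated expression in a hypothetical filter-UDF body."""
    return (f"__device__ bool pred(double c0, double c1) "
            f"{{ return {emit(root)}; }}")

# (c0 > 10) && (c1 < 5.0)
expr = Operation("&&", (Operation(">", (ColumnRef(0), Literal(10))),
                        Operation("<", (ColumnRef(1), Literal(5.0)))))
print(ast_to_udf(expr))
# __device__ bool pred(double c0, double c1) { return ((c0 > 10) && (5.0 > c1) ... see test
```

The extra plumbing for scalars and columns mentioned above would decide, for each leaf, whether it becomes a kernel parameter or an inlined literal; this sketch simply inlines literals.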

Adding JIT transforms or JIT filters could also be useful for row group and page stats processing, but we should figure that out after this first step.

Describe alternatives you've considered
We could instead target the compute_column API, providing a JIT-based compute column that does the code gen based on the cuDF AST. We would still need an AST-to-UDF conversion utility, but this might be a broader way to test JIT options across cuDF.

The downsides of plugging into compute_column are:

  • It would require changes to query operators to swap between AST and JIT compute_column, whereas the post-filter in read_parquet can be treated as an implementation detail.
  • compute_column may lead to excessive JIT compilations per query (one for the row group filter, one for the post-filter, and one for each filter/project expression), whereas the read_parquet option could create up to one kernel per table in most cases.
  • compute_column could be called on tiny amounts of data, e.g. post-reduction. By contrast, the parquet post-filter will generally have a large row count.

Additional context
We probably want to add an environment variable to toggle the JIT filter option in read_parquet so that testing is easy for cudf-polars and possibly Spark-RAPIDS.
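The toggle might look like the following sketch, where the reader option takes precedence and the environment variable acts as a fallback for easy testing. The variable name LIBCUDF_JIT_PARQUET_FILTER is an invented placeholder, not an existing cudf setting:

```python
# Hypothetical sketch of an environment-variable toggle for the JIT
# post-filter. The variable name is an assumption for illustration only.
import os

def use_jit_post_filter(reader_option=None):
    """Explicit reader option wins if set; otherwise fall back to the env var."""
    if reader_option is not None:
        return reader_option
    return os.environ.get("LIBCUDF_JIT_PARQUET_FILTER", "0") == "1"

os.environ["LIBCUDF_JIT_PARQUET_FILTER"] = "1"
print(use_jit_post_filter())        # True: env toggle enables JIT filtering
print(use_jit_post_filter(False))   # False: explicit option overrides env
```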

We may want to set up the JIT filter to treat scalars in the predicate as parameters to the JIT kernel rather than literals in the raw string. This would improve cache hits for cases where the filter predicate uses the same columns, types, and tree structure and only the scalars change.
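The cache-hit benefit can be illustrated with a toy code gen that emits a placeholder parameter for each scalar instead of inlining its value. Two predicates differing only in their scalar values then generate byte-identical source, so the JIT cache can reuse the compiled kernel. All names and the UDF shape here are hypothetical:

```python
# Sketch of scalar parameterization: the scalar is bound as kernel
# parameter s<i> and passed at launch time, rather than written into the
# generated source as a literal. Illustrative single-comparison trees only.

def codegen(tree, scalars):
    """Emit source for a (column, op, scalar) comparison; the scalar value
    is collected as a launch-time argument, not source text."""
    col, op, value = tree
    idx = len(scalars)
    scalars.append(value)
    return f"(c{col} {op} s{idx})"

def build_kernel(tree):
    scalars = []
    body = codegen(tree, scalars)
    src = f"__device__ bool pred(double c0, double s0) {{ return {body}; }}"
    return src, scalars

src_a, args_a = build_kernel((0, ">", 10.0))
src_b, args_b = build_kernel((0, ">", 99.0))
# src_a == src_b: identical source despite different scalars -> cache hit,
# while args_a == [10.0] and args_b == [99.0] travel as kernel arguments.
```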

Labels

cuIO, feature request, libcudf