[FEA] Add option for JIT filter for the row-wise post-filter in read_parquet #19384

@GregoryKimball

Description

Is your feature request related to a problem? Please describe.
I wish we had a parquet reader option to perform the row-wise post-filter using the JIT filter API.

Describe the solution you'd like
The parquet reader would continue to accept a filter expressed as a cuDF AST. We could then add another reader option to enable JIT filtering for the final post-filter, while continuing to use AST-based filtering for row group stats.

For the post-filter, today we use compute_column and apply_boolean_mask on the materialized table before returning the row-wise filtered table. We need this step because even when a row group's min/max stats pass the filter predicate, many of the individual rows in that row group may not.
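The current AST-based post-filter amounts to evaluating the predicate into a boolean column, then keeping only the passing rows. A minimal pure-Python sketch of that two-step flow, with plain lists standing in for device columns and a callable standing in for the cuDF AST:

```python
# Toy model of the parquet post-filter: evaluate a predicate row-wise into
# a boolean mask (the role of libcudf's compute_column), then keep only the
# passing rows (the role of apply_boolean_mask). Pure Python, illustrative only.

def compute_mask(table, predicate):
    """Evaluate `predicate` (a callable standing in for a cuDF AST) on each
    row, producing a boolean mask column."""
    n = len(next(iter(table.values())))
    return [predicate({name: col[i] for name, col in table.items()})
            for i in range(n)]

def apply_boolean_mask(table, mask):
    """Keep only the rows where the mask is True."""
    return {name: [v for v, keep in zip(col, mask) if keep]
            for name, col in table.items()}

# A row group whose min/max stats (price in [5, 90]) pass the predicate
# price > 10, even though half of its rows do not -- hence the post-filter.
table = {"price": [5, 20, 7, 90], "qty": [1, 2, 3, 4]}
mask = compute_mask(table, lambda row: row["price"] > 10)
filtered = apply_boolean_mask(table, mask)
# filtered == {"price": [20, 90], "qty": [2, 4]}
```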

To do the post-filter using a JIT-compiled filter, we would need to first create a utility that converts a cuDF AST into a raw string UDF. This probably means a tree traversal and code gen to build up the raw string, plus some extra plumbing to figure out the best way to manage scalars and columns. Presumably this code gen utility would go into a public or detail JIT utilities location. If the AST-to-UDF utility turns out to be too complicated we may need to revisit our approach.
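The tree traversal and code gen described above can be sketched with a toy expression tree. Everything here is hypothetical, including the node types, operator spellings, and the UDF signature; the real utility would walk cuDF AST expression nodes and handle the full set of operators and types:

```python
# Hedged sketch of the AST-to-UDF conversion: a recursive traversal of a
# tiny expression tree that emits a CUDA-C-style source string. The node
# classes and the `pred` signature are illustrative assumptions only.

from dataclasses import dataclass

@dataclass
class ColumnRef:
    index: int

@dataclass
class Literal:
    value: object

@dataclass
class Operation:
    op: str          # e.g. "&&", ">", "<"
    operands: tuple

def emit(node) -> str:
    """Recursively generate a C expression string for one AST node."""
    if isinstance(node, ColumnRef):
        return f"c{node.index}"      # column input parameter
    if isinstance(node, Literal):
        return repr(node.value)      # scalar inlined as a literal
    if isinstance(node, Operation):
        lhs, rhs = (emit(o) for o in node.operands)
        return f"({lhs} {node.op} {rhs})"
    raise TypeError(node)

def ast_to_udf(root) -> str:
    """Wrap the generated expression in a hypothetical filter-UDF body."""
    return (f"__device__ bool pred(double c0, double c1) "
            f"{{ return {emit(root)}; }}")

# (c0 > 10) && (c1 < 5.0)
expr = Operation("&&", (Operation(">", (ColumnRef(0), Literal(10))),
                        Operation("<", (ColumnRef(1), Literal(5.0)))))
print(ast_to_udf(expr))
# __device__ bool pred(double c0, double c1) { return ((c0 > 10) && (5.0 > c1) ... see test
```

The extra plumbing for scalars and columns mentioned above would decide, for each leaf, whether it becomes a kernel parameter or an inlined literal; this sketch simply inlines literals.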

Adding JIT transforms or JIT filters could also be useful for row group and page stats processing, but we should figure that out after this first step.

Describe alternatives you've considered
We could instead target the compute_column API, providing a JIT-based compute column that does the code gen based on the cuDF AST. We would still need an AST-to-UDF conversion utility, but this might be a broader way to test JIT options across cuDF.

The downsides of plugging into compute_column are:

  • It would require changes to query operators to swap between AST and JIT compute_column, whereas the post-filter in read_parquet can be treated as an implementation detail.
  • compute_column may lead to excessive JIT compilations per query (one for the row group filter, one for the post-filter, and one for each filter/project expression), whereas the read_parquet option could create up to one kernel per table in most cases.
  • compute_column could be called on tiny amounts of data, e.g. post-reduction. By contrast, the parquet post-filter will generally have a large row count.

Additional context
We probably want to add an environment variable to toggle the JIT filter option in read_parquet so that testing is easy for cudf-polars and possibly Spark-RAPIDS.
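The toggle might look like the following sketch, where the reader option takes precedence and the environment variable acts as a fallback for easy testing. The variable name LIBCUDF_JIT_PARQUET_FILTER is an invented placeholder, not an existing cudf setting:

```python
# Hypothetical sketch of an environment-variable toggle for the JIT
# post-filter. The variable name is an assumption for illustration only.
import os

def use_jit_post_filter(reader_option=None):
    """Explicit reader option wins if set; otherwise fall back to the env var."""
    if reader_option is not None:
        return reader_option
    return os.environ.get("LIBCUDF_JIT_PARQUET_FILTER", "0") == "1"

os.environ["LIBCUDF_JIT_PARQUET_FILTER"] = "1"
print(use_jit_post_filter())        # True: env toggle enables JIT filtering
print(use_jit_post_filter(False))   # False: explicit option overrides env
```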

We may want to set up the JIT filter to treat scalars in the predicate as parameters to the JIT kernel rather than literals in the raw string. This would improve cache hits for cases where the filter predicate uses the same columns, types, and tree structure and only the scalars change.
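The cache-hit benefit can be illustrated with a toy code gen that emits a placeholder parameter for each scalar instead of inlining its value. Two predicates differing only in their scalar values then generate byte-identical source, so the JIT cache can reuse the compiled kernel. All names and the UDF shape here are hypothetical:

```python
# Sketch of scalar parameterization: the scalar is bound as kernel
# parameter s<i> and passed at launch time, rather than written into the
# generated source as a literal. Illustrative single-comparison trees only.

def codegen(tree, scalars):
    """Emit source for a (column, op, scalar) comparison; the scalar value
    is collected as a launch-time argument, not source text."""
    col, op, value = tree
    idx = len(scalars)
    scalars.append(value)
    return f"(c{col} {op} s{idx})"

def build_kernel(tree):
    scalars = []
    body = codegen(tree, scalars)
    src = f"__device__ bool pred(double c0, double s0) {{ return {body}; }}"
    return src, scalars

src_a, args_a = build_kernel((0, ">", 10.0))
src_b, args_b = build_kernel((0, ">", 99.0))
# src_a == src_b: identical source despite different scalars -> cache hit,
# while args_a == [10.0] and args_b == [99.0] travel as kernel arguments.
```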

Labels

cuIO, feature request, libcudf