🚧 Implement an experimental Parquet reader optimized for highly-selective hybrid scan reads #18011
@@ -0,0 +1,224 @@
/*
Copy-pasted from `reader_impl_chunking.cu` for now. No need to review.
@@ -0,0 +1,82 @@
/*
Copy-pasted from `reader_impl_preprocess.cu`. No need to review.
… metadata APIs (#18480)
Contributes to #17896. Part of #18011.
This PR adds the high level interface (APIs) to a new experimental Parquet reader optimized for highly selective (hybrid scan) queries. The PR also adds implementations for the basic metadata related APIs of the new reader such as reading the file footer and PageIndex.
Authors:
- Muhammad Haseeb (https://github.com/mhaseeb123)
Approvers:
- David Wendt (https://github.com/davidwendt)
- Vyas Ramasubramani (https://github.com/vyasr)
- Vukasin Milovanovic (https://github.com/vuule)
- https://github.com/nvdbaranec
URL: #18480
)
Contributes to #17896. Part of #18011.
This PR implements row group pruning with stats in the experimental Parquet reader optimized for hybrid scan queries.
Authors:
- Muhammad Haseeb (https://github.com/mhaseeb123)
- Bradley Dice (https://github.com/bdice)
Approvers:
- David Wendt (https://github.com/davidwendt)
- Vukasin Milovanovic (https://github.com/vuule)
- Bradley Dice (https://github.com/bdice)
URL: #18543
…der (#18545)
Contributes to #17896. Part of #18011.
This PR implements row group pruning with bloom filters in the experimental Parquet reader optimized for hybrid scan queries. Dictionary-based row group pruning is still WIP in a separate branch, so this PR has empty definitions where needed.
Note: Unfortunately, we can't add any tests for this feature, as we do not yet have the capability to write Parquet files with bloom filters. However, the code that filters row groups with bloom filters is identical to the already tested code at: https://github.com/rapidsai/cudf/blob/branch-25.06/cpp/src/io/parquet/predicate_pushdown.cpp#L198-L240
Authors:
- Muhammad Haseeb (https://github.com/mhaseeb123)
Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Nghia Truong (https://github.com/ttnghia)
- Bradley Dice (https://github.com/bdice)
URL: #18545
Contributes to #17896. Part of #18011. Implements the feature request in #9269.
This PR implements discarding of Parquet data pages using the page-level (min/max) statistics contained in the page index section of a Parquet file, in the experimental Parquet reader for optimizing hybrid scan queries.
Authors:
- Muhammad Haseeb (https://github.com/mhaseeb123)
Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Kyle Edwards (https://github.com/KyleFromNVIDIA)
- Shruti Shivakumar (https://github.com/shrshi)
URL: #18873
…er (#18836)
Contributes to #17896. Part of #18011. Closes #18046.
This PR implements row group pruning using dictionary pages of Parquet column chunks in the experimental Parquet reader for optimizing hybrid scan queries.
## Tasklist
- [x] Code cleanup and add comments
- [x] Add tests with more complex types and predicates
- [x] Add special handling for FLBAs and INT96 type if needed
Authors:
- Muhammad Haseeb (https://github.com/mhaseeb123)
- Vukasin Milovanovic (https://github.com/vuule)
Approvers:
- Paul Mattione (https://github.com/pmattione-nvidia)
- Yunsong Wang (https://github.com/PointKernel)
- Vukasin Milovanovic (https://github.com/vuule)
- https://github.com/nvdbaranec
- Vyas Ramasubramani (https://github.com/vyasr)
URL: #18836
Closing as this is completed by #19308.
Description
🚧 Closes #17896
This PR implements an experimental Parquet reader optimized for highly-selective hybrid scan reads. The new experimental reader provides APIs to prune row groups and data pages based on the AST filter expression.
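As a rough illustration of the pruning idea only (not the reader's actual code or API), the sketch below keeps a row group when its min/max statistics overlap a closed-range predicate. `RowGroupStats`, `RangePredicate`, and `prune_row_groups` are hypothetical names invented for this example, and the predicate is a simple stand-in for a general AST filter expression.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-row-group statistics for a single int64 column.
struct RowGroupStats {
  std::int64_t min;
  std::int64_t max;
};

// Hypothetical closed-range predicate: lo <= value <= hi.
struct RangePredicate {
  std::int64_t lo;
  std::int64_t hi;
};

// Returns the indices of row groups whose [min, max] interval overlaps the
// predicate range. Row groups that cannot contain a match are dropped.
std::vector<std::size_t> prune_row_groups(std::vector<RowGroupStats> const& stats,
                                          RangePredicate const& pred)
{
  assert(pred.lo <= pred.hi);
  std::vector<std::size_t> surviving;
  for (std::size_t i = 0; i < stats.size(); ++i) {
    bool const may_match = stats[i].max >= pred.lo && stats[i].min <= pred.hi;
    if (may_match) { surviving.push_back(i); }
  }
  return surviving;
}
```

The same overlap test applies at page granularity when page-level min/max statistics are available. Statistics can only rule a row group out; surviving row groups may still contain no matching rows.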
Once pruning is complete, the Parquet data itself is materialized into the table in two passes. The first pass materializes only the filter columns (columns that appear in the filter expression, also called predicate columns), and the second pass materializes only the (optionally selected) payload columns (columns that don't appear in the filter expression).
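The two-pass flow can be sketched on plain in-memory vectors as follows. This is a toy model under stated assumptions, not the reader's interface: the function names, the `value > threshold` predicate, and the use of a boolean row mask to carry the first pass's result into the second pass are all illustrative choices made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Column = std::vector<std::int64_t>;

// Pass 1 (illustrative): materialize only the filter column and evaluate the
// predicate `value > threshold`, producing one boolean per row.
std::vector<bool> materialize_filter_pass(Column const& filter_column, std::int64_t threshold)
{
  std::vector<bool> row_mask(filter_column.size());
  for (std::size_t i = 0; i < filter_column.size(); ++i) {
    row_mask[i] = filter_column[i] > threshold;
  }
  return row_mask;
}

// Pass 2 (illustrative): materialize the payload columns, keeping only the
// rows that survived the first pass.
std::vector<Column> materialize_payload_pass(std::vector<Column> const& payload_columns,
                                             std::vector<bool> const& row_mask)
{
  std::vector<Column> result;
  result.reserve(payload_columns.size());
  for (auto const& column : payload_columns) {
    assert(column.size() == row_mask.size());
    Column filtered;
    for (std::size_t i = 0; i < column.size(); ++i) {
      if (row_mask[i]) { filtered.push_back(column[i]); }
    }
    result.push_back(std::move(filtered));
  }
  return result;
}
```

Splitting the work this way means the payload columns are only touched for rows that pass the filter, which is where the savings come from when the filter is highly selective.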
Note that it is now the responsibility of the caller to fetch the specified byte ranges from the parquet source and provide them to the reader.
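On the caller side, this could look roughly like the sketch below, which copies a list of requested byte ranges out of an in-memory buffer standing in for the Parquet source. `ByteRange` and `fetch_byte_ranges` are hypothetical names for this example; a real caller would issue file or network reads instead, and the reader's actual types may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical description of one requested range within the source.
struct ByteRange {
  std::size_t offset;
  std::size_t size;
};

// Copies each requested range out of `source` into its own buffer, in the
// order requested. Throws std::out_of_range if a range exceeds the source.
std::vector<std::vector<std::uint8_t>> fetch_byte_ranges(
  std::vector<std::uint8_t> const& source, std::vector<ByteRange> const& ranges)
{
  std::vector<std::vector<std::uint8_t>> buffers;
  buffers.reserve(ranges.size());
  for (auto const& range : ranges) {
    if (range.offset > source.size() || range.size > source.size() - range.offset) {
      throw std::out_of_range("byte range exceeds source size");
    }
    auto const first = source.begin() + static_cast<std::ptrdiff_t>(range.offset);
    buffers.emplace_back(first, first + static_cast<std::ptrdiff_t>(range.size));
  }
  return buffers;
}
```

Leaving the fetch to the caller lets it decide how to batch or overlap reads for its own storage layer.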
Currently, the experimental reader materializes the table in each pass all at once, without support for chunking, and supports reading from only a single Parquet source.
Checklist