🚧 Implement an experimental Parquet reader optimized for highly-selective hybrid scan reads #18011

Closed

@mhaseeb123 (Member) commented Feb 14, 2025

Description

🚧 Closes #17896

This PR implements an experimental Parquet reader optimized for highly-selective hybrid scan reads. The new experimental reader provides APIs to prune row groups and data pages based on the AST filter expression.

Once pruning is complete, the Parquet data itself is materialized into tables in two passes. The first pass materializes only the filter columns (columns that appear in the filter expression, also called predicate columns). The second pass materializes only the payload columns (columns that don't appear in the filter expression), or an optionally selected subset of them.
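
A minimal sketch of the two-pass idea, in plain C++ rather than libcudf. Everything here is hypothetical and for illustration only: the `column` alias, `build_row_mask`, and `apply_row_mask` are not reader APIs, and the use of a per-row boolean mask to carry the result of the first pass into the second is an assumption about how the passes could be connected.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for a decoded column; not a cudf type.
using column = std::vector<std::int64_t>;

// Pass 1: evaluate the predicate over the (already decoded) filter column
// and return one boolean per row, true meaning the row survives.
std::vector<bool> build_row_mask(column const& filter_col,
                                 std::function<bool(std::int64_t)> const& predicate)
{
  std::vector<bool> mask(filter_col.size());
  for (std::size_t i = 0; i < filter_col.size(); ++i) {
    mask[i] = predicate(filter_col[i]);
  }
  return mask;
}

// Pass 2: keep only the payload rows that the mask marked as surviving.
column apply_row_mask(column const& payload_col, std::vector<bool> const& mask)
{
  assert(payload_col.size() == mask.size());
  column out;
  for (std::size_t i = 0; i < mask.size(); ++i) {
    if (mask[i]) { out.push_back(payload_col[i]); }
  }
  return out;
}
```

With a highly selective filter, most mask entries are false, so the second pass has little payload data to produce.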

Note that it is now the responsibility of the caller to fetch the specified byte ranges from the parquet source and provide them to the reader.
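
As a rough illustration of the caller's side of this contract, the sketch below copies a list of byte ranges out of an in-memory source. The `byte_range` struct and `fetch_byte_ranges` function are hypothetical names invented for this example, and a real caller would more likely read from a file or remote object store than from a `std::vector`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of one contiguous region of the Parquet source.
struct byte_range {
  std::size_t offset;
  std::size_t size;
};

// Copy each requested range out of the source into its own buffer.
// The buffers are returned in the same order as the requested ranges.
std::vector<std::vector<std::uint8_t>> fetch_byte_ranges(
  std::vector<std::uint8_t> const& source, std::vector<byte_range> const& ranges)
{
  std::vector<std::vector<std::uint8_t>> buffers;
  buffers.reserve(ranges.size());
  for (auto const& r : ranges) {
    // Reject ranges that run past the end of the source.
    assert(r.offset <= source.size() && r.size <= source.size() - r.offset);
    auto const first = source.begin() + static_cast<std::ptrdiff_t>(r.offset);
    buffers.emplace_back(first, first + static_cast<std::ptrdiff_t>(r.size));
  }
  return buffers;
}
```

Keeping the fetch outside the reader lets the caller choose how to read (for example, coalescing or parallelizing requests) without the reader knowing about the storage layer.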

Currently, the experimental reader materializes the table for each pass in one go, without support for chunking, and supports reading from a single Parquet source only.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

copy-pr-bot bot commented Feb 14, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions bot added the libcudf and CMake labels Feb 14, 2025
@mhaseeb123 added the feature request, cuIO, 2 - In Progress, and non-breaking labels Feb 14, 2025
@@ -0,0 +1,224 @@
/*
Member Author

Copy-pasted from reader_impl_chunking.cu for now. No need to review.

@@ -0,0 +1,82 @@
/*
Member Author

Copy-pasted from reader_impl_preprocess.cu. No need to review.

@mhaseeb123 added the DO NOT MERGE label Feb 19, 2025
@mhaseeb123 changed the title from "Setup and implement row group pruning in the experimental Parquet reader" to "🚧 Setup and implement row group pruning in the experimental Parquet reader" Feb 26, 2025
rapids-bot bot pushed a commit that referenced this pull request Apr 30, 2025
… metadata APIs (#18480)

Contributes to #17896. Part of #18011.

This PR adds the high level interface (APIs) to a new experimental Parquet reader optimized for highly selective (hybrid scan) queries. The PR also adds implementations for the basic metadata related APIs of the new reader such as reading the file footer and PageIndex.

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Vukasin Milovanovic (https://github.com/vuule)
  - https://github.com/nvdbaranec

URL: #18480
mhaseeb123 added a commit to mhaseeb123/cudf that referenced this pull request Apr 30, 2025
rapids-bot bot pushed a commit that referenced this pull request May 6, 2025
)

Contributes to #17896. Part of #18011.

This PR implements row group pruning with stats in the experimental Parquet reader optimized for hybrid scan queries.
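
To make the idea concrete, here is a small sketch of stats-based pruning for a single `column > threshold` predicate. It is plain C++ and not the reader's implementation: `row_group_stats` and `prune_row_groups_greater_than` are hypothetical names, and the stats are assumed to be per-row-group min/max values for one integer column.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical min/max statistics of one column within one row group.
struct row_group_stats {
  std::int64_t min;
  std::int64_t max;
};

// Return the indices of row groups that may contain a row with
// `column > threshold`. A row group is dropped only when its max value
// proves that no row can satisfy the predicate.
std::vector<std::size_t> prune_row_groups_greater_than(
  std::vector<row_group_stats> const& stats, std::int64_t threshold)
{
  std::vector<std::size_t> surviving;
  for (std::size_t i = 0; i < stats.size(); ++i) {
    assert(stats[i].min <= stats[i].max);
    if (stats[i].max > threshold) { surviving.push_back(i); }
  }
  return surviving;
}
```

Pruning of this kind is conservative: a surviving row group may still contain no matching rows, but a pruned one never does.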

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Bradley Dice (https://github.com/bdice)

URL: #18543
rapids-bot bot pushed a commit that referenced this pull request May 15, 2025
…der (#18545)

Contributes to #17896. Part of #18011.

This PR implements row group pruning with bloom filters in the experimental Parquet reader optimized for hybrid scan queries. Dictionary-based row group pruning is still WIP in a separate branch, so this PR has empty definitions where needed.

Note: Unfortunately, we can't add any tests for this feature because we don't yet have the capability to write Parquet files with bloom filters. However, the code that filters row groups with bloom filters is identical to already tested code at: https://github.com/rapidsai/cudf/blob/branch-25.06/cpp/src/io/parquet/predicate_pushdown.cpp#L198-L240
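
For readers unfamiliar with the technique, the sketch below shows why a bloom filter can prune a row group for an equality predicate. It is a deliberately simplified, generic bloom filter and not the layout or hash that Parquet files store; `simple_bloom_filter` and `can_skip_row_group` are hypothetical names, and the two-hash scheme is an assumption made for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified bloom filter over 64-bit integers (illustration only).
class simple_bloom_filter {
 public:
  explicit simple_bloom_filter(std::size_t num_bits) : bits_(num_bits, false)
  {
    assert(num_bits > 0);
  }

  // Set the bits selected by each of the two hash functions.
  void insert(std::int64_t value)
  {
    for (std::uint64_t seed = 1; seed <= 2; ++seed) { bits_[bit_index(value, seed)] = true; }
  }

  // False means the value was definitely never inserted;
  // true means it may have been (false positives are possible).
  bool may_contain(std::int64_t value) const
  {
    for (std::uint64_t seed = 1; seed <= 2; ++seed) {
      if (!bits_[bit_index(value, seed)]) { return false; }
    }
    return true;
  }

 private:
  // 64-bit mixing function (splitmix64 finalizer).
  static std::uint64_t mix(std::uint64_t x)
  {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
  }

  std::size_t bit_index(std::int64_t value, std::uint64_t seed) const
  {
    auto const h = mix(static_cast<std::uint64_t>(value) + seed * 0x9e3779b97f4a7c15ULL);
    return static_cast<std::size_t>(h % bits_.size());
  }

  std::vector<bool> bits_;
};

// For `column == literal`, a row group can be skipped only when its
// filter proves that the literal is absent.
bool can_skip_row_group(simple_bloom_filter const& filter, std::int64_t literal)
{
  return !filter.may_contain(literal);
}
```

Because a bloom filter never reports a false negative, skipping on a negative answer cannot drop matching rows.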

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Nghia Truong (https://github.com/ttnghia)
  - Bradley Dice (https://github.com/bdice)

URL: #18545
rapids-bot bot pushed a commit that referenced this pull request Jun 30, 2025
Contributes to #17896. Part of #18011. Implements the feature request in #9269.

This PR implements discarding of Parquet data pages using the page level (min/max) statistics contained in the page index section of a parquet file, in the experimental Parquet reader for optimizing hybrid scan queries.
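
A minimal sketch of how page-level min/max statistics could translate into a set of surviving rows for a `column == literal` predicate. The `page_stats` struct and `build_page_row_mask` function are hypothetical, and representing the result as a per-row boolean mask is an assumption made for this example, not a description of the reader's internals.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-page entry: which rows the page covers and the
// min/max of the values it holds.
struct page_stats {
  std::size_t first_row;
  std::size_t num_rows;
  std::int64_t min;
  std::int64_t max;
};

// Mark every row that lives in a page whose [min, max] range may contain
// the literal. Rows in all other pages stay false and can be discarded.
std::vector<bool> build_page_row_mask(std::vector<page_stats> const& pages,
                                      std::size_t total_rows,
                                      std::int64_t literal)
{
  std::vector<bool> mask(total_rows, false);
  for (auto const& p : pages) {
    assert(p.first_row + p.num_rows <= total_rows);
    bool const may_match = p.min <= literal && literal <= p.max;
    if (!may_match) { continue; }
    for (std::size_t r = p.first_row; r < p.first_row + p.num_rows; ++r) {
      mask[r] = true;
    }
  }
  return mask;
}
```

Pages whose rows are all false in the mask never need to be fetched or decoded.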

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)
  - Shruti Shivakumar (https://github.com/shrshi)

URL: #18873
rapids-bot bot pushed a commit that referenced this pull request Jul 7, 2025
…er (#18836)

Contributes to #17896. Part of #18011.

Closes #18046

This PR implements row group pruning using dictionary pages of parquet column chunks in the experimental Parquet reader for optimizing hybrid scan queries. 
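
The sketch below illustrates the reasoning behind dictionary-based pruning for equality and inequality predicates. It assumes the column chunk is entirely dictionary-encoded, so that the dictionary lists every distinct non-null value in the row group. The `compare_op` enum and `can_skip_row_group` function are hypothetical names for this example only.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical subset of comparison operators handled by this sketch.
enum class compare_op { EQUAL, NOT_EQUAL };

// Decide whether a row group can be skipped, given the dictionary of one
// column chunk (assumed to hold every distinct non-null value in it).
bool can_skip_row_group(std::vector<std::int64_t> const& dictionary,
                        compare_op op,
                        std::int64_t literal)
{
  bool const found =
    std::find(dictionary.begin(), dictionary.end(), literal) != dictionary.end();
  switch (op) {
    // `column == literal` cannot match if the literal is not in the dictionary.
    case compare_op::EQUAL: return !found;
    // `column != literal` cannot match if the literal is the only value present.
    case compare_op::NOT_EQUAL:
      return !dictionary.empty() &&
             std::all_of(dictionary.begin(), dictionary.end(),
                         [literal](std::int64_t v) { return v == literal; });
  }
  return false;  // Unknown operator: conservatively keep the row group.
}
```

Unlike a bloom filter, a dictionary gives an exact answer about which values are present, so it can also prune `!=` predicates.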

## Tasklist
- [x] Code cleanup and add comments
- [x] Add tests with more complex types and predicates 
- [x] Add special handling for FLBAs and INT96 type if needed

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Paul Mattione (https://github.com/pmattione-nvidia)
  - Yunsong Wang (https://github.com/PointKernel)
  - Vukasin Milovanovic (https://github.com/vuule)
  - https://github.com/nvdbaranec
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #18836
@mhaseeb123
Member Author

Closing as this is completed by #19308

@mhaseeb123 closed this Jul 11, 2025
@mhaseeb123 removed the 2 - In Progress label Jul 11, 2025
Labels
CMake, cuIO, DO NOT MERGE, feature request, libcudf, non-breaking
Development

Successfully merging this pull request may close these issues.

[FEA] Add a new Parquet reader for high-selectivity table scan
1 participant