Skip to content

Materialize tables in the experimental Parquet reader #19308

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 32 commits into
base: branch-25.08
Choose a base branch
from

Conversation

mhaseeb123
Copy link
Member

@mhaseeb123 mhaseeb123 commented Jul 8, 2025

Description

Contributes to #17896. Completes #18011

This PR implements table materialization functions in the experimental parquet reader. The experimental reader now derives from the base parquet reader and only overloads the necessary functions reusing the base functions wherever possible.

Most of the functions reimplemented by the experimental reader are also mostly identical with the differences from the base reader mentioned in the comments

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mhaseeb123 mhaseeb123 self-assigned this Jul 8, 2025
@mhaseeb123 mhaseeb123 added the DO NOT MERGE Hold off on merging; see PR for details label Jul 8, 2025
Copy link

copy-pr-bot bot commented Jul 8, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Jul 8, 2025
@mhaseeb123 mhaseeb123 changed the title Fea/materialize hybrid scan columns 🚧 Materialize tables in the experimental Parquet reader Jul 8, 2025
@mhaseeb123 mhaseeb123 added 2 - In Progress Currently a work in progress cuIO cuIO issue labels Jul 8, 2025
@mhaseeb123 mhaseeb123 added feature request New feature or request non-breaking Non-breaking change and removed DO NOT MERGE Hold off on merging; see PR for details labels Jul 10, 2025
rapids-bot bot pushed a commit that referenced this pull request Jul 11, 2025
…:detail::reader` (#19305)

Contributes to #19308

This PR moves out `parquet::reader::impl` class out of `parquet::reader` to allow subclassing (the experimental reader) from it and reusing its functions without having to duplicate code

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - Paul Mattione (https://github.com/pmattione-nvidia)
  - Tianyu Liu (https://github.com/kingcrimsontianyu)
  - Nghia Truong (https://github.com/ttnghia)

URL: #19305
@@ -262,44 +262,6 @@ aggregate_reader_metadata::select_payload_columns(
valid_payload_columns, {}, include_index, strings_to_categorical, timestamp_type_id);
}

std::tuple<cudf::size_type, std::vector<row_group_info>>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer needed. Using the base class function instead

Comment on lines +86 to +89
set_page_mask(data_page_mask);

// setup the next sub pass
setup_next_subpass(read_mode::READ_ALL);
Copy link
Member Author

@mhaseeb123 mhaseeb123 Jul 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only change from the base reader's impl:

  1. Need to set_page_mask at L86
  2. Use hardcoded READ_ALL mode until we enable chunking

pass.num_rows = _file_itm_data.global_num_rows;

// Setup page information for the chunk (which we can access without decompressing)
setup_compressed_data(std::move(column_chunk_buffers));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only change from the base reader's impl:
Here we are just setting up internal data pointers to the externally passed in column chunk data buffers instead of using read_compressed_data in the base reader's impl

@@ -41,21 +44,6 @@ using parquet::detail::chunk_page_info;
using parquet::detail::ColumnChunkDesc;
using parquet::detail::PageInfo;

/**
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer needed

@@ -135,7 +214,8 @@ hybrid_scan_reader_impl::prepare_dictionaries(
rmm::cuda_stream_view stream)
{
// Create row group information for the input row group indices
auto const row_groups_info = std::get<1>(_metadata->select_row_groups(row_group_indices));
auto const row_groups_info = std::get<2>(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call the base class's function instead

use_pandas_metadata,
strings_to_categorical,
timestamp_type_id);
_extended_metadata->select_payload_columns(options.get_columns(),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use _extended_metadata, the ptr to the child version of aggregate_reader_metadata which implements these extended functions

cudf::detail::apply_boolean_mask(read_table->view(), *final_row_mask, _stream, _mr);

// Update the input row mask to reflect the final row mask.
update_row_mask(final_row_mask->view(), row_mask, _stream);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Difference: Update the (input) mutable row_mask to delete rows that are deleted by final_row_mask

Comment on lines +1113 to +1122
else {
CUDF_EXPECTS(read_columns_mode == read_columns_mode::PAYLOAD_COLUMNS, "Invalid read mode");
CUDF_EXPECTS(row_mask.type().id() == type_id::BOOL8,
"Predicate filter should return a boolean");
CUDF_EXPECTS(row_mask.size() == read_table->num_rows(),
"Encountered invalid sized row mask to apply");
auto output_table =
cudf::detail::apply_boolean_mask(read_table->view(), row_mask, _stream, _mr);
return {std::move(output_table), std::move(out_metadata)};
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Difference: In case of payload columns, only apply the input row_mask to the read table

Comment on lines +1089 to +1091
// If reading filter columns, compute the predicate, apply it to the table, and update the input
// row mask to reflect the final surviving rows.
if constexpr (std::is_same_v<RowMaskView, cudf::mutable_column_view>) {
Copy link
Member Author

@mhaseeb123 mhaseeb123 Jul 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Difference: Comment at L1089 😄 vs we always apply the _expr_conv filter if available to the read_table in the mainline reader

Comment on lines +1050 to +1055
template <typename RowMaskView>
table_with_metadata hybrid_scan_reader_impl::finalize_output(
read_columns_mode read_columns_mode,
table_metadata& out_metadata,
std::vector<std::unique_ptr<column>>& out_columns,
RowMaskView row_mask)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Difference in function signature. Accept a tparam and a row_mask

@@ -304,25 +327,6 @@ class hybrid_scan_reader_impl {
void setup_next_pass(std::vector<rmm::device_buffer> column_chunk_buffers,
parquet_reader_options const& options);

/**
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these member functions now no longer needed as we can reuse their base versions

@@ -370,7 +427,30 @@ table_with_metadata hybrid_scan_reader_impl::materialize_filter_columns(
parquet_reader_options const& options,
rmm::cuda_stream_view stream)
{
return {};
CUDF_EXPECTS(not row_group_indices.empty(), "Empty input row group indices encountered");
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a new function. Please review 🙂

@@ -380,35 +460,826 @@ table_with_metadata hybrid_scan_reader_impl::materialize_payload_columns(
parquet_reader_options const& options,
rmm::cuda_stream_view stream)
{
return {};
CUDF_EXPECTS(not row_group_indices.empty(), "Empty input row group indices encountered");
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a new function. Please review 🙂

@@ -343,23 +349,74 @@ std::pair<std::vector<byte_range_info>, std::vector<cudf::size_type>>
hybrid_scan_reader_impl::get_input_column_chunk_byte_ranges(
cudf::host_span<std::vector<size_type> const> row_group_indices) const
{
return {};
// Descriptors for all the chunks that make up the selected columns
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a new function. Please review 🙂

…put-nullmasks-for-pruned-pages' into fea/materialize-hybrid-scan-columns
rapids-bot bot pushed a commit that referenced this pull request Jul 16, 2025
…asing in bulk null mask set (#19349)

Related to #18489. Contributes to #19308

This PR implements a new `cudf::set_null_masks_safe` API which safely handles intra-word aliasing when setting null masks with a small performance hit compared to `cudf::set_null_masks_unsafe`. Additionally, the `cudf::set_null_masks` API has been deprecated in favor of `cudf::set_null_masks_unsafe`

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - David Wendt (https://github.com/davidwendt)

URL: #19349
rapids-bot bot pushed a commit that referenced this pull request Jul 17, 2025
Contributes to #19308

This PR refactors gtests related to the hybrid scan reader (experimental parquet reader) to a separate executable. 

No new code is added by this PR except trivial style and CMake changes.

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Devavret Makkar (https://github.com/devavret)
  - Bradley Dice (https://github.com/bdice)

URL: #19359
@mhaseeb123 mhaseeb123 added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Jul 17, 2025
@mhaseeb123 mhaseeb123 changed the title 🚧 Materialize tables in the experimental Parquet reader Materialize tables in the experimental Parquet reader Jul 17, 2025
@mhaseeb123 mhaseeb123 marked this pull request as ready for review July 17, 2025 20:23
@mhaseeb123 mhaseeb123 requested review from a team as code owners July 17, 2025 20:23
@mhaseeb123 mhaseeb123 requested a review from ttnghia July 17, 2025 20:24
galipremsagar pushed a commit to galipremsagar/cudf that referenced this pull request Jul 17, 2025
…19359)

Contributes to rapidsai#19308

This PR refactors gtests related to the hybrid scan reader (experimental parquet reader) to a separate executable. 

No new code is added by this PR except trivial style and CMake changes.

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Devavret Makkar (https://github.com/devavret)
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#19359
rapids-bot bot pushed a commit that referenced this pull request Jul 18, 2025
Contributes to PR #19308

This PR adds a parquet reader utility to update the null masks of output buffers in order to nullify rows corresponding to the pruned out pages.

No independent tests for this PR here. Instead this is tested in #19308

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - Tianyu Liu (https://github.com/kingcrimsontianyu)
  - Bradley Dice (https://github.com/bdice)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #19370
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Status: Burndown
Development

Successfully merging this pull request may close these issues.

1 participant