-
Notifications
You must be signed in to change notification settings - Fork 961
Materialize tables in the experimental Parquet reader #19308
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: branch-25.08
Are you sure you want to change the base?
Materialize tables in the experimental Parquet reader #19308
Conversation
https://github.com/mhaseeb123/cudf into fea/materialize-hybrid-scan-columns
https://github.com/mhaseeb123/cudf into fea/materialize-hybrid-scan-columns
https://github.com/mhaseeb123/cudf into fea/materialize-hybrid-scan-columns
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
…:detail::reader` (#19305) Contributes to #19308 This PR moves out `parquet::reader::impl` class out of `parquet::reader` to allow subclassing (the experimental reader) from it and reusing its functions without having to duplicate code Authors: - Muhammad Haseeb (https://github.com/mhaseeb123) Approvers: - Paul Mattione (https://github.com/pmattione-nvidia) - Tianyu Liu (https://github.com/kingcrimsontianyu) - Nghia Truong (https://github.com/ttnghia) URL: #19305
@@ -262,44 +262,6 @@ aggregate_reader_metadata::select_payload_columns( | |||
valid_payload_columns, {}, include_index, strings_to_categorical, timestamp_type_id); | |||
} | |||
|
|||
std::tuple<cudf::size_type, std::vector<row_group_info>> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No longer needed. Using the base class function instead
set_page_mask(data_page_mask); | ||
|
||
// setup the next sub pass | ||
setup_next_subpass(read_mode::READ_ALL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only change from the base reader's impl:
- Need to
set_page_mask
at L86 - Use hardcoded
READ_ALL
mode until we enable chunking
pass.num_rows = _file_itm_data.global_num_rows; | ||
|
||
// Setup page information for the chunk (which we can access without decompressing) | ||
setup_compressed_data(std::move(column_chunk_buffers)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only change from the base reader's impl:
Here we are just setting up internal data pointers to the externally passed in column chunk data buffers instead of using read_compressed_data
in the base reader's impl
@@ -41,21 +44,6 @@ using parquet::detail::chunk_page_info; | |||
using parquet::detail::ColumnChunkDesc; | |||
using parquet::detail::PageInfo; | |||
|
|||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No longer needed
@@ -135,7 +214,8 @@ hybrid_scan_reader_impl::prepare_dictionaries( | |||
rmm::cuda_stream_view stream) | |||
{ | |||
// Create row group information for the input row group indices | |||
auto const row_groups_info = std::get<1>(_metadata->select_row_groups(row_group_indices)); | |||
auto const row_groups_info = std::get<2>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Call the base class's function instead
use_pandas_metadata, | ||
strings_to_categorical, | ||
timestamp_type_id); | ||
_extended_metadata->select_payload_columns(options.get_columns(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use _extended_metadata
, the ptr to the child version of aggregate_reader_metadata
which implements these extended functions
cudf::detail::apply_boolean_mask(read_table->view(), *final_row_mask, _stream, _mr); | ||
|
||
// Update the input row mask to reflect the final row mask. | ||
update_row_mask(final_row_mask->view(), row_mask, _stream); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Difference: Update the (input) mutable row_mask
to delete rows that are deleted by final_row_mask
else { | ||
CUDF_EXPECTS(read_columns_mode == read_columns_mode::PAYLOAD_COLUMNS, "Invalid read mode"); | ||
CUDF_EXPECTS(row_mask.type().id() == type_id::BOOL8, | ||
"Predicate filter should return a boolean"); | ||
CUDF_EXPECTS(row_mask.size() == read_table->num_rows(), | ||
"Encountered invalid sized row mask to apply"); | ||
auto output_table = | ||
cudf::detail::apply_boolean_mask(read_table->view(), row_mask, _stream, _mr); | ||
return {std::move(output_table), std::move(out_metadata)}; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Difference: In case of payload columns, only apply the input row_mask
to the read table
// If reading filter columns, compute the predicate, apply it to the table, and update the input | ||
// row mask to reflect the final surviving rows. | ||
if constexpr (std::is_same_v<RowMaskView, cudf::mutable_column_view>) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Difference: Comment at L1089 😄 vs we always apply the _expr_conv
filter if available to the read_table
in the mainline reader
template <typename RowMaskView> | ||
table_with_metadata hybrid_scan_reader_impl::finalize_output( | ||
read_columns_mode read_columns_mode, | ||
table_metadata& out_metadata, | ||
std::vector<std::unique_ptr<column>>& out_columns, | ||
RowMaskView row_mask) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Difference in function signature. Accept a tparam and a row_mask
@@ -304,25 +327,6 @@ class hybrid_scan_reader_impl { | |||
void setup_next_pass(std::vector<rmm::device_buffer> column_chunk_buffers, | |||
parquet_reader_options const& options); | |||
|
|||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All these member functions now no longer needed as we can reuse their base versions
@@ -370,7 +427,30 @@ table_with_metadata hybrid_scan_reader_impl::materialize_filter_columns( | |||
parquet_reader_options const& options, | |||
rmm::cuda_stream_view stream) | |||
{ | |||
return {}; | |||
CUDF_EXPECTS(not row_group_indices.empty(), "Empty input row group indices encountered"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a new function. Please review 🙂
@@ -380,35 +460,826 @@ table_with_metadata hybrid_scan_reader_impl::materialize_payload_columns( | |||
parquet_reader_options const& options, | |||
rmm::cuda_stream_view stream) | |||
{ | |||
return {}; | |||
CUDF_EXPECTS(not row_group_indices.empty(), "Empty input row group indices encountered"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a new function. Please review 🙂
@@ -343,23 +349,74 @@ std::pair<std::vector<byte_range_info>, std::vector<cudf::size_type>> | |||
hybrid_scan_reader_impl::get_input_column_chunk_byte_ranges( | |||
cudf::host_span<std::vector<size_type> const> row_group_indices) const | |||
{ | |||
return {}; | |||
// Descriptors for all the chunks that make up the selected columns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a new function. Please review 🙂
…put-nullmasks-for-pruned-pages' into fea/materialize-hybrid-scan-columns
…asing in bulk null mask set (#19349) Related to #18489. Contributes to #19308 This PR implements a new `cudf::set_null_masks_safe` API which safely handles intra-word aliasing when setting null masks with a small performance hit compared to `cudf::set_null_masks_unsafe`. Additionally, the `cudf::set_null_masks` API has been deprecated in favor of `cudf::set_null_masks_unsafe` Authors: - Muhammad Haseeb (https://github.com/mhaseeb123) Approvers: - Yunsong Wang (https://github.com/PointKernel) - David Wendt (https://github.com/davidwendt) URL: #19349
Contributes to #19308 This PR refactors gtests related to the hybrid scan reader (experimental parquet reader) to a separate executable. No new code is added by this PR except trivial style and CMake changes. Authors: - Muhammad Haseeb (https://github.com/mhaseeb123) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Devavret Makkar (https://github.com/devavret) - Bradley Dice (https://github.com/bdice) URL: #19359
…19359) Contributes to rapidsai#19308 This PR refactors gtests related to the hybrid scan reader (experimental parquet reader) to a separate executable. No new code is added by this PR except trivial style and CMake changes. Authors: - Muhammad Haseeb (https://github.com/mhaseeb123) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Devavret Makkar (https://github.com/devavret) - Bradley Dice (https://github.com/bdice) URL: rapidsai#19359
Contributes to PR #19308 This PR adds a parquet reader utility to update the null masks of output buffers in order to nullify rows corresponding to the pruned out pages. No independent tests for this PR here. Instead this is tested in #19308 Authors: - Muhammad Haseeb (https://github.com/mhaseeb123) Approvers: - Tianyu Liu (https://github.com/kingcrimsontianyu) - Bradley Dice (https://github.com/bdice) - Vukasin Milovanovic (https://github.com/vuule) URL: #19370
Description
Contributes to #17896. Completes #18011
This PR implements table materialization functions in the experimental parquet reader. The experimental reader now derives from the base parquet reader and only overloads the necessary functions reusing the base functions wherever possible.
Most of the functions reimplemented by the experimental reader are also mostly identical with the differences from the base reader mentioned in the comments
Checklist