[DO NOT MERGE] [POC] Metadata caching prototype in Parquet reader #18891
base: branch-25.06
Conversation
Signed-off-by: Jigao Luo <[email protected]>
My approach above is to make sure the footer is read only once during a call to read_parquet.

How to Use (Breaking API Changes)

Decouple metadata parsing from data reads:

```cpp
auto metadata = cudf::io::read_parquet_metadata(source_info);
auto aggregate_reader_metadata_ptr = metadata.get_aggregate_reader_metadata_ptr();  // new
auto options = cudf::io::parquet_reader_options::builder(source_info).build();
options.set_aggregate_reader_metadata(aggregate_reader_metadata_ptr);  // new
cudf::io::read_parquet(options);
```

You can find the example I provided in cpp/examples/parquet_io/parquet_io_metadata_caching.cpp.

Key Benefits

There are two concrete benefits, both demonstrated in the example code cpp/examples/parquet_io/parquet_io_metadata_caching.cpp. The Parquet file is the same one referenced in the issue, used here as a running example.

1. Bulk Read Optimization

A Bulk Read is a single read_parquet call that reads the whole file at once.
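To make the comparison concrete, here is a minimal timing sketch of such a bulk read with and without the cached footer. It assumes the get_aggregate_reader_metadata_ptr() / set_aggregate_reader_metadata() additions proposed in this PR; the file path and the timing scaffolding are placeholders, not the actual benchmark in the example file.

```cpp
#include <cudf/io/parquet.hpp>
#include <cudf/io/parquet_metadata.hpp>

#include <chrono>
#include <iostream>

int main()
{
  auto const source = cudf::io::source_info{"example.parquet"};  // placeholder path

  // Helper: wall-clock time (ms) of one bulk read built from the given options factory.
  auto time_read = [](auto&& make_options) {
    auto const t0 = std::chrono::steady_clock::now();
    auto opts     = make_options();
    cudf::io::read_parquet(opts);
    auto const t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
  };

  // Baseline: every read_parquet call thrift-decodes the footer itself.
  auto const cold_ms =
    time_read([&] { return cudf::io::parquet_reader_options::builder(source).build(); });

  // Cached: parse the footer once, then hand it to read_parquet (API proposed in this PR).
  auto metadata   = cudf::io::read_parquet_metadata(source);
  auto cached_ptr = metadata.get_aggregate_reader_metadata_ptr();
  auto const warm_ms = time_read([&] {
    auto opts = cudf::io::parquet_reader_options::builder(source).build();
    opts.set_aggregate_reader_metadata(cached_ptr);
    return opts;
  });

  std::cout << "bulk read without cached metadata: " << cold_ms << " ms\n"
            << "bulk read with cached metadata:    " << warm_ms << " ms\n";
  return 0;
}
```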
It makes sense to see that, once the metadata is cached, the total read time is reduced to the millisecond level. This also matches the nsys result I show in the issue.

2. Use case: Rowgroup Iteration

A use case that is not efficient in libcudf is iteratively reading a Parquet file at the rowgroup level: read one rowgroup, process it, and repeat. You will get the idea from the last point: without metadata caching, the accumulated overhead to pay is the footer thrift-decoding repeated on every rowgroup read (see the sketch below).
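A minimal sketch of that iteration pattern, again assuming the set_aggregate_reader_metadata() / get_aggregate_reader_metadata_ptr() APIs proposed in this PR; the source path and the processing step are placeholders.

```cpp
#include <cudf/io/parquet.hpp>
#include <cudf/io/parquet_metadata.hpp>

// Read a Parquet file one rowgroup at a time while reusing a footer parsed once up front.
void read_rowgroups_with_cached_metadata(cudf::io::source_info const& source)
{
  // Thrift-decode the footer a single time.
  auto metadata   = cudf::io::read_parquet_metadata(source);
  auto cached_ptr = metadata.get_aggregate_reader_metadata_ptr();  // proposed in this PR

  auto const num_row_groups = metadata.num_rowgroups();
  for (cudf::size_type rg = 0; rg < static_cast<cudf::size_type>(num_row_groups); ++rg) {
    auto opts = cudf::io::parquet_reader_options::builder(source).build();
    opts.set_row_groups({{rg}});                     // restrict this read to one rowgroup
    opts.set_aggregate_reader_metadata(cached_ptr);  // proposed in this PR: skip re-decoding

    auto result = cudf::io::read_parquet(opts);
    // ... process result.tbl here, then move on to the next rowgroup ...
  }
}
```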
With metadata caching, each rowgroup takes about 10 ms to read.

Summary

The short conclusion is: metadata caching significantly speeds up Parquet reading. With it, GPU kernels no longer have to wait for metadata thrift-decoding.

PR Discussion Items:
Some super-early comments: I think instead of caching/passing around an internal class (…)
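Since the concrete suggestion above is cut off, here is only a rough, hypothetical sketch of the general direction it points at: keeping the internal aggregate_reader_metadata behind an opaque public handle instead of exposing it. None of the names below (parquet_metadata_cache, the friend access) are actual cudf API.

```cpp
#include <memory>

namespace cudf::io {

namespace parquet::detail {
class aggregate_reader_metadata;  // internal class stays forward-declared in public headers
}  // namespace parquet::detail

class parquet_reader_options;  // the only consumer allowed to unwrap the handle

// Opaque, shareable handle owning a parsed footer (hypothetical name).
class parquet_metadata_cache {
 public:
  explicit parquet_metadata_cache(
    std::shared_ptr<parquet::detail::aggregate_reader_metadata> impl)
    : _impl{std::move(impl)}
  {
  }

 private:
  std::shared_ptr<parquet::detail::aggregate_reader_metadata> _impl;
  friend class parquet_reader_options;  // only the reader options can access the internals
};

}  // namespace cudf::io
```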
@mhaseeb123 Got it, thanks! I’ll tackle this after finalizing the story issue. Once addressed, I’ll request another review. Update:
Also, I found the … It is necessary to discuss how to structure dependencies for metadata caching, as this directly impacts API design. Here are some options:
Both Option 1 and Option 2 have trade-offs that require careful discussion.
Description
For issue #18890, this draft PR demonstrates the performance benefits of decoupling metadata parsing from data page reads in the Parquet reader. The goal is to:
read_parquet
Checklist