[DO NOT MERGE] [POC] Metadata caching prototype in Parquet reader #18891


Draft
wants to merge 1 commit into base: branch-25.06

Conversation

JigaoLuo
Contributor

Description

For issue #18890, this draft PR demonstrates the performance benefits of decoupling metadata parsing from data page reads in the Parquet reader. The goal is to:

  1. Decouple metadata reading from read_parquet
  2. Enable metadata-caching for repeated reads of the same file

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Jigao Luo <[email protected]>
@JigaoLuo JigaoLuo requested review from a team as code owners May 20, 2025 14:31
@JigaoLuo JigaoLuo requested review from bdice and mhaseeb123 May 20, 2025 14:31

copy-pr-bot bot commented May 20, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels May 20, 2025
@JigaoLuo JigaoLuo marked this pull request as draft May 20, 2025 14:32
@JigaoLuo
Contributor Author

JigaoLuo commented May 20, 2025

⚠️ This PR is for experimentation only and makes intrusive API changes. To set expectations: I essentially added getters and setters for rapid prototyping. Please ignore the code quality, but give it a try with a focus on performance.

My approach is to ensure the footer is read only once, during the call to cudf::io::read_parquet_metadata. All subsequent cudf::io::read_parquet calls for different row groups of the same file then skip footer reading entirely.

How to Use (Breaking API Changes)

Decouple metadata parsing from data reads:

// Parse the footer once and keep a handle to the parsed metadata.
auto metadata = cudf::io::read_parquet_metadata(source_info);
auto aggregate_reader_metadata_ptr = metadata.get_aggregate_reader_metadata_ptr(); // new API
// Hand the cached metadata to the reader options so read_parquet skips footer parsing.
auto options = cudf::io::parquet_reader_options::builder(source_info).build();
options.set_aggregate_reader_metadata(aggregate_reader_metadata_ptr); // new API
cudf::io::read_parquet(options);

A full example is provided in cpp/examples/parquet_io/parquet_io_metadata_caching.cpp.

Key Benefits

There are two concrete benefits, both demonstrated in the example code cpp/examples/parquet_io/parquet_io_metadata_caching.cpp. The running example uses the same Parquet file referenced in the issue.

1. Bulk Read Optimization

A bulk read is a single read_parquet call that reads the whole file at once. Here is a runtime comparison without and with metadata caching:

// Bulk read: read the file in one read call
Reading <file> ...
Elapsed Time: 8140ms

Reading <file> with metadata-caching...
Elapsed Time: 55ms

[screenshot]

As expected, once the metadata is cached, the total read time drops to the millisecond level. This also matches the nsys result shown in the issue (make_unique_ptr is the metadata reading time).

2. Use case: Rowgroup Iteration

A use case that is currently inefficient in libcudf is iterating over a Parquet file at the rowgroup level: read one rowgroup, process it, repeat. Following from the previous point, without metadata caching the accumulated overhead is MetadataReadTime × NumRowGroups.

Number of Parquet row-groups of the input file: 61

Iterating all rowgroups <file> without metadata-caching...
Elapsed Time: 494410ms

Iterating all rowgroups <file> with metadata-caching...
Elapsed Time: 516ms

With metadata caching, each rowgroup takes roughly 10ms to read:
[screenshot]


Summary

In short: metadata caching significantly speeds up repeated Parquet reads. With it, GPU kernels no longer have to wait for metadata thrift-decoding.

PR Discussion Items:

  • Libcudf API: Breakage via read_parquet modification vs. having a new reader
    • Current Limitation: Vector input in read_parquet is unsupported
  • Pylibcudf API: Depends on libcudf resolution
  • (Ignore the code quality)

@mhaseeb123
Member

mhaseeb123 commented May 20, 2025

Some super-early comments: instead of caching and passing around an internal class (aggregate_reader_metadata), you could cache the FileMetaData (in the public parquet.hpp). Looking closer, the main component of aggregate_reader_metadata is a vector of the metadata struct, which is just FileMetaData with a constructor. With this, you should still see identical performance improvements. See the work in #18011 for how this can be done.

@JigaoLuo
Contributor Author

JigaoLuo commented May 20, 2025

@mhaseeb123 Got it, thanks! I’ll tackle this after finalizing the story issue. Once addressed, I’ll request another review.
In the meantime, I can also wait for any discussions on API changes or new reader design.


Update:
I read through the hybrid-read code and understood the steps involved (forgive my verbosity):

I also found that FileMetaData lives in cpp/include/cudf/io/parquet_schema.hpp. I got your message and will address it.


It is necessary to discuss how to structure the dependencies for metadata caching, as this directly impacts the API design. Here are some options:

  1. Caching from read_parquet_metadata (this draft): Metadata is read via read_parquet_metadata(), which can optionally expose the cached metadata as shown above. The pointer is then passed to read_parquet() (or any new reader) via reader_options.
  2. Stepwise construction in the reader object (Hybrid Reader): The reader is built incrementally through sequential API calls that read different parts of the Parquet file.
  3. Transparent caching: Some CPU readers automatically cache metadata on the first read; subsequent reads then use the cached metadata automatically. (I do not like this option.)

Both Option 1 and Option 2 have trade-offs that require careful discussion.
(I chose Option 1 for rapid prototyping because it minimizes changes to read_parquet(), and I am not familiar with the codebase.)
