[DO NOT MERGE] [POC] Metadata caching prototype in Parquet reader #18891


Draft
wants to merge 1 commit into base: branch-25.06

Conversation

JigaoLuo
Contributor

Description

For issue #18890, this draft PR demonstrates the performance benefits of decoupling metadata parsing from data page reads in the Parquet reader. The goal is to:

  1. Decouple metadata reading from read_parquet
  2. Enable metadata-caching for repeated reads of the same file

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Jigao Luo <[email protected]>
@JigaoLuo JigaoLuo requested review from a team as code owners May 20, 2025 14:31
@JigaoLuo JigaoLuo requested review from bdice and mhaseeb123 May 20, 2025 14:31

copy-pr-bot bot commented May 20, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels May 20, 2025
@JigaoLuo JigaoLuo marked this pull request as draft May 20, 2025 14:32
@JigaoLuo
Contributor Author

JigaoLuo commented May 20, 2025

⚠️ This PR is for experimentation only and makes intrusive API changes. To set expectations: I essentially added getters and setters for rapid prototyping. Please ignore the code quality, but give it a try with a focus on performance.

My approach is to ensure the footer is read only once, during the call to cudf::io::read_parquet_metadata. All subsequent cudf::io::read_parquet calls for different row groups of the same file then skip footer reading entirely.

How to Use (Breaking API Changes)

Decouple metadata parsing from data reads:

// Parse the footer once and keep a handle to the parsed metadata.
auto metadata = cudf::io::read_parquet_metadata(source_info);
auto aggregate_reader_metadata_ptr = metadata.get_aggregate_reader_metadata_ptr(); // new API
// Hand the cached metadata to the reader options so read_parquet skips footer parsing.
auto options = cudf::io::parquet_reader_options::builder(source_info).build();
options.set_aggregate_reader_metadata(aggregate_reader_metadata_ptr); // new API
cudf::io::read_parquet(options);

A full example is provided in cpp/examples/parquet_io/parquet_io_metadata_caching.cpp.

Key Benefits

There are two concrete benefits, both demonstrated in the example code cpp/examples/parquet_io/parquet_io_metadata_caching.cpp. The running example uses the same Parquet file referenced in the issue.

1. Bulk Read Optimization

A bulk read is a single read_parquet call that reads the whole file at once. Here is a runtime comparison without and with metadata caching:

// Bulk read: read the file in one read call
Reading <file> ...
Elapsed Time: 8140ms

Reading <file> with metadata-caching...
Elapsed Time: 55ms

[screenshot]

As expected, once the metadata is cached, the total read time drops to the millisecond level. This also matches the nsys result shown in the issue (make_unique_ptr is the metadata reading time).

2. Use case: Rowgroup Iteration

A use case that is currently inefficient in libcudf is iterating over a Parquet file at the rowgroup level: read one rowgroup, process it, repeat. Following from the previous point, without metadata caching the accumulated overhead is MetadataReadTime × NumRowGroups.

Number of Parquet row-groups of the input file: 61

Iterating all rowgroups <file> without metadata-caching...
Elapsed Time: 494410ms

Iterating all rowgroups <file> with metadata-caching...
Elapsed Time: 516ms

With metadata caching, each rowgroup takes roughly 10ms to read:
[screenshot]


Summary

In short: metadata caching significantly speeds up repeated Parquet reads. With it, GPU kernels no longer have to wait for metadata thrift-decoding.

PR Discussion Items:

  • Libcudf API: Breakage via read_parquet modification vs. having a new reader
    • Current Limitation: Vector input in read_parquet is unsupported
  • Pylibcudf API: Depends on libcudf resolution
  • (Ignore the code quality)

@mhaseeb123
Member

mhaseeb123 commented May 20, 2025

Some super-early comments: instead of caching and passing around an internal class (aggregate_reader_metadata), you could cache the FileMetaData (in the public parquet.hpp). Looking closer, the main component of aggregate_reader_metadata is a vector of the metadata struct, which is just FileMetaData with a constructor. With this, you should still see identical performance improvements. See the work in #18011 for how this can be done.

@JigaoLuo
Contributor Author

JigaoLuo commented May 20, 2025

@mhaseeb123 Got it, thanks! I’ll tackle this after finalizing the story issue. Once addressed, I’ll request another review.
In the meantime, I can also wait for any discussions on API changes or new reader design.


Update:
I read through the hybrid-read code and understood the steps involved (forgive my verbosity):

I also found that FileMetaData lives in cpp/include/cudf/io/parquet_schema.hpp. I got your message and will address it.


It is necessary to discuss how to structure the dependencies for metadata caching, as this directly impacts the API design. Here are some options:

  1. Caching from read_parquet_metadata (this draft): Metadata is read via read_parquet_metadata(), which can optionally expose the cached metadata as shown above. The pointer is then passed to read_parquet() (or any new reader) via reader_options.
  2. Stepwise construction in the reader object (Hybrid Reader): The reader is built incrementally through sequential API calls that read different parts of the Parquet file.
  3. Transparent caching: Some CPU readers automatically cache metadata on the first read; subsequent reads then use the cached metadata automatically. (I do not like this option.)

Both Option 1 and Option 2 have trade-offs that require careful discussion.
(I chose Option 1 for rapid prototyping because it minimizes changes to read_parquet(), and I am not familiar with the codebase.)
