Skip to content

feat(cudf): Update hybrid scan usage and cuDF dependency tree#16226

Open
mhaseeb123 wants to merge 12 commits intofacebookincubator:mainfrom
mhaseeb123:fea/cudf-datasource-select-cols-by-index
Open

feat(cudf): Update hybrid scan usage and cuDF dependency tree#16226
mhaseeb123 wants to merge 12 commits intofacebookincubator:mainfrom
mhaseeb123:fea/cudf-datasource-select-cols-by-index

Conversation

@mhaseeb123
Copy link
Collaborator

@mhaseeb123 mhaseeb123 commented Feb 3, 2026

Description

Supersedes #1622, #16036, and #16160

This PR enables Velox to use updated hybrid scan APIs as well as do smarter IO when possible. This PR also updates cuDF and its dependencies to fetch the said hybrid scan APIs from cudf as well as to fix an issue that @simoneves observed with full debug builds of Velox with cuDF. The dependencies of cuDF (RMM, rapids-cmake, CCCL) have been updated to fix this issue. For details, see PRs linked in rapidsai/rapids-cmake#979.

@netlify
Copy link

netlify bot commented Feb 3, 2026

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 0a58c60
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/6982df98508cb10008a51014

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 3, 2026

bool CudfHiveConfig::useExperimentalCudfReader() const {
return config_->get<bool>(kUseExperimentalCudfReader, false);
bool CudfHiveConfig::useOldCudfReader() const {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hybrid scan will now be the default parquet reader in CudfHiveDataSource. The old reader is kept as backup (enabled by config) until we have thoroughly tested the new reader and will be removed altogether after that.

outputName);

auto* handle = static_cast<const hive::HiveColumnHandle*>(it->second.get());
readColumnSet_.emplace(handle->name());
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor improvement, use a companion unordered_set to check if a column name has already been inserted instead of std::find

stream->readFully(reinterpret_cast<char*>(dst), size);
}

referenceToNameConverter::referenceToNameConverter(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taken care of internally in libcudf. No need for this anymore

const size_t fileSize_;
};

// ---------------- Internal helper ----------------
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer needed, libcudf takes care of this now

len - ender->footer_len - ender_len, ender->footer_len);
}

std::vector<std::unique_ptr<cudf::io::datasource>>
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer needed. As you guessed it, libcudf has this util now

@mhaseeb123 mhaseeb123 added the cudf cudf related - GPU acceleration label Feb 3, 2026
@mhaseeb123 mhaseeb123 self-assigned this Feb 3, 2026
@mhaseeb123 mhaseeb123 changed the title feat(cudf): Update hybrid scan usage and update cuDF dependency tree feat(cudf): Update hybrid scan usage and cuDF dependency tree Feb 3, 2026
// Set column projection if needed
if (readColumnNames_.size()) {
readerOptions_.set_columns(readColumnNames_);
readerOptions_.set_column_names(readColumnNames_);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explicitly use either set_column_names or set_column_indices if want to select columns by their respective indices in the parquet file.

Copy link
Collaborator

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work. This will be great to have the new reader.

@bdice bdice added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label Feb 4, 2026
Copy link
Collaborator

@devavret devavret left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving because we've tested this.

@karthikeyann karthikeyann added ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall and removed ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall labels Feb 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. cudf cudf related - GPU acceleration ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants